allenai / cartography

Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics
Apache License 2.0
188 stars 63 forks

Work With Other Datasets #4

Open antmarakis opened 3 years ago

antmarakis commented 3 years ago

Hi! This looks like a very interesting tool, I am wondering if it would be easy to use on other datasets. I see only GLUE/NLI datasets are supported. Do you have any tips on how to use this on a simple {TEXT, LABEL} task? Thanks!

douglashiwo commented 2 years ago

I have the same question as antmarakis. Can you kindly help?

lukasmoldon commented 1 year ago

Just sharing my experience with this repo, maybe this helps someone in the future:

I used this repo for a {TEXT, LABEL} task with BERT models. Since neither this type of task nor this type of model is supported in the training section of this repo, I would recommend first training any model on any dataset on your own (without using the code of this repo). During training, save the logits of each data instance together with the gold-standard label and a unique identifier, as suggested by the authors (see the "Note:" section).
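A minimal sketch of such a logging helper, assuming the per-epoch jsonl layout described in the repo's "Note:" section (one `dynamics_epoch_<E>.jsonl` per epoch with `guid`, `logits_epoch_<E>`, and `gold` fields); the function name and directory layout here are illustrative, so double-check against the repo before relying on them:

```python
import json
import os


def log_training_dynamics(output_dir, epoch, guids, logits, golds):
    """Append one record per training instance to dynamics_epoch_<epoch>.jsonl.

    guids  : unique identifiers, one per instance
    logits : raw model outputs as lists of floats (do NOT apply softmax)
    golds  : gold-standard label indices
    """
    dynamics_dir = os.path.join(output_dir, "training_dynamics")
    os.makedirs(dynamics_dir, exist_ok=True)
    path = os.path.join(dynamics_dir, f"dynamics_epoch_{epoch}.jsonl")
    with open(path, "a") as f:
        for guid, logit, gold in zip(guids, logits, golds):
            record = {
                "guid": guid,                    # unique instance id
                f"logits_epoch_{epoch}": logit,  # logits at this epoch
                "gold": gold,                    # gold-standard label
            }
            f.write(json.dumps(record) + "\n")
```

You would call this once per epoch (or once per batch, since it appends) from your own training loop, e.g. from a Hugging Face Trainer callback.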

After training you can use train_dy_filtering, as explained here, to generate Data Maps and to obtain coordinates for further data filtering. You just need to extend this line of code with an additional name, which you then use as the task name. Then you can call python -m cartography.selection.train_dy_filtering --plot --task_name "YOUR_NEW_TASK_NAME" --model "ANY_NAME_YOU_WANT" --model_dir "" from the main directory of this repo to create the Data Map. Make sure to create a training_dynamics folder in the main directory containing your training dynamics. The plot will be automatically saved in the cartography folder, while the coordinates will be stored in the main directory (this can be changed via other arguments such as --plots_dir).
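Once you have the coordinates, a simple way to filter is by the paper's variability axis. A hypothetical sketch, assuming the output is a jsonl file with `guid`, `confidence`, and `variability` fields per instance (the file and field names are assumptions based on the paper's terminology, so verify against what train_dy_filtering actually writes for you):

```python
import json


def select_ambiguous(metrics_path, fraction=0.33):
    """Return guids of the top `fraction` highest-variability instances.

    High-variability ("ambiguous") instances are the region the Data Maps
    paper reports as most useful for training on smaller subsets.
    """
    with open(metrics_path) as f:
        records = [json.loads(line) for line in f]
    # Sort by variability, most variable first.
    records.sort(key=lambda r: r["variability"], reverse=True)
    k = int(len(records) * fraction)
    return [r["guid"] for r in records[:k]]
```

The returned guids can then be matched back against your original dataset to build the filtered training split.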

pskadasi commented 1 year ago

Fixing the errors in this repo is more time-consuming than extracting the training dynamics from a model trained independently :)