allenai / cartography

Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics
Apache License 2.0

dump the training dynamics to json #1

Closed (terarachang closed 3 years ago)

terarachang commented 3 years ago

Hi,

Thanks for the nice work!

I have a question about L149: https://github.com/allenai/cartography/blob/c7865383e421a91611c2f4e79d1ffbfb7850f4f4/cartography/selection/train_dy_filtering.py#L149

I don't understand why you are enumerating over correctness_. I may be misunderstanding something, but I think you should iterate over all the guids instead. Otherwise you cannot dump the statistics for the entire training set, since guid in this loop can take only 1 + Epoch possible values.

  df = pd.DataFrame([[guid,
                      i,
                      threshold_closeness_[guid],
                      confidence_[guid],
                      variability_[guid],
                      correctness_[guid],
                      forgetfulness_[guid],
                      ] for i, guid in enumerate(correctness_)], columns=column_names)

Thank you!

swabhs commented 3 years ago

correctness_ is a dictionary mapping each guid to its correctness metric values (a list of size 1 + Epoch). Iterating over a dictionary yields its keys, not its values, so in effect this loop does iterate over all the guids. Hope this clears up the confusion!
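
To make the point concrete, here is a minimal sketch of the same comprehension on toy data. The guids and numbers are made up for illustration, and only two of the metric dictionaries are included; the shape of correctness_ (one list per guid) follows the description above rather than the repository's actual data.

```python
# Hypothetical toy data: three training examples, each with a list of
# per-epoch values (made-up guids and numbers, for illustration only).
correctness_ = {
    "guid-a": [1, 1, 1],
    "guid-b": [0, 1, 1],
    "guid-c": [0, 0, 0],
}
confidence_ = {"guid-a": 0.9, "guid-b": 0.5, "guid-c": 0.1}

# Iterating a dict yields its keys, so enumerate() produces (index, guid)
# pairs: one per training example, regardless of how long each value list is.
rows = [[guid, i, confidence_[guid], correctness_[guid]]
        for i, guid in enumerate(correctness_)]

visited_guids = [row[0] for row in rows]
print(visited_guids)
```

The loop produces one row per key, so the number of rows equals the number of guids, not the length of any per-guid list.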