allenai / s2_fos

Apache License 2.0

Question about dataset used for training the model #16

Closed by narayanacharya6 1 year ago

narayanacharya6 commented 1 year ago

I noticed that the model published with the repo outputs the same labels as those in the fos subset of the published SciRepEval dataset. Can someone comment on whether the model was trained using some version (same/subset/superset?) of this dataset?

Another issue that asked about the training data for the model was closed. I am asking only out of curiosity, so feel free to close this issue too if the training data or the details of how it was curated cannot be made public yet.

sergeyf commented 1 year ago

Hi there.

Yes, when we made SciRepEval, we used the FoS training data in the train split and the gold manual eval in the eval split. Due to some internal chaos at the time of training the original FoS model (and the lack of a publication for it), I am 95% sure, but not 100%, that the training data is the same.

We are working on a better transformer-based FoS model, and its training set will be silver data labeled by various GPTs. Hopefully that will be released later this year.

narayanacharya6 commented 1 year ago

Thanks for clarifying! Looking forward to the new models and dataset :)