bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Closes #919 (Flambe) #920

raissinging closed 1 month ago

raissinging commented 1 month ago

closes #919


raissinging commented 1 month ago

Hi, thanks for contributing this to BigBio!

It seems that the number of documents reported by the unit tests does not match the number reported in the source paper. Is this expected?

train
==========
id: 38
document_id: 38
text: 38
labels: 289394

test
==========
id: 11
document_id: 11
text: 11
labels: 79946

validation
==========
id: 6
document_id: 6
text: 6
labels: 50609

Hi! Thank you for letting me know. I originally had only the 55 full-text papers using a bigbio schema, but I just added the 1,195 paper abstracts we have as well! Sorry about that!
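
For reference, the split sizes in the unit-test output above add up to the 55 full-text papers mentioned here. Below is a minimal sketch of that check. The per-split counts are copied from the test output, the variable names are hypothetical, and the combined figure assumes the abstracts are counted alongside the full-text papers, which this thread does not confirm.

```python
# Document counts per split, copied from the unit-test output above.
split_counts = {"train": 38, "test": 11, "validation": 6}

# Figures quoted in the reply; not verified against the source paper here.
full_text_papers = 55
abstracts = 1195

# Sum the splits and compare against the full-text paper count.
total_documents = sum(split_counts.values())
matches_full_text_only = total_documents == full_text_papers

print(f"documents across splits: {total_documents}")
print(f"matches full-text paper count: {matches_full_text_only}")

# Assumption: abstracts and full-text papers are counted together.
print(f"full-text papers plus abstracts: {full_text_papers + abstracts}")
```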