bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Add implementation of ChemDisGene data set #918

Closed mariosaenger closed 1 month ago

mariosaenger commented 1 month ago

Closes #917

mariosaenger commented 1 month ago

Thanks for checking the implementation. There are several aspects to keep in mind. First, the dataset consists of a curated and a non-curated part; this implementation only covers the former. Second, the dataset annotates relations only at the abstract level (using knowledge base identifiers). Following default practice in BigBio, I unrolled the document-level relations to mention level. Note, however, that the document-level annotations are available in the source schema. These aspects complicate a direct comparison of the numbers :-/
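To illustrate why the counts diverge, here is a minimal sketch of what unrolling document-level relations to mention level can look like. This is not the code from the PR: the function name `unroll_relations`, the dictionary keys, and the identifiers in the example are all hypothetical, and it assumes each mention is normalized to exactly one knowledge base identifier.

```python
from itertools import product


def unroll_relations(mentions, doc_relations):
    """Expand document-level relations into mention-level relations.

    Hypothetical input layout (not the actual BigBio schema):
      mentions:      [{"id": str, "kb_id": str}, ...]
      doc_relations: [{"type": str, "head_kb_id": str, "tail_kb_id": str}, ...]
    """
    # Group mention ids by the knowledge base identifier they refer to.
    mentions_by_kb_id = {}
    for mention in mentions:
        mentions_by_kb_id.setdefault(mention["kb_id"], []).append(mention["id"])

    unrolled = []
    for relation in doc_relations:
        heads = mentions_by_kb_id.get(relation["head_kb_id"], [])
        tails = mentions_by_kb_id.get(relation["tail_kb_id"], [])
        # One mention-level relation per (head mention, tail mention) pair.
        for head_id, tail_id in product(heads, tails):
            unrolled.append(
                {"type": relation["type"], "arg1_id": head_id, "arg2_id": tail_id}
            )
    return unrolled


# Made-up example: one chemical mentioned twice, one gene mentioned three times.
mentions = [
    {"id": "m1", "kb_id": "CHEM:1"},
    {"id": "m2", "kb_id": "CHEM:1"},
    {"id": "m3", "kb_id": "GENE:1"},
    {"id": "m4", "kb_id": "GENE:1"},
    {"id": "m5", "kb_id": "GENE:1"},
]
doc_relations = [
    {"type": "example_relation", "head_kb_id": "CHEM:1", "tail_kb_id": "GENE:1"}
]

unrolled = unroll_relations(mentions, doc_relations)
print(len(doc_relations), "document-level ->", len(unrolled), "mention-level")
```

In this sketch a single document-level relation becomes six mention-level relations (2 head mentions × 3 tail mentions), which is why relation counts in the unrolled schema are not directly comparable to document-level figures.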

leonweber commented 1 month ago

Ah, thanks for pointing this out. Then let's merge this : )