bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

chemdner kb implementation needs normalizations #683

Closed galtay closed 1 year ago

galtay commented 2 years ago

https://github.com/bigscience-workshop/biomedical/blob/master/bigbio/biodatasets/chemdner/chemdner.py
https://github.com/bigscience-workshop/biomedical/pull/326

the current implementation says it supports the text classification and named entity recognition tasks. the text classification task has MESH codes but the NER task does not. this issue is to investigate why the MESH codes are not available in the normalized field of the kb entity schema, and whether we can make this a named entity disambiguation task as well.

sg-wbi commented 2 years ago

the text classification task has MESH codes but the NER task does not.

This is because the MeSH codes are assigned as "document" (global) tags and are used for indexing purposes in PubMed. Unfortunately, no annotation is provided at the mention level, so the entities cannot be normalized. This is why it is a NER and TEXT_CLASSIFICATION dataset.
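
To make the distinction concrete, here is a minimal sketch using plain Python dicts. The field names are assumed to follow the bigbio kb and text schemas; the IDs, text, offsets, and MeSH code are made-up illustrative values, and `supports_disambiguation` is a hypothetical helper, not part of the bigbio library.

```python
# Illustrative records only: IDs, text, offsets, and the MeSH code are made up.

# kb-schema style entity: the mention has a span, but nothing to normalize to.
kb_entity = {
    "id": "doc1_T1",
    "type": "CHEMICAL",
    "text": ["aspirin"],
    "offsets": [[0, 7]],
    "normalized": [],  # empty: the source has no mention-level MeSH annotation
}

# text-schema style example: MeSH codes attach to the whole document.
text_example = {
    "id": "doc1",
    "document_id": "doc1",
    "text": "aspirin reduces fever",
    "labels": ["D001241"],  # document-level MeSH code (illustrative value)
}


def supports_disambiguation(entities):
    """Hypothetical check: True if any entity carries a normalized ID."""
    return any(entity["normalized"] for entity in entities)


# With only document-level codes, the entities cannot back a
# named entity disambiguation task.
print(supports_disambiguation([kb_entity]))
```

Copying the document-level labels into `normalized` would be a guess, since nothing in the source data says which code belongs to which mention.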