bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Revise implementation of BioRED #853

Closed mariosaenger closed 1 year ago

mariosaenger commented 1 year ago

This PR improves the implementation of the BioRED corpus:

galtay commented 1 year ago

@davidkartchner can you take a look here and comment on the database names? Is there a way we can make these database names consistent with the ones you are using?

@mariosaenger is this PR sensitive to the exact db name? i.e. would it break anything if we used "MESH" instead of "Medical Subject Headings (MESH)" as the database for "ChemicalEntity"?

davidkartchner commented 1 year ago

@galtay @mariosaenger In order to be consistent with other datasets, I would use the following mapping from type to database:

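TYPE_TO_DATABASE = {
    "CellLine": "Cellosaurus",
    "ChemicalEntity": "MESH",
    # Two candidate databases: choose one per identifier.
    # (A literal `"MESH" or "OMIM"` would always evaluate to "MESH" in Python.)
    "DiseaseOrPhenotypicFeature": ("MESH", "OMIM"),
    "GeneOrGeneProduct": "NCBIGene",
    "OrganismTaxon": "NCBITaxon",
    # Two candidate databases: choose one per identifier.
    "SequenceVariant": ("dbSNP", "custom"),
}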

For cases where an entity type can be linked to multiple databases, it is especially important to specify correctly which database each identifier comes from, so that an entity normalization model can be trained effectively later on. For DiseaseOrPhenotypicFeature or SequenceVariant, you can probably determine which database an identifier links to with some basic checks on the string format (e.g. all OMIM normalizations have "OMIM" prepended to their identifier). An example can be found at https://huggingface.co/datasets/bigbio/ncbi_disease/blob/main/ncbi_disease.py#L233
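
The string-format checks described above could look roughly like the following. This is only a sketch, not the code from the linked ncbi_disease.py or from this PR: the function name `resolve_database` and the helper names are hypothetical, the OMIM check relies on the "OMIM" prefix mentioned above, and treating identifiers of the form "rs" plus digits as dbSNP (with everything else falling back to "custom") is an assumption about how the variant identifiers are written.

```python
import re

# Entity types that map to exactly one database (names follow the mapping above).
SINGLE_DATABASE = {
    "CellLine": "Cellosaurus",
    "ChemicalEntity": "MESH",
    "GeneOrGeneProduct": "NCBIGene",
    "OrganismTaxon": "NCBITaxon",
}

# Assumption: dbSNP identifiers are reference SNP ids such as "rs2234671".
_RS_ID = re.compile(r"^rs\d+$", re.IGNORECASE)


def resolve_database(entity_type: str, identifier: str) -> str:
    """Return the database name for one identifier (hypothetical helper)."""
    identifier = identifier.strip()
    if entity_type == "DiseaseOrPhenotypicFeature":
        # OMIM identifiers are assumed to carry an "OMIM" prefix; the rest is MESH.
        return "OMIM" if identifier.upper().startswith("OMIM") else "MESH"
    if entity_type == "SequenceVariant":
        # Anything that is not an rs-number is treated as a custom variant id.
        return "dbSNP" if _RS_ID.match(identifier) else "custom"
    # Raises KeyError for unknown entity types instead of guessing.
    return SINGLE_DATABASE[entity_type]
```

Resolving the database per identifier, rather than per entity type, keeps mixed types such as DiseaseOrPhenotypicFeature from being labelled with a single database.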

mariosaenger commented 1 year ago

@galtay @davidkartchner thanks for the feedback. I revised the implementation to adhere to the database naming scheme.

In a future PR, we could possibly also consider standardising the naming scheme of all NEN data sets via constants.

galtay commented 1 year ago

Thanks @mariosaenger. Yes, I agree; it would be nice to have a check for standardized database names in the unit tests.
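
One possible shape for such a check, as a rough sketch only: the constant `CANONICAL_DB_NAMES` and the function `find_nonstandard_db_names` are hypothetical names, not part of the repository, and the set below contains only the database names mentioned in this thread rather than a complete list.

```python
# Hypothetical constants: canonical database names taken from this discussion.
CANONICAL_DB_NAMES = frozenset(
    {"Cellosaurus", "MESH", "OMIM", "NCBIGene", "NCBITaxon", "dbSNP", "custom"}
)


def find_nonstandard_db_names(db_names):
    """Return the sorted, de-duplicated names that are not canonical."""
    return sorted(set(db_names) - CANONICAL_DB_NAMES)


def check_db_names(db_names):
    """Raise AssertionError listing any non-canonical names (usable in a unit test)."""
    unexpected = find_nonstandard_db_names(db_names)
    assert not unexpected, f"Non-standard database names: {unexpected}"
```

A unit test could collect the database names emitted by a dataset loader and pass them to `check_db_names`, so that a name such as "Medical Subject Headings (MESH)" would fail the test.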