Revise implementation of BioRED

mariosaenger commented 1 year ago

This PR improves the implementation of the BioRED corpus:

In the previous implementation, a separate entity was created for each combination of entity mention and database identifier. This is fixed: each entity mention is now a single entity that can carry multiple database ids.
Furthermore, the name of the database an entity is linked to was added.
BioRED only provides abstract-level annotations for entity-linked relation pairs rather than materializing links between all surface-form mentions of a relation. Analogous to BC5CDR, we enumerate all mention pairs for the entities in each relation triple.
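To make the third point concrete, here is a minimal sketch of how one abstract-level relation could be expanded into mention-level pairs. It is not the code from this PR: the function name `enumerate_mention_pairs`, the dictionary keys, and the placeholder identifiers are all assumptions chosen for illustration.

```python
from itertools import product


def enumerate_mention_pairs(mentions, relation):
    """Expand one abstract-level relation into all mention-level pairs.

    Hypothetical helper: `mentions` is a list of dicts with an "id" and a
    list of "db_ids"; `relation` links two database identifiers.
    """
    head_id, tail_id = relation["head_id"], relation["tail_id"]
    # A mention takes part if any of its database ids matches the argument.
    heads = [m for m in mentions if head_id in m["db_ids"]]
    tails = [m for m in mentions if tail_id in m["db_ids"]]
    return [
        {"type": relation["type"], "arg1_id": h["id"], "arg2_id": t["id"]}
        for h, t in product(heads, tails)
        if h["id"] != t["id"]  # skip a mention paired with itself
    ]


# Placeholder data, not taken from the corpus.
mentions = [
    {"id": "T1", "db_ids": ["chem-1"]},
    {"id": "T2", "db_ids": ["chem-1"]},
    {"id": "T3", "db_ids": ["dis-1"]},
]
relation = {"type": "Association", "head_id": "chem-1", "tail_id": "dis-1"}
pairs = enumerate_mention_pairs(mentions, relation)
```

With two mentions of the head entity and one of the tail entity, this yields two mention-level relations (T1 to T3 and T2 to T3).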

galtay commented 1 year ago

@davidkartchner can you take a look here and comment on the database names? is there a way we can make these database names consistent with the ones you are using?

@mariosaenger is this PR sensitive to the exact db name? i.e. would it break anything if we use "ChemicalEntity": "MESH" instead of "ChemicalEntity": "Medical Subject Headings (MESH)"?

davidkartchner commented 1 year ago

@galtay @mariosaenger In order to be consistent with other datasets, I would use the following mapping from type to database:

TYPE_TO_DATABASE = {
    "CellLine": "Cellosaurus",
    "ChemicalEntity": "MESH",
    "DiseaseOrPhenotypicFeature": "MESH",  # or "OMIM", depending on the identifier
    "GeneOrGeneProduct": "NCBIGene",
    "OrganismTaxon": "NCBITaxon",
    "SequenceVariant": "dbSNP",  # or "custom", depending on the identifier
}

For cases where an entity type can be linked to multiple databases, it is especially important to correctly specify which database the identifier comes from, so that an entity normalization model can be trained effectively later on. For DiseaseOrPhenotypicFeature or SequenceVariant, you can probably determine which database an identifier links to with some basic checks on the string format (e.g. all OMIM normalizations have "OMIM" prepended to their identifier). An example can be found at https://huggingface.co/datasets/bigbio/ncbi_disease/blob/main/ncbi_disease.py#L233
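A rough sketch of such string-format checks might look like the following. The "OMIM" prefix check comes from the comment above. The `rs` prefix check for dbSNP is an assumption about how the variant identifiers are written in this corpus, and the names `infer_database` and `SINGLE_DATABASE` are hypothetical.

```python
import re

# Entity types that link to exactly one database.
SINGLE_DATABASE = {
    "CellLine": "Cellosaurus",
    "ChemicalEntity": "MESH",
    "GeneOrGeneProduct": "NCBIGene",
    "OrganismTaxon": "NCBITaxon",
}


def infer_database(entity_type, identifier):
    """Guess the database name from the entity type and identifier format."""
    if entity_type == "DiseaseOrPhenotypicFeature":
        # OMIM identifiers carry an "OMIM" prefix; everything else is MESH.
        return "OMIM" if identifier.upper().startswith("OMIM") else "MESH"
    if entity_type == "SequenceVariant":
        # Assumption: dbSNP identifiers look like "rs" followed by digits.
        is_rs_id = re.fullmatch(r"rs\d+", identifier, flags=re.IGNORECASE)
        return "dbSNP" if is_rs_id else "custom"
    return SINGLE_DATABASE[entity_type]
```

For example, `infer_database("DiseaseOrPhenotypicFeature", "OMIM:600807")` returns `"OMIM"`, while the same call with a plain identifier falls back to `"MESH"`. An unknown entity type raises a `KeyError`, which seems preferable to silently guessing a database.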

mariosaenger commented 1 year ago

@galtay @davidkartchner thanks for the feedback. I revised the implementation to adhere to the database naming scheme.

In a future PR, we could possibly also consider standardising the naming scheme of all NEN data sets via constants.

galtay commented 1 year ago

thanks @mariosaenger. yes I agree, it would be nice to have a check for standardized database names in the unit tests.
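One possible shape for such a check, as a sketch only: a shared set of allowed names plus a helper that a unit test could call. The names `KNOWN_DB_NAMES` and `find_unknown_db_names` are hypothetical, the set only contains the names mentioned in this thread, and the layout of the `normalized` field (a list of dicts with `db_name` and `db_id`) is an assumption about the entity schema.

```python
# Hypothetical shared constant; only the names from this thread are listed.
KNOWN_DB_NAMES = frozenset(
    {"Cellosaurus", "MESH", "OMIM", "NCBIGene", "NCBITaxon", "dbSNP", "custom"}
)


def find_unknown_db_names(entities):
    """Return the sorted database names that are not in KNOWN_DB_NAMES."""
    used = {
        norm["db_name"]
        for entity in entities
        for norm in entity["normalized"]
    }
    return sorted(used - KNOWN_DB_NAMES)


# Placeholder entities for illustration.
entities = [
    {"id": "T1", "normalized": [{"db_name": "MESH", "db_id": "x"}]},
    {"id": "T2", "normalized": [{"db_name": "Medical Subject Headings (MESH)", "db_id": "y"}]},
]
unknown = find_unknown_db_names(entities)
```

A unit test could then assert that the returned list is empty for every example of a data set.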

bigscience-workshop / biomedical

Revise implementation of BioRED #853