Closed mariosaenger closed 1 year ago
@davidkartchner can you take a look here and comment on the database names? is there a way we can make these database names consistent with the ones you are using?
@mariosaenger is this PR sensitive to the exact db name? i.e. would it break anything if we use MESH instead of "ChemicalEntity": "Medical Subject Headings (MESH)"?
@galtay @mariosaenger In order to be consistent with other datasets, I would use the following mapping from type to database:
TYPE_TO_DATABASE = {
    "CellLine": "Cellosaurus",
    "ChemicalEntity": "MESH",
    "DiseaseOrPhenotypicFeature": "MESH",  # or "OMIM", depending on the identifier
    "GeneOrGeneProduct": "NCBIGene",
    "OrganismTaxon": "NCBITaxon",
    "SequenceVariant": "dbSNP",  # or "custom", depending on the identifier
}
For cases where an entity type can be linked to multiple databases, it is especially important to specify correctly which database each identifier comes from, so that an entity normalization model can be trained effectively later on. For DiseaseOrPhenotypicFeature or SequenceVariant, you can probably determine which database an identifier links to with some basic checks on the string format (e.g. all OMIM normalizations have "OMIM" prepended to their identifier). An example can be found at https://huggingface.co/datasets/bigbio/ncbi_disease/blob/main/ncbi_disease.py#L233
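A minimal sketch of the string-format checks described above, not the code from this PR or from the linked ncbi_disease loader. The function name `resolve_database` and the helper names are hypothetical. The OMIM prefix rule comes from the comment above; the assumption that dbSNP identifiers look like "rs" followed by digits, and that everything else counts as "custom", is mine and should be checked against the actual BioRed annotations.

```python
import re

# Entity types that always link to a single database (taken from the mapping above).
SINGLE_DATABASE = {
    "CellLine": "Cellosaurus",
    "ChemicalEntity": "MESH",
    "GeneOrGeneProduct": "NCBIGene",
    "OrganismTaxon": "NCBITaxon",
}

# Assumption: dbSNP identifiers have the form "rs" + digits, e.g. "rs12345".
DBSNP_PATTERN = re.compile(r"^rs\d+$", re.IGNORECASE)


def resolve_database(entity_type: str, identifier: str) -> str:
    """Return the database name for one (entity type, identifier) pair."""
    identifier = identifier.strip()
    if entity_type == "DiseaseOrPhenotypicFeature":
        # OMIM identifiers carry an "OMIM" prefix; anything else is treated as MESH.
        return "OMIM" if identifier.upper().startswith("OMIM") else "MESH"
    if entity_type == "SequenceVariant":
        # Anything that does not look like an rs number falls back to "custom".
        return "dbSNP" if DBSNP_PATTERN.match(identifier) else "custom"
    # Raises KeyError for an unknown entity type, which surfaces bad input early.
    return SINGLE_DATABASE[entity_type]
```

Resolving the name per identifier rather than per entity type keeps the ambiguous types from being collapsed into a single database.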
@galtay @davidkartchner thanks for the feedback. I revised the implementation to adhere to the database naming scheme.
In a future PR, we could possibly also consider standardising the naming scheme of all NEN data sets via constants.
thanks @mariosaenger. yes, I agree, it would be nice to have a check for standardized database names in the unit tests.
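One possible shape for such a check, as a rough sketch only: the constant `ALLOWED_DATABASES` and the function `find_nonstandard_db_names` are hypothetical names, not part of the existing unit tests, and the allowed set only covers the names mentioned in this thread.

```python
# Hypothetical shared constant listing the standardized database names.
ALLOWED_DATABASES = frozenset(
    {"Cellosaurus", "MESH", "OMIM", "NCBIGene", "NCBITaxon", "dbSNP", "custom"}
)


def find_nonstandard_db_names(db_names):
    """Return the sorted, de-duplicated names that are not in the allowed set."""
    return sorted(set(db_names) - ALLOWED_DATABASES)


# A unit test could collect every db_name a data set emits and assert
# that this function returns an empty list.
observed = ["MESH", "Medical Subject Headings (MESH)", "OMIM", "MESH"]
print(find_nonstandard_db_names(observed))
```

Returning the offending names, rather than a plain boolean, makes a failing test message point directly at what needs renaming.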
This PR improves the implementation of the BioRed corpus: