Describe the bug
For some entities, multiple normalizations have been concatenated into a single omnibus identifier. Affects ncbi_disease.
Additionally, some of the DB names are inconsistent across datasets (e.g. "MESH" vs "mesh", 'ncbigene' vs 'NCBI Gene', etc.). Affects ncbi_disease, gnormplus, medmentions_st21pv, and nlm_gene.
Steps to reproduce the bug
from bigbio.dataloader import BigBioConfigHelpers
conhelps = BigBioConfigHelpers()
dataset = 'ncbi_disease'
data = conhelps.for_config_name(f"{dataset}_bigbio_kb").load_dataset()
# Produce example of incorrectly concatenated entities
composite_normalizations = []
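for ent_list in data['train']['entities']:
    for x in ent_list:
        # Check every normalization, so entities with an empty list do not raise IndexError
        if any('|' in norm['db_id'] for norm in x['normalized']):
            composite_normalizations.append(x)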
print(composite_normalizations)
# Produce example of inconsistently formatted entity normalizations between MedMentions Full and MedMentions ST21PV
for dataset in ['medmentions_full', 'medmentions_st21pv']:
    data = conhelps.for_config_name(f"{dataset}_bigbio_kb").load_dataset()
    print(data['train']['entities'][0][0])
# Produce example of inconsistently formatted db_names between GNormPlus and NLM-Gene
# Note that both databases link to Entrez (also called NCBI Gene)
for dataset in ['gnormplus', 'nlm_gene']:
    data = conhelps.for_config_name(f"{dataset}_bigbio_kb").load_dataset()
    print(data['train']['entities'][1][0])
Expected results
Expected results when there are multiple normalizations for a single entity
normalized should return a list of normalizations where each normalization has exactly one database identifier.
For example, entity 10192393_D003110|D009369_5 should have normalization as follows:
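The expected shape can be sketched as follows. This is only an illustration, not the loader's actual code: the helper name `split_composite_normalizations` is hypothetical, and the input value `'D003110|D009369'` with `db_name` `'MESH'` is an assumption inferred from the entity ID and the naming discussion in this issue.

```python
from typing import Dict, List


def split_composite_normalizations(
    normalized: List[Dict[str, str]], sep: str = "|"
) -> List[Dict[str, str]]:
    """Expand every normalization whose db_id holds several IDs joined by `sep`."""
    expanded = []
    for norm in normalized:
        for db_id in norm["db_id"].split(sep):
            db_id = db_id.strip()
            if db_id:  # skip empty fragments such as a trailing separator
                expanded.append({"db_name": norm["db_name"], "db_id": db_id})
    return expanded


# Assumed composite value for entity 10192393_D003110|D009369_5
composite = [{"db_name": "MESH", "db_id": "D003110|D009369"}]
print(split_composite_normalizations(composite))
# [{'db_name': 'MESH', 'db_id': 'D003110'}, {'db_name': 'MESH', 'db_id': 'D009369'}]
```

Each resulting normalization carries exactly one database identifier, and normalizations that were already single-valued pass through unchanged.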
Expected results for different datasets linked to same ontology
For medmentions, we expect the normalizations to have identical format between medmentions_full and medmentions_st21pv
Expected normalization of first entity of MedMentions Full:
{'id': '1', 'type': 'T116', 'text': ['DCTN4'], 'offsets': [[0, 5]], 'normalized': [{'db_name': 'UMLS', 'db_id': 'C4308010'}]}
Expected normalization of first entity of MedMentions ST21PV (should be identical):
{'id': '1', 'type': 'T116', 'text': ['DCTN4'], 'offsets': [[0, 5]], 'normalized': [{'db_name': 'UMLS', 'db_id': 'C4308010'}]}
We additionally expect that GNormPlus and NLM-Gene will have the same database name since they link to the same ontology (NCBI Gene/Entrez)
Expected first entity of GNormPlus:
{'id': '16', 'type': 'Gene', 'text': ['SYT'], 'offsets': [[26, 29]], 'normalized': [{'db_name': 'NCBIGene', 'db_id': '6760'}]}
Expected first entity of NLM-Gene:
{'id': '18', 'type': 'GENERIF', 'text': ['Brat'], 'offsets': [[0, 4]], 'normalized': [{'db_name': 'NCBIGene', 'db_id': '35197'}]}
We expect this behavior to also be the same between other dataset pairs linked to the same db, e.g. NCBI-Disease and BC5CDR which both link to MeSH.
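One way to get a shared name per ontology is a small alias table applied when the normalizations are built. The sketch below is an assumption about how that could look, not existing bigbio code: `CANONICAL_DB_NAMES` and `canonical_db_name` are hypothetical names, and choosing 'MESH' (rather than 'mesh') as the canonical spelling is a guess. 'NCBIGene' and 'UMLS' are taken from the expected results above.

```python
# Hypothetical alias table: keys are lowercased variants seen in this issue,
# values are the assumed canonical db_name.
CANONICAL_DB_NAMES = {
    "mesh": "MESH",
    "umls": "UMLS",
    "ncbi": "NCBIGene",  # only safe for gene datasets such as GNormPlus
    "ncbigene": "NCBIGene",
    "ncbi gene": "NCBIGene",
    "ncbi gene identifier": "NCBIGene",
}


def canonical_db_name(db_name: str) -> str:
    """Return the canonical name for a known variant, else the name unchanged."""
    key = " ".join(db_name.lower().split())  # lowercase and collapse whitespace
    return CANONICAL_DB_NAMES.get(key, db_name)


for name in ["NCBI", "NCBI Gene identifier", "mesh", "MESH"]:
    print(name, "->", canonical_db_name(name))
```

Unknown names are returned untouched, so the mapping can be extended dataset by dataset without silently renaming anything it does not recognize.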
Actual results
Multiple normalizations in NCBI-Disease
Some entities have a single database identifier that is formed by concatenating multiple db_ids with '|'.
For entity 10192393_D003110|D009369_5 above, the actual normalization is:
Note that db_name also differs between the expected and actual results. The expected value reflects the fix for the inconsistent naming problem described next.
Inconsistent DB naming
In MedMentions ST21PV, an extra UMLS: is prepended to every db_id. This means that the first entity normalization in MedMentions ST21PV is:
{'id': '1', 'type': 'T116', 'text': ['DCTN4'], 'offsets': [[0, 5]], 'normalized': [{'db_name': 'UMLS', 'db_id': 'UMLS:C4308010'}]}
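A minimal sketch of removing that redundant prefix, assuming the prefix is always the `db_name` followed by a colon. The helper `strip_db_prefix` is a hypothetical name used for illustration only.

```python
from typing import Dict


def strip_db_prefix(norm: Dict[str, str]) -> Dict[str, str]:
    """Drop a leading '<db_name>:' from db_id when present (case-insensitive)."""
    db_name, db_id = norm["db_name"], norm["db_id"]
    prefix = f"{db_name}:"
    if db_id.lower().startswith(prefix.lower()):
        db_id = db_id[len(prefix):]
    return {"db_name": db_name, "db_id": db_id}


print(strip_db_prefix({"db_name": "UMLS", "db_id": "UMLS:C4308010"}))
# {'db_name': 'UMLS', 'db_id': 'C4308010'}
```

Identifiers without the prefix are left as they are, so applying the helper to MedMentions Full would not change its output.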
The name of the NCBI Gene database is inconsistent between GNormPlus and NLM-Gene.
First entity of GNormPlus:
{'id': '16', 'type': 'Gene', 'text': ['SYT'], 'offsets': [[26, 29]], 'normalized': [{'db_name': 'NCBI', 'db_id': '6760'}]}
First entity of NLM-Gene:
{'id': '18', 'type': 'GENERIF', 'text': ['Brat'], 'offsets': [[0, 4]], 'normalized': [{'db_name': 'NCBI Gene identifier', 'db_id': '35197'}]}
A similar problem is observed between BC5CDR ('db_name': 'MESH') and NCBI-Disease ('db_name': 'mesh').
The concatenation problem in NCBI-Disease also affects entity 7790377_OMIM:202370|OMIM:214100_2, which should likewise be split into one normalization per identifier.
Environment info
datasets version: 2.1.0