Inconsistencies in entity normalization for NLM Gene and NCBI Disease datasets

Describe the bug

For some entities, multiple normalizations have been concatenated together into a single omnibus normalization. Affects ncbi_disease.
Additionally, some of the DB names are inconsistent across datasets (e.g. "MESH" vs "mesh", 'ncbigene' vs 'NCBI Gene', etc.). Affects ncbi_disease, gnormplus, medmentions_st21pv, and nlm_gene.

Steps to reproduce the bug

from bigbio.dataloader import BigBioConfigHelpers
conhelps = BigBioConfigHelpers()
dataset = 'ncbi_disease'
data = conhelps.for_config_name(f"{dataset}_bigbio_kb").load_dataset()

# Produce example of incorrectly concatenated entities
composite_normalizations = []
for ent_list in data['train']['entities']:
    for x in ent_list:
        # Check every normalization; entities without any normalization are skipped safely
        if any('|' in n['db_id'] for n in x['normalized']):
            composite_normalizations.append(x)

print(composite_normalizations)

# Produce example of inconsistently formatted entity normalizations between MedMentions Full and MedMentions ST21PV
for dataset in ['medmentions_full','medmentions_st21pv']:
    data = conhelps.for_config_name(f"{dataset}_bigbio_kb").load_dataset()
    print(data['train']['entities'][0][0])

# Produce example of inconsistently formatted db_names between GNormPlus and NLM-Gene
# Note that both datasets link to Entrez Gene (also called NCBI Gene)
for dataset in ['gnormplus', 'nlm_gene']:
    data = conhelps.for_config_name(f"{dataset}_bigbio_kb").load_dataset()
    print(data['train']['entities'][1][0])

Expected results

Expected results when there are multiple normalizations for a single entity

normalized should return a list of normalizations where each normalization has exactly one database identifier.

For example, entity 10192393_D003110|D009369_5 should have normalization as follows:

{'id': '10192393_D003110|D009369_5', ...,
  'normalized': [{'db_name': 'MESH', 'db_id': 'D003110'}, {'db_name': 'MESH', 'db_id': 'D009369'}]}

Similarly, entity 7790377_OMIM:202370|OMIM:214100_2 should appear as follows:

{'id': '7790377_OMIM:202370|OMIM:214100_2',
  'normalized': [{'db_name': 'OMIM', 'db_id': '202370'}, {'db_name': 'OMIM', 'db_id': '214100'}]}
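
The splitting described above could be sketched roughly as follows. This is only an illustration, not the loader's actual code: the helper name split_normalized is hypothetical, and it assumes that '|' is the only separator, that a 'PREFIX:' inside a db_id names the database, and that upper-casing db_name is the desired convention.

```python
def split_normalized(normalized):
    """Return one normalization per identifier (hypothetical helper)."""
    result = []
    for norm in normalized:
        # Assumption: '|' is the only separator used for concatenated IDs.
        for raw_id in norm['db_id'].split('|'):
            raw_id = raw_id.strip()
            if not raw_id:
                continue
            if ':' in raw_id:
                # An ID such as 'OMIM:202370' carries its own database prefix.
                prefix, _, bare_id = raw_id.partition(':')
                result.append({'db_name': prefix.upper(), 'db_id': bare_id})
            else:
                # Assumption: upper-case db_name is the target convention.
                result.append({'db_name': norm['db_name'].upper(), 'db_id': raw_id})
    return result


mesh = split_normalized([{'db_name': 'mesh', 'db_id': 'D003110|D009369'}])
omim = split_normalized([{'db_name': 'omim', 'db_id': 'OMIM:202370|OMIM:214100'}])
```

With the two entities quoted above as input, mesh and omim hold the two-element lists shown in the expected results.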

Expected results for different datasets linked to same ontology

For MedMentions, we expect the normalizations to have identical format between medmentions_full and medmentions_st21pv.

Expected normalization of first entity of MedMentions Full:

{'id': '1', 'type': 'T116', 'text': ['DCTN4'], 'offsets': [[0, 5]], 'normalized': [{'db_name': 'UMLS', 'db_id': 'C4308010'}]}

Expected normalization of first entity of MedMentions ST21PV (should be identical):

{'id': '1', 'type': 'T116', 'text': ['DCTN4'], 'offsets': [[0, 5]], 'normalized': [{'db_name': 'UMLS', 'db_id': 'C4308010'}]}

We additionally expect that GNormPlus and NLM-Gene will have the same database name, since they link to the same ontology (NCBI Gene/Entrez).

Expected first entity of GNormPlus:

{'id': '16', 'type': 'Gene', 'text': ['SYT'], 'offsets': [[26, 29]], 'normalized': [{'db_name': 'NCBIGene', 'db_id': '6760'}]}

Expected first entity of NLM-Gene:

{'id': '18', 'type': 'GENERIF', 'text': ['Brat'], 'offsets': [[0, 4]], 'normalized': [{'db_name': 'NCBIGene', 'db_id': '35197'}]}

We expect this behavior to also be the same between other dataset pairs linked to the same db, e.g. NCBI-Disease and BC5CDR which both link to MeSH.
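
One way to get a shared name per database is a small lookup table applied in every loader. The sketch below is an assumption about how that might look, not existing bigbio code: CANONICAL_DB_NAMES and canonical_db_name are hypothetical names, and mapping the bare 'NCBI' to 'NCBIGene' is only appropriate for gene datasets such as GNormPlus.

```python
# Hypothetical lookup table; keys are lower-cased variants seen in this report.
CANONICAL_DB_NAMES = {
    'mesh': 'MESH',
    'omim': 'OMIM',
    'umls': 'UMLS',
    'ncbi': 'NCBIGene',  # assumption: only valid for gene datasets
    'ncbigene': 'NCBIGene',
    'ncbi gene': 'NCBIGene',
    'ncbi gene identifier': 'NCBIGene',
}


def canonical_db_name(db_name):
    """Map a db_name variant to its canonical spelling (hypothetical helper)."""
    key = db_name.strip().lower()
    # Unknown names are returned unchanged rather than guessed.
    return CANONICAL_DB_NAMES.get(key, db_name)


names = [canonical_db_name(n) for n in ('mesh', 'NCBI', 'NCBI Gene identifier')]
```

Returning unknown names unchanged keeps the helper safe to apply to datasets that link to databases not listed in the table.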

Actual results

Multiple normalizations in NCBI-Disease

Some entities have a single database identifier that is formed by concatenating multiple db_ids.

For entity 10192393_D003110|D009369_5 above, the actual normalization is:

{'id': '10192393_D003110|D009369_5', ...,
  'normalized': [{'db_name': 'mesh', 'db_id': 'D003110|D009369'}]}

Similarly, the actual normalization for entity 7790377_OMIM:202370|OMIM:214100_2 is:

{'id': '7790377_OMIM:202370|OMIM:214100_2', ...,
  'normalized': [{'db_name': 'omim', 'db_id': 'OMIM:202370|OMIM:214100'}]}

Note that db_name also differs between the expected and actual results ('MESH' vs. 'mesh', 'OMIM' vs. 'omim'). This is part of the inconsistent naming problem described next.

Inconsistent DB naming

In MedMentions ST21PV, an extra UMLS: is prepended to every db_id. This means that the first entity normalization in MedMentions ST21PV is:

{'id': '1', 'type': 'T116', 'text': ['DCTN4'], 'offsets': [[0, 5]], 'normalized': [{'db_name': 'UMLS', 'db_id': 'UMLS:C4308010'}]}
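
Removing the redundant prefix could look roughly like this. The helper strip_db_prefix is hypothetical, and it assumes the prefix is always the db_name followed by a colon, compared case-insensitively.

```python
def strip_db_prefix(norm):
    """Drop a leading '<db_name>:' from db_id, if present (hypothetical helper)."""
    prefix = norm['db_name'] + ':'
    db_id = norm['db_id']
    # Compare case-insensitively so 'umls:C...' and 'UMLS:C...' are both handled.
    if db_id.upper().startswith(prefix.upper()):
        db_id = db_id[len(prefix):]
    return {'db_name': norm['db_name'], 'db_id': db_id}


fixed = strip_db_prefix({'db_name': 'UMLS', 'db_id': 'UMLS:C4308010'})
```

An identifier that has no such prefix is returned unchanged, so the helper can be applied to MedMentions Full as well.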

The name of the NCBI Gene database is inconsistent between GNormPlus and NLM-Gene.

First entity of GNormPlus:

{'id': '16', 'type': 'Gene', 'text': ['SYT'], 'offsets': [[26, 29]], 'normalized': [{'db_name': 'NCBI', 'db_id': '6760'}]}

First entity of NLM-Gene:

{'id': '18', 'type': 'GENERIF', 'text': ['Brat'], 'offsets': [[0, 4]], 'normalized': [{'db_name': 'NCBI Gene identifier', 'db_id': '35197'}]}

A similar problem is observed between BC5CDR ('db_name':'MESH') and NCBI-Disease ('db_name':'mesh')
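
To find further mismatches of this kind, it may help to tally the db_name values that each dataset actually emits. The following is a minimal sketch: count_db_names is a hypothetical helper that takes the same nested structure as data['train']['entities'] from the reproduction script, and the sample data here is made up purely for illustration.

```python
from collections import Counter


def count_db_names(entity_lists):
    """Tally db_name values over a list of per-document entity lists."""
    counts = Counter()
    for ent_list in entity_lists:
        for ent in ent_list:
            for norm in ent['normalized']:
                counts[norm['db_name']] += 1
    return counts


# Toy input mimicking the shape of data['train']['entities'].
sample = [
    [{'normalized': [{'db_name': 'mesh', 'db_id': 'D003110'}]}],
    [{'normalized': [{'db_name': 'MESH', 'db_id': 'D009369'}]},
     {'normalized': []}],
]
counts = count_db_names(sample)
```

Running this over each dataset's splits and comparing the keys would surface spelling variants such as 'MESH' and 'mesh' side by side.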

Environment info

datasets version: 2.1.0
Platform: macOS-12.3.1-arm64-arm-64bit
Python version: 3.9.7
PyArrow version: 9.0.0
Pandas version: 1.3.3
