dmis-lab / BERN2

BERN2: an advanced neural biomedical named entity recognition and normalization tool
http://bern2.korea.ac.kr
BSD 2-Clause "Simplified" License

Use preferred bioregistry prefixes for normalized entity identifiers #3

Closed dhimmel closed 2 years ago

dhimmel commented 2 years ago

Great to see that BERN2 normalizes entities to compact identifiers in resource:identifier format. I noticed that there is an opportunity to standardize the prefixes used with Bioregistry:

FYI I didn't check all the entity types BERN2 is capable of tagging for whether they use the preferred prefix.

@cthoyt might also be helpful here.

cthoyt commented 2 years ago

I’d be happy to help. I’d like to try running BERN2 locally myself, and I’m sure doing so would make it easier to evaluate whether the results are useful.

mjeensung commented 2 years ago

Hi @dhimmel

Thank you for your suggestions for improving BERN2.

Do you mean that it is more standardized to use NCBIGene:10533 (for gene/protein) and NCBITaxon:10095 (for species) instead of EntrezGene:10533 and NCBI:txid10095?

dhimmel commented 2 years ago

> Do you mean that it is more standardized to use NCBIGene:10533 (for gene/protein) and NCBITaxon:10095 (for species) instead of EntrezGene:10533 and NCBI:txid10095?

Exactly. I see a benefit if all tagged entities are represented as Bioregistry-supported CURIEs, to make integration with other datasets as straightforward as possible.

Additional notes:

cthoyt commented 2 years ago

I just added in EntrezGene to https://bioregistry.io/registry/ncbigene, but I don't really see a standardized way to reconcile things that look like NCBI:txid10095 since it doesn't follow the spirit of CURIEs. You could always do some text-based preprocessing if you get stuff like this.
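A minimal sketch of what such text-based preprocessing could look like, using only the two examples from this thread. The `PREFIX_MAP` table and the `standardize_curie` function are hypothetical names for illustration, not part of BERN2 or Bioregistry, and the mapping would need extending for other entity types:

```python
import re

# Hypothetical mapping from prefixes seen in this thread to the
# Bioregistry preferred prefixes; extend it for other entity types.
PREFIX_MAP = {
    "EntrezGene": "NCBIGene",
}

# Matches species identifiers of the form NCBI:txid10095.
TXID_PATTERN = re.compile(r"^NCBI:txid(\d+)$")


def standardize_curie(curie: str) -> str:
    """Rewrite an identifier to use a preferred prefix when one is known."""
    match = TXID_PATTERN.match(curie)
    if match:
        # NCBI:txid10095 is not a regular CURIE, so it gets special handling.
        return f"NCBITaxon:{match.group(1)}"
    prefix, sep, identifier = curie.partition(":")
    if not sep:
        # Not in prefix:identifier form, so leave it untouched.
        return curie
    return f"{PREFIX_MAP.get(prefix, prefix)}:{identifier}"


print(standardize_curie("EntrezGene:10533"))
print(standardize_curie("NCBI:txid10095"))
```

Identifiers with a prefix that is not in the table are passed through unchanged, so this can be applied to mixed output without touching entity types that were not checked.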

mjeensung commented 2 years ago

@dhimmel @cthoyt

Thank you for your suggestions! We will consider replacing the current prefixes with the prefixes in Bioregistry.

mjeensung commented 2 years ago

Hi @cthoyt,

I was checking Bioregistry and noticed something that I'd like to clarify. While the Entrez Gene ID has the preferred prefix NCBIGene, the MESH ID does not and instead uses the prefix mesh in its CURIE. My first question is why some CURIEs use a preferred prefix while others use only the standard prefix. My second question is whether it is common for CURIEs to use lowercase prefixes such as mesh:C063233 rather than MESH:C063233, as BERN2 does.

mjeensung commented 2 years ago

https://github.com/dmis-lab/BERN2/commit/bbad178247047faf58ff204ea6adb383ae86717f