How to improve named entity normalization for human proteins?

dhimmel commented 2 years ago

Very excited to see BERN2! Really nice work so far.

I'm looking to map certain mentions of proteins to standard identifiers. Here's a list of these proteins, where each protein is also followed by a direction of activity:

3 beta hydroxysteroid dehydrogenase 5 stimulator; AF4/FMR2 protein 2 inhibitor; Adenylate cyclase 2 stimulator; Alpha gamma adaptin binding protein p34 stimulator; BR serine threonine protein kinase 1 stimulator; Complement Factor B stimulator; DNA gyrase B inhibitor; Ectonucleotide pyrophosphatase-PDE-3 stimulator; Falcipain 1 stimulator; Homeobox protein Nkx 2.4 stimulator; ISLR protein inhibitor; Integrin alpha-IIb/beta-4 antagonist; Inter alpha trypsin inhibitor H5 stimulator; Interleukin receptor 17B antagonist; Isopropylmalate dehydrogenase stimulator; Methylthioadenosine nucleosidase stimulator; Patched domain containing protein 2 inhibitor; Protein FAM161A stimulator; Protocadherin gamma A1 inhibitor; Ring finger protein 4 stimulator; SMAD-9 inhibitor; Small ubiquitin related modifier 1 inhibitor; Sodium-dicarboxylate cotransporter-1 inhibitor; Sorting nexin 9 inhibitor; Sugar phosphate exchanger 2 stimulator; Transcription factor p65 stimulator; Tumor necrosis factor 14 ligand inhibitor; Ubiquitin-conjugating enzyme E21 stimulator; Unspecified ion channel inhibitor; Zinc finger BED domain protein 6 inhibitor

Using the nice web interface, I get:

So overall BERN2 does a good job recognizing the protein mentions. However, we actually already know what the protein text is, and are more interested in normalization. Most of the gene/protein mentions receive "ID: CUI-less". Any advice on how to improve the performance of named entity normalization for human proteins?
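Since the protein text is already known, each entry can be split into the protein mention and its direction of activity before it goes to a normalizer. A minimal sketch, assuming the only direction words are the three that appear in the list above (`split_entries` and `DIRECTIONS` are hypothetical names, not part of BERN2):

```python
import re

# Direction-of-activity words seen in the list above (an assumption; extend as needed).
DIRECTIONS = ("stimulator", "inhibitor", "antagonist")

# Match a direction word only at the very end of an entry, so that names such as
# "Inter alpha trypsin inhibitor H5" keep their internal "inhibitor".
_SUFFIX = re.compile(r"\s+(" + "|".join(DIRECTIONS) + r")$")


def split_entries(raw):
    """Split a semicolon-separated string into (protein, direction) pairs."""
    pairs = []
    for entry in raw.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        match = _SUFFIX.search(entry)
        if match:
            pairs.append((entry[: match.start()], match.group(1)))
        else:
            # No recognized direction word: keep the whole entry as the mention.
            pairs.append((entry, None))
    return pairs
```

Only the protein part of each pair would then be passed on for normalization.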

I see that the website notes that normalization is done by https://github.com/dmis-lab/BioSyn, so feel free to migrate this issue to that repo if it's best there.

mjeensung commented 2 years ago

Hi @dhimmel, thank you for your interest in BERN2.

BioSyn, the neural network normalizer, currently only supports disease and chemical types. Please note that we place an asterisk next to a CUI that has been normalized by 'BioSyn' (e.g., ID: MESH:D013217*).
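The ID conventions described here can be checked programmatically. A minimal sketch, assuming IDs arrive as plain strings in the forms shown in this thread (`CUI-less`, or an identifier with a trailing `*` for BioSyn); `classify_id` is a hypothetical helper, and the actual BERN2 output format may differ:

```python
def classify_id(raw_id):
    """Return (identifier, source) for an ID string in the forms shown in this thread.

    source is "unnormalized" for CUI-less, "BioSyn" for asterisk-marked IDs,
    and "other" for anything else (i.e. a non-BioSyn normalizer).
    """
    raw_id = raw_id.strip()
    if raw_id == "CUI-less":
        return (None, "unnormalized")
    if raw_id.endswith("*"):
        # Drop the trailing asterisk that marks a BioSyn-normalized ID.
        return (raw_id[:-1], "BioSyn")
    return (raw_id, "other")
```

Counting the `"unnormalized"` results over a batch of mentions gives a quick estimate of how many entities were left without an identifier.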

For the gene/protein type, we use GNormPlus, an off-the-shelf gene normalizer. The human proteins in your examples are entities that GNormPlus could not normalize.

If a better gene/protein normalizer is released in the future, we plan to replace the current one with it.

dhimmel commented 2 years ago

Thanks @mjeensung for the clarification. Feel free to post any leads on better gene/protein normalizers here... I'm happy to help evaluate.

Looking at the GNormPlus docs, it does "mention recognition and concept normalization". So are you able to just apply GNormPlus at the concept normalization stage for genes, while using the mention recognition from BERN2? I think the code I'm asking about is:

https://github.com/dmis-lab/BERN2/blob/20cef6bd0ff45d75031f5001ec2e78a2c21d1506/bern2/normalizer.py#L307-L401

cthoyt commented 2 years ago

Here's our off-the-shelf gene (and other entity) normalizer that's ready for use: https://github.com/indralab/gilda
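A minimal sketch of how a mention could be grounded with it, assuming Gilda's `gilda.ground(text)` returns scored matches that expose `.term.db`, `.term.id`, and `.score`. The grounding function is passed in as a parameter so the helper (`best_grounding`, a hypothetical name) does not itself depend on Gilda being installed:

```python
def best_grounding(mention, ground):
    """Return (db, id, score) for the highest-scoring match, or None if there is none.

    `ground` is any callable that maps a text string to a list of match objects
    with `.term.db`, `.term.id`, and `.score` attributes (e.g. `gilda.ground`).
    """
    matches = ground(mention)
    if not matches:
        return None
    # Pick the top match explicitly rather than relying on the list order.
    top = max(matches, key=lambda m: m.score)
    return (top.term.db, top.term.id, top.score)
```

With Gilda installed, `best_grounding("Transcription factor p65", gilda.ground)` would query its default grounder. A `None` result corresponds to a mention that stays unnormalized.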

mjeensung commented 2 years ago

@dhimmel, that's correct.

For genes, mentions are recognized by the BERN2 NER model (better performance than GNormPlus) and normalized by GNormPlus.

mjeensung commented 2 years ago

Thank you for recommending this great tool, @cthoyt. We will look into the tool, Gilda, and see if we can incorporate it into BERN2.
