clulab / bioresources

Data resources from the biomedical domain
Apache License 2.0

Some protein synonyms from UniProt missing from NER file #50

Open bgyori opened 3 years ago

bgyori commented 3 years ago

When testing the new protein chain grounding feature, I found something odd: several synonyms that I confirmed are listed in uniprot-proteins.tsv.gz (e.g., Spike glycoprotein) were not recognized in text at all, which points to an NER issue. I checked ner/Gene_or_gene_product.tsv.gz and found that these synonyms don't appear in that file. So I'm wondering whether we didn't update the NER files to match the latest UniProt groundings file, or whether some process filters out these synonyms when generating the NER file. It looks like I last changed uniprot-proteins.tsv.gz on Wed Sep 9 23:03:28 2020, and ner/Gene_or_gene_product.tsv.gz was updated on Thu Oct 15 17:26:23 2020, so that update should have included these synonyms. I'll keep looking into this and will post if I find something more.
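The comparison described above can be sketched in Python. This is only a minimal sketch, assuming both files are gzipped, tab-separated, and carry the synonym in the first column; the helper names `first_column` and `missing_synonyms` are hypothetical. If the NER file stores synonyms in a normalized or tokenized form, an exact string comparison like this one would over-report missing entries.

```python
import gzip


def first_column(path):
    """Return the set of first-column values of a gzipped TSV file."""
    values = set()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                values.add(line.split("\t")[0])
    return values


def missing_synonyms(grounding_path, ner_path):
    """Synonyms present in the grounding file but absent from the NER file.

    Assumption: both files hold the synonym in column 1 in the same
    surface form, which may not be true of the real NER file.
    """
    return first_column(grounding_path) - first_column(ner_path)
```

For example, `missing_synonyms("uniprot-proteins.tsv.gz", "ner/Gene_or_gene_product.tsv.gz")` would return the set of synonyms to investigate, under the assumptions stated above.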

bgyori commented 3 years ago

Oh, I think I figured it out: this has to do with species settings. Although Reach grounds to proteins from multiple species once NER picks up an entity, the NER file doesn't actually include any synonyms from species that aren't in the configured "valid species list". As far as I can gather, in ner_kb.config, "Human" and "Homo sapiens" in the line

uniprot-proteins»···Gene_or_gene_product»···Human»··Homo sapiens

means that only Human proteins are considered to be from a "valid species", as determined in this line: https://github.com/clulab/reach/blob/master/processors/src/main/scala/org/clulab/processors/bionlp/ner/KBGenerator.scala#L101.

So it seems like this is the correct behavior here but we'll have to think more and test what happens if we allow other species to be added in ner_kb.config. In any case, with the current setting, most viral proteins are never picked up by NER, even though we have their synonyms in the grounding table.
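To make the effect of the valid species list concrete, here is a rough Python sketch of the kind of filtering described above. It is not the actual KBGenerator code (which is Scala); the function name `filter_by_species`, the row layout, and the case-insensitive comparison are all assumptions made for illustration.

```python
def filter_by_species(rows, valid_species):
    """Keep only (synonym, identifier, species) rows with a valid species.

    Assumption: species names are compared case-insensitively after
    stripping whitespace; the real implementation may differ.
    """
    valid = {name.strip().lower() for name in valid_species}
    return [row for row in rows if row[2].strip().lower() in valid]


# Rows in the (synonym, identifier, species) layout of uniprot-proteins.tsv.
# The last row uses a placeholder identifier and species name.
rows = [
    ("100 kDa coactivator", "Q7KZF4", "Human"),
    ("100 kDa coactivator", "Q66X93", "Rat"),
    ("Spike glycoprotein", "PLACEHOLDER_ID", "Some virus"),
]

kept = filter_by_species(rows, ["Human", "Homo sapiens"])
```

With only "Human" and "Homo sapiens" configured, the rat and viral rows are dropped, so their synonyms never reach the NER file in this sketch.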

kwalcock commented 3 years ago

Thanks very much for the documentation. Someone will be searching for these details sometime in the future, probably me.

MihaiSurdeanu commented 3 years ago

Thanks @bgyori ! I think you are correct. So, what would be the correct solution? We could set all viral proteins to human? Or come up with a new "species" for viral proteins to capture all these proteins, even if they come from different actual species?

bgyori commented 3 years ago

There are several ways we can improve on this situation. Let's start with the options that continue the current approach where the state of the NER file is determined at compile time and not configurable during runtime.

First, if the goal is to add a specific subtree of the taxonomy to NER, like "Viruses", the most low-tech option is to duplicate every viral protein synonym in uniprot-proteins.tsv with a line where the species is "Viruses", and then change ner_kb.config to add Viruses along with Homo sapiens. This duplication would be similar to how species synonyms are currently written out redundantly in uniprot-proteins.tsv (both Human and Homo sapiens are listed):

    100 kDa coactivator»Q66X93»·Rat                                                 
    100 kDa coactivator»Q66X93»·Rattus norvegicus                                   
    100 kDa coactivator»Q78PY7»·Mouse                                               
    100 kDa coactivator»Q78PY7»·Mus musculus                                        
    100 kDa coactivator»Q7KZF4»·Homo sapiens                                        
    100 kDa coactivator»Q7KZF4»·Human                                               

Obviously, this solution is limited and requires some manual work so isn't very general.
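The duplication step for this first option could look roughly like the following Python sketch. It assumes rows are (synonym, identifier, species) tuples and that we already have a hand-curated set of viral species names, which is the manual part; `add_group_species` and its parameters are hypothetical names.

```python
def add_group_species(rows, member_species, group_name="Viruses"):
    """Append a relabeled copy of each row whose species is in member_species.

    rows: list of (synonym, identifier, species) tuples.
    member_species: set of species names to treat as part of the group
        (assumed to be curated by hand).
    Original rows are kept; duplicates are not added twice.
    """
    out = list(rows)
    seen = set(rows)
    for synonym, identifier, species in rows:
        if species in member_species:
            duplicate = (synonym, identifier, group_name)
            if duplicate not in seen:
                seen.add(duplicate)
                out.append(duplicate)
    return out
```

After writing the extended rows back to uniprot-proteins.tsv, "Viruses" would still need to be added to the species list in ner_kb.config for the new lines to have any effect.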

A more general solution would be to make use of the structure of the NCBI Taxonomy (https://www.ncbi.nlm.nih.gov/taxonomy) and generalize the implementation of containsValidSpecies to accept child terms in the taxonomy. This way, if we just add "Viruses" to ner_kb.config, it would automatically accept every organism that is a descendant of Viruses in the taxonomy, without us having to add additional/redundant lines to uniprot-proteins.tsv.
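As a rough illustration of the taxonomy-aware check, here is a Python sketch that walks a child-to-parent map. The real containsValidSpecies is Scala and its signature is not shown here, so `contains_valid_species`, `is_descendant`, and the `parent_of` map are assumptions; the toy map below is heavily abbreviated and omits intermediate ranks.

```python
def is_descendant(term, ancestor, parent_of):
    """True if term equals ancestor or has ancestor somewhere above it."""
    seen = set()
    current = term
    # Walk upward; the seen set guards against cycles in malformed input.
    while current is not None and current not in seen:
        if current == ancestor:
            return True
        seen.add(current)
        current = parent_of.get(current)
    return False


def contains_valid_species(species, valid_terms, parent_of):
    """True if species falls under any configured taxonomy term."""
    return any(is_descendant(species, term, parent_of) for term in valid_terms)


# Toy, abbreviated child -> parent map (intermediate ranks omitted).
parent_of = {
    "Severe acute respiratory syndrome coronavirus 2": "Coronaviridae",
    "Coronaviridae": "Viruses",
    "Homo sapiens": "Eukaryota",
}

valid_terms = ["Homo sapiens", "Viruses"]
accepted = contains_valid_species(
    "Severe acute respiratory syndrome coronavirus 2", valid_terms, parent_of
)
```

A real implementation would more likely key the map on NCBI Taxonomy identifiers than on names, since names are not unique and have synonyms.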

Finally, the more generally useful and flexible solution would be to make the choice of organisms included in NER configurable, for instance, at the level of Reach's application.conf. I am not sure if there are technical constraints that would make this difficult to implement but I can see how it might require more complex refactoring. Still, it's worth discussing the possibility of doing this. This would allow us to set the configuration adaptively based on annotations we have for a given (set of) paper(s) to choose the optimal set of organisms included in NER.

MihaiSurdeanu commented 3 years ago

Thanks @bgyori !

I like the last option too. Unfortunately, it is tricky to implement today because some offline steps need to happen between uniprot-proteins.tsv and the data that is actually loaded in Reach, and I believe these steps depend on the species. It is possible to do this, but it would take some engineering work.

Adding @enoriega and @kwalcock, who might be able to dedicate some cycles to this work in the new year.

bgyori commented 3 years ago

Sounds good! Happy to help if there are any questions about how to do this. Until then, I will try the first option, just to test what happens if we include some other species.

MihaiSurdeanu commented 3 years ago

Great. Please keep us posted.

On Tue, Dec 22, 2020 at 12:32 Benjamin M. Gyori notifications@github.com wrote:

Sounds good! Happy to help if there are any questions about how to do this. Until then, I will try the first option, just to test what happens if we include some other species.
