geneontology / go-annotation

This repository hosts the tracker for issues pertaining to GO annotations.
BSD 3-Clause "New" or "Revised" License

Incorrect UniProt KW mapping caused by simple string matching in sequence records. #3980

Closed sjm41 closed 8 months ago

sjm41 commented 2 years ago

NOTE: the KW-0418 mapping and individual genes below are just examples to demonstrate a wider problem.

I'm not sure how the keywords are identified/propagated from EMBL records. The word 'kinase' does appear in the gene/protein name or synonyms in all these records, but not because the protein is a kinase. E.g. for Q9V3H5/Cdk5alpha, the EMBL record has "product="cyclin-dependent kinase 5 regulatory subunit"".

Could that explain how 'kinase' is being added as a keyword to the UniProt entries (and thus the incorrect GO is getting added)?? If so, I think that logic needs to change - can't rely on 'kinase' in a gene name to indicate that the product is a kinase.

(Let me know if this ticket is better directed to UniProt directly rather than on this tracker.)

@Antonialock @hattrill

Antonialock commented 2 years ago

Hi Steven, I was told by Michele that this is a problem with the program that creates TrEMBL entries from ENA records. It adds keywords based on information in the ENA records and it looks like it's seeing that these have "kinase" and adding the "Kinase" keyword. I have reported this to the production list at uniprot-prod@ebi.ac.uk

sjm41 commented 2 years ago

Thanks Antonia - I thought something like that must be the explanation. I'll leave this ticket open for now in case others come across the same problem - could you update here when you hear back from uniprot-prod? (I think I've just come across some similar cases for the 'transferase' KW and 'transferase activity', so it seems like this might be a more general issue... do you want those examples too?)

Antonialock commented 2 years ago

Yes, will report back.

More than happy to pass on more instances but I assume that at best they'll fix them on a case by case basis.

sjm41 commented 2 years ago

OK, thanks. I'll likely send you some more examples by email.

sjm41 commented 2 years ago

For the record: I found a total of ~25 fly enzymes that are getting incorrect GO annotation via the KW2GO pipeline just because their submitted name in an EMBL accession contained the string 'kinase', 'transferase', 'ligase', 'endonuclease' or 'protease'.

E.g. PICK1 (X2JDZ1) has the submitted name "Protein interacting with C kinase 1", which results in the UP record getting 'kinase' as a keyword, and thus a KW2GO of "kinase activity"...but it's not a kinase itself.

Dushi at UP has confirmed this is the result of UP scripts adding KW based on simple string matching in the submitted name. This might often generate correct results but must be wrong in many cases (as in the examples above). He added that "Although these rules are correct when they were first introduced, we are getting more and more requests to remove them now."
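To make the failure mode concrete, here is a minimal sketch of what keyword assignment by simple substring matching could look like. It is an illustration only, not UniProt's actual script: the `KEYWORD_TRIGGERS` table and the `naive_keywords` function are hypothetical names, and the trigger strings are just the five mentioned above.

```python
# Hypothetical illustration of naive keyword assignment; NOT UniProt's real code.
# Trigger substrings are the five strings named in this thread.
KEYWORD_TRIGGERS = {
    "kinase": "Kinase",
    "transferase": "Transferase",
    "ligase": "Ligase",
    "endonuclease": "Endonuclease",
    "protease": "Protease",
}


def naive_keywords(submitted_name):
    """Return every keyword whose trigger substring occurs in the name."""
    name = submitted_name.lower()
    return {kw for trigger, kw in KEYWORD_TRIGGERS.items() if trigger in name}


# Submitted names taken from the examples in this ticket.
print(naive_keywords("Protein interacting with C kinase 1"))   # {'Kinase'}
print(naive_keywords("serine protease inhibitor (serpin-6)"))  # {'Protease'}
```

Under this assumption, any name that merely mentions an enzyme class (an interactor, an inhibitor, a non-catalytic subunit) picks up the keyword, and the KW2GO mapping then turns it into an activity annotation.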

Anyone else reading this support UP dropping KW based on string matching?

Antonialock commented 2 years ago

+1 :-)

ValWood commented 1 year ago

"Anyone else reading this support UP dropping KW based on string matching?"

Seems to be a popular suggestion. Can we bring it up with UniProt @sandraorchard?

Antonialock commented 10 months ago

I have asked Dushi if there are any updates on this.

Antonialock commented 10 months ago

@sjm41 message from Dushi:

"I don't think we would be able to completely stop adding KWs based on ENA descriptions. However we can fix the rules to be more precise. For example, In this release we updated our rules to not to add "kinase" KW if the EMBL description is having "kinase inhibitor". You can provide such list for us and we will clean the existing KWs and update our rules. In the meantime, I will remove KW Kinase from the following:

UP_acc  FBgn_ID      gene_symbol  gene_name
Q9V3H5  FBgn0027491  Cdk5alpha    Cdk5 activator-like protein
Q9VHN1  FBgn0037613  Cks85A       Cyclin-dependent kinase subunit 85A
Q9W3R6  FBgn0029944  Dok          Downstream of kinase
X2JDZ1  FBgn0032447  PICK1        Protein interacting with C kinase 1
Q7YZ95  FBgn0037236  Skp2         S-phase kinase-associated protein 2
Q9VDD2  FBgn0264357  SNF4Agamma   SNF4/AMP-activated protein kinase gamma subunit
D3DML3  FBgn0044323  Cka          Connector of kinase to AP-1"
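The rule change described in the message (skip the "kinase" KW when the description contains "kinase inhibitor") can be sketched roughly as follows. This is a guess at the general shape, not the production rule set: `TRIGGERS`, `BLOCKED_PHRASES` and `keywords_with_exclusions` are hypothetical names, and the only exclusion encoded is the one quoted above.

```python
# Hypothetical sketch of trigger matching with an exclusion list.
# Only the "kinase inhibitor" exclusion comes from the message above;
# everything else here is an assumption made for illustration.
TRIGGERS = {"kinase": "Kinase", "protease": "Protease"}
BLOCKED_PHRASES = {"Kinase": ["kinase inhibitor"]}


def keywords_with_exclusions(description):
    """Assign keywords by substring match unless a blocked phrase is present."""
    text = description.lower()
    assigned = set()
    for trigger, keyword in TRIGGERS.items():
        if trigger not in text:
            continue
        if any(phrase in text for phrase in BLOCKED_PHRASES.get(keyword, [])):
            continue  # description matches an exclusion, so skip this keyword
        assigned.add(keyword)
    return assigned


# An invented description that hits the exclusion:
print(keywords_with_exclusions("cyclin-dependent kinase inhibitor"))  # set()
# A real example from the list above that no exclusion covers:
print(keywords_with_exclusions("Downstream of kinase"))  # {'Kinase'}
```

The second call shows the limitation: an exclusion list only helps for phrases someone has already reported, so descriptions like "Downstream of kinase" still receive the keyword until they are added one by one.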

Antonialock commented 10 months ago

so we can string match things to be blocked @ValWood @hattrill

ValWood commented 10 months ago

I wonder if there is any real need to have a GO KW mapping to kinase, since likely all kinases will get a mapping via PAINT, EC, or InterPro2GO (note, I'm not objecting to the KW, just the mapping; we could clear out some GO KW mappings that might contain false positives but are unlikely to provide any unique annotation).

hattrill commented 10 months ago

There is some information that is missing here for me: What would be the impact of removing "KWs based on ENA descriptions"? As @ValWood points out, there are already a lot of pipelines generating annotations across species.
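One way to estimate that impact would be to count the annotations that only the keyword pipeline supplies. The sketch below assumes annotations are available as (protein, GO term, source) rows, for example extracted from a GAF or GPAD file; the rows, the source labels and the `unique_to_source` helper are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical (protein, GO term, source) rows; real input would be parsed
# from an annotation file. Protein IDs and source labels are made up.
annotations = [
    ("P1", "GO:0016301", "UniProtKB-KW"),
    ("P1", "GO:0016301", "InterPro"),
    ("P2", "GO:0016301", "UniProtKB-KW"),
    ("P3", "GO:0016740", "InterPro"),
]


def unique_to_source(rows, source):
    """Return (protein, GO term) pairs supported by `source` and nothing else."""
    sources_by_annotation = defaultdict(set)
    for protein, go_id, src in rows:
        sources_by_annotation[(protein, go_id)].add(src)
    return {key for key, srcs in sources_by_annotation.items() if srcs == {source}}


print(unique_to_source(annotations, "UniProtKB-KW"))  # {('P2', 'GO:0016301')}
```

This compares exact term matches only. A fuller analysis would also treat a keyword-derived annotation as redundant when another pipeline already provides a more specific child term, which requires walking the ontology.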

sjm41 commented 10 months ago

Thanks for chasing @Antonialock ! I appreciate Dushi saying we can block string matches, and that he's removed the KW kinase from those 7 Dmel examples, but that's not really a scalable solution - the problem is not restricted to kinases, or to flies, and curators can't be expected to check through new description lines with every ENA update in case more offenders are added. @ValWood suggests dropping KW mapping for kinases, but how about dropping KW mapping (i.e. not keywords, but just the GO mapping) entirely? How much unique accuracy does it really add nowadays (now we have other automated pipelines), even for relatively uncharacterized proteomes?

I see I referred to an extended list of problematic Dmel entries in a previous post, but never posted it here, so doing that below - note, these were problems when I did a systematic search 2 years ago - these are still problems today, but I haven't checked to see if additional problems have crept in. As you can see, it's not just kinases, and the problematic words are mostly not part of 'blockable strings'....

UniProtKB-KW:KW-0418
FBgn0027491/Cdk5alpha (Q9V3H5) gets 'Kinase' KW via EMBL:AAF36977.1 ("cyclin-dependent kinase 5 regulatory subunit") => GO:0016301 kinase activity
FBgn0037613/Cks85A (Q9VHN1) gets 'Kinase' KW via EMBL:AAF54272.1 ("Cyclin-dependent kinase subunit 85A, isoform A") => GO:0016301 kinase activity
FBgn0029944/Dok (Q9W3R6) gets 'Kinase' KW via EMBL:AAF46253.1 ("downstream of kinase") => GO:0016301 kinase activity
FBgn0032447/PICK1 (X2JDZ1) gets 'Kinase' KW via EMBL:AHN54395.1 ("protein interacting with C kinase 1, isoform E") => GO:0016301 kinase activity
FBgn0037236/Skp2 (Q7YZ95) gets 'Kinase' KW via EMBL:AAF52144.3 ("S-phase kinase-associated protein 2, isoform E") => GO:0016301 kinase activity
FBgn0264357/SNF4Agamma (Q9VDD2) gets 'Kinase' KW via EMBL:AAF55864.2 ("SNF4/AMP-activated protein kinase gamma subunit, isoform F") => GO:0016301 kinase activity
FBgn0044323/Cka (D3DML3) gets 'Kinase' KW via EMBL:AFH03610.1 ("connector of kinase to AP-1, isoform E") => GO:0016301 kinase activity
=> 'kinase' string is in DE and FT fields, but they aren't kinases

UniProtKB-KW:KW-0808
FBgn0034277/OstDelta (Q7K110) gets 'Transferase' KW via EMBL:AAF57793.1 ("oligosaccharide transferase delta subunit") => GO:0016740 transferase activity
FBgn0032015/Ostgamma (Q8SY53) gets 'Transferase' KW via EMBL:AAF52636.2 ("oligosaccharide transferase gamma subunit") => GO:0016740 transferase activity
FBgn0036470/EAChm (Q9VUK1) gets 'Transferase' KW via EMBL:AAF49675.1 ("enhancer of acetyltransferase chameau") => GO:0016740 transferase activity
FBgn0031020/Naa15-16 (Q9VWI2) gets 'Transferase' KW via EMBL:AAF48957.1 ("N(alpha)-acetyltransferase 15/16, isoform A") => GO:0016740 transferase activity
=> 'transferase' string is in DE and FT fields, but non-catalytic subunit

UniProtKB-KW:KW-0436
FBgn0046114/Gclm (Q9VCW6) gets 'ligase' KW via EMBL:AAF56039.1 ("Glutamate-cysteine ligase modifier subunit") => GO:0016874 ligase activity
=> 'ligase' string is in DE and FT fields, but non-catalytic subunit

UniProtKB-KW:KW-0255
FBgn0036266/Tsen54 (Q9VTV4) gets 'endonuclease' KW via EMBL:AAF49941.2 ("tRNA splicing endonuclease subunit 54") => GO:0004519 endonuclease activity
FBgn0050343/Tsen15 (A1Z7S9) gets 'Endonuclease' KW via EMBL:AAM68822.1 ("tRNA splicing endonuclease subunit 15, isoform A") => GO:0004519 endonuclease activity
=> 'endonuclease' string is in DE and FT fields, but non-catalytic subunit

UniProtKB-KW:KW-0645
FBgn0028983/Spn55B (Q7JV69) gets 'protease' KW via EMBL:CAB63101.1 ("serine protease inhibitor (serpin-6)") => GO:0008233 peptidase activity
=> 'protease' string is in DE and FT fields, but is protease inhibitor

UniProtKB-KW:KW-0378
FBgn0037773/CG5359 (Q9VH45) gets 'Hydrolase' KW via EMBL:AJ863565.1 => GO:0016787 hydrolase activity
=> 'hydrolase' is in KW field, though it shouldn't be since this is a dynein light chain

alexsign commented 9 months ago

@sjm41 Hi Steven, I don't think we are ready to remove the GO to KW mappings. They still give us around 190 million annotations which are not covered by other methods. The UniProt production team is looking into fixing the issues you raised, and they are very thankful for the detailed report. As I understand it, they are not using GitHub, so feel free to email them directly in the future at uniprot-prod@ebi.ac.uk

Antonialock commented 8 months ago

Closing.