RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

Incorrect Labels for some UniProtKB proteins in KG2 #1259

Closed isbluis closed 3 years ago

isbluis commented 3 years ago

A few dozen protein entries from UniProtKB appear to have incorrect labels, seen as having the string "Synonyms=xxxx" appended to them, e.g.: https://arax.ncats.io/devLM/index.html?term=UniProtKB:P55316

Full list:

Entry Label Type
UniProtKB:A5PKW4 PSDSynonyms=EFA6, protein
UniProtKB:O14493 CLDN4Synonyms=CPER protein
UniProtKB:O15047 SETD1ASynonyms=KIAA0339 protein
UniProtKB:O15523 DDX3YSynonyms=DBY protein
UniProtKB:O43504 LAMTOR5Synonyms=HBXIP protein
UniProtKB:O60231 DHX16Synonyms=DBP2, protein
UniProtKB:O94759 TRPM2Synonyms=EREG1 protein
UniProtKB:O95259 KCNH1Synonyms=EAG protein
UniProtKB:P00403 MT-CO2Synonyms=COII, protein
UniProtKB:P04629 NTRK1Synonyms=MTC, protein
UniProtKB:P32418 SLC8A1Synonyms=CNC, protein
UniProtKB:P43003 SLC1A3Synonyms=EAAT1 protein
UniProtKB:P51648 ALDH3A2Synonyms=ALDH10 protein
UniProtKB:P55316 FOXG1Synonyms=FKH2, protein
UniProtKB:P55735 SEC13Synonyms=D3S1231E, protein
UniProtKB:P59901 LILRA4Synonyms=ILT7 protein
UniProtKB:P60014 KRTAP10-10Synonyms=KAP10.10, protein
UniProtKB:P60412 KRTAP10-11Synonyms=KAP10.11, protein
UniProtKB:P60413 KRTAP10-12Synonyms=KAP10.12, protein
UniProtKB:Q01167 FOXK2Synonyms=ILF protein
UniProtKB:Q01650 SLC7A5Synonyms=CD98LC, protein
UniProtKB:Q06416 POU5F1BSynonyms=OCT4PG1, protein
UniProtKB:Q06432 CACNG1Synonyms=CACNLG protein
UniProtKB:Q0WX57 USP17L24Synonyms=USP17, protein
UniProtKB:Q13573 SNW1Synonyms=SKIIP, protein
UniProtKB:Q13698 CACNA1SSynonyms=CACH1, protein
UniProtKB:Q14494 NFE2L1Synonyms=HBZ17, protein
UniProtKB:Q15648 MED1Synonyms=ARC205, protein
UniProtKB:Q15811 ITSN1Synonyms=ITSN protein
UniProtKB:Q15848 ADIPOQSynonyms=ACDC, protein
UniProtKB:Q2NKX8 ERCC6LSynonyms=PICH protein
UniProtKB:Q3V6T2 CCDC88ASynonyms=APE protein
UniProtKB:Q5W186 CST9Synonyms=CLM protein
UniProtKB:Q5XXA6 ANO1Synonyms=DOG1 protein
UniProtKB:Q6ZR08 DNAH12Synonyms=DHC3, protein
UniProtKB:Q7L5Y9 MAEASynonyms=EMP protein
UniProtKB:Q7Z6J0 SH3RF1Synonyms=KIAA1494, protein
UniProtKB:Q86SK9 SCD5Synonyms=ACOD4 protein
UniProtKB:Q8IUG1 KRTAP1-3Synonyms=B2B, protein
UniProtKB:Q8IWB7 WDFY1Synonyms=FENS1 protein
UniProtKB:Q8IWT6 LRRC8ASynonyms=KIAA1437, protein
UniProtKB:Q8IXQ6 PARP9Synonyms=BAL protein
UniProtKB:Q8IXZ2 ZC3H3Synonyms=KIAA0150, protein
UniProtKB:Q8IZJ1 UNC5BSynonyms=P53RDL1 protein
UniProtKB:Q8NDX1 PSD4Synonyms=EFA6B protein
UniProtKB:Q8NHM5 KDM2BSynonyms=CXXC2, protein
UniProtKB:Q8TEC5 SH3RF2Synonyms=POSH3 protein
UniProtKB:Q92614 MYO18ASynonyms=CD245 protein
UniProtKB:Q92990 GLMNSynonyms=FAP48 protein
UniProtKB:Q96BI1 SLC22A18Synonyms=BWR1A, protein
UniProtKB:Q96GG9 DCUN1D1Synonyms=DCN1 protein
UniProtKB:Q96J94 PIWIL1Synonyms=HIWI protein
UniProtKB:Q96KP4 CNDP2Synonyms=CN2, protein
UniProtKB:Q96KV7 WDR90Synonyms=C16orf15, protein
UniProtKB:Q96PE5 OPALINSynonyms=HTMP10 protein
UniProtKB:Q9BSY9 DESI2Synonyms=C1orf121, protein
UniProtKB:Q9BUR5 APOOSynonyms=FAM121B, protein
UniProtKB:Q9BYW2 SETD2Synonyms=HIF1, protein
UniProtKB:Q9C000 NLRP1Synonyms=CARD7, protein
UniProtKB:Q9H4E5 RHOJSynonyms=ARHJ, protein
UniProtKB:Q9HD42 CHMP1ASynonyms=CHMP1 protein
UniProtKB:Q9NTI5 PDS5BSynonyms=APRIN, protein
UniProtKB:Q9NTJ5 SACM1LSynonyms=KIAA0851 protein
UniProtKB:Q9NWU2 GID8Synonyms=C20orf11, protein
UniProtKB:Q9NWV8 BABAM1Synonyms=C19orf62, protein
UniProtKB:Q9P0L9 PKD2L1Synonyms=PKD2L, protein
UniProtKB:Q9UBG3 CRNNSynonyms=C1orf10 protein
UniProtKB:Q9UPN3 MACF1Synonyms=ABP620, protein
UniProtKB:Q9Y2K7 KDM2ASynonyms=CXXC8, protein
UniProtKB:Q9Y4F9 RIPOR2Synonyms=C6orf32, protein

saramsey commented 3 years ago

Hi @isbluis thank you for bringing this to our attention.

This appears to be a bug in KG2. I verified the bug in KG2.5.1 using this Cypher query:

match (n {id: 'UniProtKB:A5PKW4'}) return n.id, n.name, n.full_name;

The results show a malformed name field, consistent with what was seen in the UI. In KG2 we have:

[Screenshot, 2021-02-11 5:00:47 PM]

isbluis commented 3 years ago

Great, thanks for looking into it @saramsey ! (and sorry if I abused the tag; was not sure which to use)

kvarforl commented 3 years ago

Okay, I'm making progress on this issue, and in doing so I noticed a small bug that was keeping the GN 'synonyms' from being appended to the node synonyms. I fixed this, but now some of the synonyms have evidence codes attached to them.

I'm assuming this is not desirable, so I'm going to remove them for now, but I'm noting it here in case we ever want to do something specific with them.

As an example, here are the GN lines of the uniprot_dat file entry for UniProtKB:Q9Y4F9:

GN   Name=RIPOR2;
GN   Synonyms=C6orf32, DIFF48, FAM65B, KIAA0386,
GN   PL48 {ECO:0000303|PubMed:9055809};
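For illustration, here is a minimal Python sketch of how GN lines like these could be split into a name and a synonym list with the evidence codes stripped. It is not the actual KG2 parser: `parse_gn_lines` and `EVIDENCE_TAG` are hypothetical names, and the sketch assumes the entry describes a single gene and that evidence codes always appear inside curly braces.

```python
import re

# Matches UniProt evidence tags such as {ECO:0000303|PubMed:9055809}.
EVIDENCE_TAG = re.compile(r"\s*\{[^}]*\}")


def parse_gn_lines(lines):
    """Return (name, synonyms) from the GN lines of one UniProt entry.

    Hypothetical helper for illustration; not the actual KG2 code.
    """
    # Drop the "GN" line code and join continuation lines with a space,
    # so a synonym list that wraps across lines stays intact.
    text = " ".join(line[2:].strip() for line in lines if line.startswith("GN"))
    # Remove evidence tags before splitting, since they may contain separators.
    text = EVIDENCE_TAG.sub("", text)

    name = None
    synonyms = []
    # Fields are separated by ';', e.g. "Name=RIPOR2; Synonyms=C6orf32, ...;"
    for field in text.split(";"):
        key, sep, value = field.strip().partition("=")
        if not sep:
            continue  # empty trailing field or a token without '='
        if key == "Name" and name is None:
            name = value.strip()
        elif key == "Synonyms":
            synonyms.extend(s.strip() for s in value.split(",") if s.strip())
    return name, synonyms


gn_lines = [
    "GN   Name=RIPOR2;",
    "GN   Synonyms=C6orf32, DIFF48, FAM65B, KIAA0386,",
    "GN   PL48 {ECO:0000303|PubMed:9055809};",
]
name, synonyms = parse_gn_lines(gn_lines)
print(name)      # RIPOR2
print(synonyms)  # ['C6orf32', 'DIFF48', 'FAM65B', 'KIAA0386', 'PL48']
```

Splitting on ';' before reading the `Name=` value is what keeps the `Synonyms=` field out of the name, and joining the lines with a space keeps a wrapped synonym such as PL48 in the list.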

isbluis commented 3 years ago

Great, thanks for fixing this, @kvarforl !

One idea that recently occurred to me is that it might be worth looking at the UniProt entries in KG2 that have the longest string values for name (say, the top 20 or 50) as a way to see if there are other potential parsing bugs still lurking, since most (short) protein names are only 4-6 characters long. Perhaps you are already doing this or something better, so apologies if this is not too useful.
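A rough Python sketch of this check, assuming the nodes are already available as (id, name) pairs. `longest_names` is a hypothetical helper, and the two `EXAMPLE:` entries are made-up placeholders rather than real KG2 nodes.

```python
import heapq


def longest_names(nodes, top_n=20):
    """Return the top_n (id, name) pairs with the longest names."""
    return heapq.nlargest(top_n, nodes, key=lambda node: len(node[1]))


# Two malformed labels reported in this issue plus two made-up clean ones.
sample = [
    ("EXAMPLE:1", "ABC1"),
    ("UniProtKB:P55316", "FOXG1Synonyms=FKH2,"),
    ("EXAMPLE:2", "XYZ22"),
    ("UniProtKB:Q9Y4F9", "RIPOR2Synonyms=C6orf32,"),
]

for node_id, name in longest_names(sample, top_n=2):
    print(len(name), node_id, name)
```

On this sample the two malformed labels sort to the top, which is the behavior the suggestion relies on.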

kvarforl commented 3 years ago

@isbluis this is a great idea! I've recently been pondering various ways to catch some of the KG2 bugs before someone has to stumble upon and discover them, but haven't gotten around to doing any of it. This sounds like a great place to start. Thanks for the suggestion!

isbluis commented 3 years ago

Excellent! Perhaps another way is to look for strings in that same field that contain non-alphanumeric characters (e.g. equals sign, comma, etc.). Perhaps even lowercase letters?
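A minimal Python sketch of such a character check, again over (id, name) pairs. `suspicious_names` and `UNEXPECTED` are hypothetical names, and the `EXAMPLE:1` entry is a made-up placeholder. The hyphen is treated as expected because names such as MT-CO2 and KRTAP10-10 contain one. The lowercase check is optional and likely to be noisy, since identifiers such as C1orf121 legitimately contain lowercase letters.

```python
import re

# Anything other than letters, digits, and hyphens is treated as unexpected.
UNEXPECTED = re.compile(r"[^A-Za-z0-9-]")


def suspicious_names(nodes, flag_lowercase=False):
    """Return the (id, name) pairs whose name contains unexpected characters."""
    flagged = []
    for node_id, name in nodes:
        has_unexpected = UNEXPECTED.search(name) is not None
        has_lowercase = flag_lowercase and any(c.islower() for c in name)
        if has_unexpected or has_lowercase:
            flagged.append((node_id, name))
    return flagged


sample = [
    ("UniProtKB:P00403", "MT-CO2Synonyms=COII,"),  # malformed label from this issue
    ("UniProtKB:Q9Y4F9", "RIPOR2"),
    ("EXAMPLE:1", "C1orf121"),  # made-up ID; lowercase but well-formed
]

print(suspicious_names(sample))
print(suspicious_names(sample, flag_lowercase=True))
```

With the default settings only the malformed label is flagged. Turning on `flag_lowercase` also flags C1orf121, so that mode would need manual review of its hits.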

kvarforl commented 3 years ago

Fixed in KG2.5.2.