Closed. isbluis closed this issue 3 years ago.
Hi @isbluis thank you for bringing this to our attention.
This appears to be a bug in KG2. I verified the bug in KG2.5.1 using Cypher:
match (n {id: 'UniProtKB:A5PKW4'}) return n.id, n.name, n.full_name;
The results show a weird name field, consistent with what was seen in the UI. In KG2 we have:
Great, thanks for looking into it @saramsey ! (and sorry if I abused the tag; was not sure which to use)
Okay I'm making progress on this issue, and in doing so I noticed a small bug that was keeping the GN 'synonyms' from being appended to the node synonyms. I fixed this, but now some of the synonyms have evidence codes attached to them.
I'm assuming this is not desirable, so I'm going to remove them for now, but I'm noting it here in case we ever want to do something specific with them.
As an example, here are the GN lines of the uniprot_dat file entry for UniProtKB:Q9Y4F9:
GN Name=RIPOR2;
GN Synonyms=C6orf32, DIFF48, FAM65B, KIAA0386,
GN PL48 {ECO:0000303|PubMed:9055809};
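For reference, here is a minimal sketch of how GN lines like the ones above could be split into a name and a synonym list with the evidence codes stripped. This is not the actual KG2 parser: the function name `parse_gn_lines` and the regular expression are assumptions made for illustration, and the sketch only handles a single gene block (entries with several genes separated by `and` lines would need extra handling).

```python
import re

# Evidence tags look like {ECO:0000303|PubMed:9055809}; this pattern is an
# assumption based on the example entry, not taken from the KG2 code.
EVIDENCE_RE = re.compile(r"\s*\{[^}]*\}")


def parse_gn_lines(lines):
    """Return (name, synonyms) from the GN lines of one entry (hypothetical helper)."""
    # Drop the two-character "GN" line code and join continuation lines.
    text = " ".join(line[2:].strip() for line in lines if line.startswith("GN"))
    # Remove evidence codes before splitting into fields.
    text = EVIDENCE_RE.sub("", text)
    name = None
    synonyms = []
    # Fields are terminated by semicolons, e.g. "Name=...;" and "Synonyms=...;".
    for field in text.split(";"):
        field = field.strip()
        if field.startswith("Name="):
            name = field[len("Name="):].strip()
        elif field.startswith("Synonyms="):
            raw = field[len("Synonyms="):]
            synonyms = [s.strip() for s in raw.split(",") if s.strip()]
    return name, synonyms


gn_lines = [
    "GN   Name=RIPOR2;",
    "GN   Synonyms=C6orf32, DIFF48, FAM65B, KIAA0386,",
    "GN   PL48 {ECO:0000303|PubMed:9055809};",
]
name, synonyms = parse_gn_lines(gn_lines)
print(name, synonyms)
```

Joining the continuation lines before splitting on semicolons keeps a synonym list that wraps across lines in one field, so the synonyms never leak into the name.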
Great, thanks for fixing this, @kvarforl !
One idea that recently occurred to me is that it might be worth looking at the UniProt entries in KG2 that have the longest string values for name (say, the top 20 or 50) as a way to see if there are other potential parsing bugs still lurking, since most protein (short) names are only 4-6 characters long. Perhaps you are already doing this or something better, so apologies if this is not too useful.
@isbluis this is a great idea! I've recently been pondering various ways to catch some of the KG2 bugs before someone has to stumble upon them, but haven't gotten around to doing any of it. This sounds like a great place to start. Thanks for the suggestion!
Excellent! Perhaps another way is to look for strings in that same field that contain non-alphanumeric characters (e.g. equals sign, comma, etc.). Perhaps even lowercase letters?
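The two heuristics suggested in this thread (longest names, and names containing unexpected characters) could be sketched roughly as follows. This is only an illustration: `flag_suspicious_names`, the character set in `SUSPICIOUS_RE`, and the sample node IDs and names are all hypothetical, and the input is assumed to be (id, name) pairs already exported from the graph.

```python
import re

# Characters unlikely to appear in a short protein name; this set is an
# assumption and would need tuning against real data.
SUSPICIOUS_RE = re.compile(r"[=,;{}]")


def flag_suspicious_names(nodes, top_n=20):
    """Return (longest, odd_chars) lists of (node_id, name) pairs.

    nodes: iterable of (node_id, name) pairs; empty names are skipped.
    """
    nodes = [(node_id, name) for node_id, name in nodes if name]
    # Heuristic 1: the top_n longest names.
    longest = sorted(nodes, key=lambda pair: len(pair[1]), reverse=True)[:top_n]
    # Heuristic 2: names containing any suspicious character.
    odd_chars = [pair for pair in nodes if SUSPICIOUS_RE.search(pair[1])]
    return longest, odd_chars


# Hypothetical sample data, for illustration only.
sample = [
    ("EX:1", "GENE1"),
    ("EX:2", "GENE2 Synonyms=GENE2A, GENE2B"),
    ("EX:3", "GENE3"),
]
longest, odd_chars = flag_suspicious_names(sample, top_n=1)
print(longest)
print(odd_chars)
```

Flagging lowercase letters as well would only need `a-z` added to the character class, though that would likely produce more false positives.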
Fixed in KG2.5.2.
A few dozen protein entries from UniProtKB appear to have incorrect labels, seen as having the string "Synonyms=xxxx" appended to them, e.g.: https://arax.ncats.io/devLM/index.html?term=UniProtKB:P55316
Full list: