SuLab / GeneWikiCentral

GeneWiki Organization
MIT License

Removing unique constraint from RefSeq Protein ID #98

Open djow2019 opened 6 years ago

djow2019 commented 6 years ago

Due to NCBI's RefSeq reannotation project, many RefSeq protein IDs are no longer unique to a specific organism. NCBI is merging proteins with identical structure under a single reference ID, so it will no longer be possible to look up one organism's protein using only the RefSeq protein ID. Two workarounds: 1) use UniProt IDs (which are still unique), or 2) combine the RefSeq protein ID with the taxonomy ID. For more information, see https://www.wikidata.org/wiki/Property_talk:P637#Remove_Distinct_Value_Constraint
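A minimal sketch of workaround 2, treating the pair (RefSeq protein ID, taxonomy ID) as the lookup key instead of the RefSeq protein ID alone. The record layout, the function names, and the sample IDs (`WP_EXAMPLE.1`, `Q_A`, `Q_B`, `1111`, `2222`) are hypothetical placeholders, not taken from any bot. The SPARQL string assumes the Wikidata properties P637 (RefSeq protein ID), P703 (found in taxon), and P685 (NCBI taxonomy ID); it is only built here, not executed.

```python
from collections import defaultdict


def index_by_refseq_and_taxon(records):
    """Group item IDs under the composite key (refseq_id, tax_id)."""
    index = defaultdict(list)
    for rec in records:
        index[(rec["refseq"], rec["taxid"])].append(rec["item"])
    return dict(index)


def build_sparql(refseq_id, tax_id):
    """Build a query for proteins matching both RefSeq ID and taxon.

    Assumes P637 = RefSeq protein ID, P703 = found in taxon,
    P685 = NCBI taxonomy ID.
    """
    return (
        "SELECT ?protein WHERE {\n"
        f'  ?protein wdt:P637 "{refseq_id}" ;\n'
        "           wdt:P703 ?taxon .\n"
        f'  ?taxon wdt:P685 "{tax_id}" .\n'
        "}"
    )


# Hypothetical data: one RefSeq protein ID shared by two organisms.
records = [
    {"refseq": "WP_EXAMPLE.1", "taxid": "1111", "item": "Q_A"},
    {"refseq": "WP_EXAMPLE.1", "taxid": "2222", "item": "Q_B"},
]

index = index_by_refseq_and_taxon(records)
query = build_sparql("WP_EXAMPLE.1", "1111")
```

With this key, the two organisms sharing `WP_EXAMPLE.1` resolve to separate entries, whereas a dictionary keyed on the RefSeq protein ID alone would collapse them into one.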

stuppie commented 6 years ago

Also, as we discussed, you'll be loading new taxa whose proteins have no UniProt IDs. I think this would only affect two bots. The first is the GO Annotation bot (https://github.com/SuLab/scheduled-bots/blob/master/scheduled_bots/geneprotein/GOBot.py); however, if those proteins have no annotations in QuickGO, it won't matter. The second is the InterPro bot (https://github.com/SuLab/scheduled-bots/blob/master/scheduled_bots/interpro/ProteinBot.py); again, its annotations are keyed by UniProt ID, so a protein with no UniProt ID gets no annotations either. I'll look into this once the new taxon is loaded.