SuLab / GeneWikiCentral

Human proteins labeled as fanconi anemia disease subtypes #63

putmantime closed this issue 6 years ago

putmantime commented 6 years ago

Hi @stuppie, I noticed this while trying to make a gene to phenotype link from the human gene FANCA to the disease Fanconi anemia.

There are two items in Wikidata titled 'Fanconi Anemia Complementation Group A':

- https://www.wikidata.org/wiki/Q21101242, a subclass of protein that is linked to the gene FANCA through 'encoded by' (this is where the error seems to be)
- https://www.wikidata.org/wiki/Q32147067, an instance of disease that is a subclass of the disease Fanconi anemia

For some reason, the wikidata protein bot is giving this UniProt protein, http://www.uniprot.org/uniprot/O15360 the label of the 'Involvement in Disease' claim on the Uniprot record, as the main label on the WD item.

Here is the diff from when it happened.

Also, the gene has an encodes link to the unreviewed protein http://www.uniprot.org/uniprot/H3BQX1.

I don't know how widespread the issue is for other diseases, but here is a query that returns 10 human 'proteins' that have 'fanconi anemia complementation group' in the label.
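The linked query is not reproduced in this thread, so the following is only a minimal sketch of what such a label search might look like, not the query itself. It assumes the public Wikidata SPARQL endpoint and these identifiers: Q8054 (protein), Q15978631 (Homo sapiens), P703 (found in taxon), P352 (UniProt protein ID). The function names `build_label_query` and `run_query` are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_label_query(fragment: str, limit: int = 10) -> str:
    """Return SPARQL for human protein items whose English label contains `fragment`."""
    # Escape backslashes and double quotes so the fragment is a safe string literal.
    safe = fragment.lower().replace("\\", "\\\\").replace('"', '\\"')
    return f"""
SELECT ?protein ?proteinLabel ?uniprot WHERE {{
  # Protein items may be modelled as instance of or subclass of protein.
  ?protein wdt:P31|wdt:P279 wd:Q8054 ;
           wdt:P703 wd:Q15978631 ;
           rdfs:label ?proteinLabel .
  OPTIONAL {{ ?protein wdt:P352 ?uniprot . }}
  FILTER(LANG(?proteinLabel) = "en")
  FILTER(CONTAINS(LCASE(?proteinLabel), "{safe}"))
}}
LIMIT {int(limit)}
""".strip()


def run_query(query: str) -> list:
    """Send the query to the endpoint and return the result bindings (needs network)."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    req = urllib.request.Request(url, headers={"User-Agent": "label-check-sketch/0.1"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["results"]["bindings"]


query = build_label_query("fanconi anemia complementation group")
print(query)
```

Calling `run_query(query)` would execute it. A substring filter over all human protein labels may be slow on the public endpoint, so treat this as a starting point.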

Cheers

stuppie commented 6 years ago

Hey Tim, the protein names currently come from Entrez, which has FANCA titled "Fanconi anemia complementation group A". It has "Fanconi anemia group A protein" as an alias.

> For some reason, the wikidata protein bot is giving this UniProt protein, http://www.uniprot.org/uniprot/O15360 the label of the 'Involvement in Disease' claim on the Uniprot record, as the main label on the WD item.

Not sure what you mean?

> Also, the gene has an encodes link to the unreviewed protein http://www.uniprot.org/uniprot/H3BQX1.

Ya, I've removed a couple thousand of these already, and still need to remove the others. See https://github.com/SuLab/GeneWikiCentral/issues/18

putmantime commented 6 years ago

Sorry for the cryptic and complicated explanation. It's really just a confusing (maybe just for me) label issue and doesn't appear to be a content issue.

'Fanconi anemia complementation group (N)' is the naming pattern for subclasses of Fanconi anemia the disease, as in the items in this query. They are sourced from OMIM and DO.

The protein reference on the WD item is the UniProt record http://www.uniprot.org/uniprot/O15360, and that record has the label "Fanconi anemia group A protein" (that is what the WD item used to be labeled). That UniProt record also has an 'Involvement in Disease' section with the label "Fanconi anemia complementation group A".

I didn't realize the protein bot was using the gene label from NCBI, so now I see where that comes from. However, the NCBI protein records seem to agree with UniProt.

tl;dr There are several proteins and diseases with the same label and it seems like NCBI's gene label is different from the protein label. Not a big deal, just a tad confusing and at first looked like a bunch of duplicate items.
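To illustrate the kind of collision described here, a small sketch that groups items by case-insensitive label and reports labels shared between a protein and a disease. The two QIDs are the ones from this thread; the third record and the function name `shared_labels` are hypothetical, added only to show a label that does not collide.

```python
from collections import defaultdict


def shared_labels(items):
    """Group (qid, label, kind) records by label, keeping labels used by more than one kind."""
    by_label = defaultdict(list)
    for qid, label, kind in items:
        # Compare labels case-insensitively and ignore surrounding whitespace.
        by_label[label.strip().lower()].append((qid, kind))
    return {
        label: entries
        for label, entries in by_label.items()
        if len({kind for _, kind in entries}) > 1
    }


items = [
    ("Q21101242", "Fanconi Anemia Complementation Group A", "protein"),
    ("Q32147067", "Fanconi Anemia Complementation Group A", "disease"),
    # Hypothetical record: a protein item carrying the UniProt-style name instead.
    ("Q-hypothetical", "Fanconi anemia group A protein", "protein"),
]

for label, entries in shared_labels(items).items():
    print(label, entries)
```

With UniProt-style protein labels, the protein and disease items would no longer share a label, which is the point made above.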

stuppie commented 6 years ago

Ya, there are other instances where the UniProt name is better than the Entrez name, but as of now the data is coming from MyGene.info, which doesn't have data from UniProt. There isn't an agreed-upon name for these anyway; HGNC has "Fanconi anemia complementation group A". I agree it would be better to have the UniProt labels, but we can't do that now unless we start using UniProt directly.