biothings / mygene.info

MyGene.info: A BioThings API for gene annotations
http://mygene.info

Proteins with no genes #31

Open stuppie opened 6 years ago

stuppie commented 6 years ago

I think it would be useful for mygene to also store information about proteins with no associated Entrez record. For example: http://www.uniprot.org/uniprot/A2NXD2 http://www.uniprot.org/uniprot/Q5NV61

sirloon commented 6 years ago

@newgene this issue would require adjusting the ID conversion in the uniprot parser. Currently it tries to convert uniprot_acc to an Entrez ID or, if that is not possible, to an Ensembl ID. If neither is available, the document is skipped. The fix is probably somewhere around here: https://github.com/biothings/mygene.info/blob/master/src/hub/dataload/sources/uniprot/parser.py#L53. What do you think?
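
For readers following along, here is a minimal sketch of the behaviour described above. It is not the actual parser code: the function name `convert_ids`, the two lookup dicts, and the sample IDs in the usage note are all hypothetical and only illustrate the Entrez-then-Ensembl fallback and the skip.

```python
from typing import Dict, Iterable, Iterator, Tuple


def convert_ids(
    uniprot_accs: Iterable[str],
    acc_to_entrez: Dict[str, str],   # hypothetical lookup: uniprot_acc -> Entrez ID
    acc_to_ensembl: Dict[str, str],  # hypothetical lookup: uniprot_acc -> Ensembl ID
) -> Iterator[Tuple[str, str]]:
    """Yield (gene_id, uniprot_acc) pairs; accessions with no gene ID are skipped."""
    for acc in uniprot_accs:
        # Prefer the Entrez ID, fall back to the Ensembl ID.
        gene_id = acc_to_entrez.get(acc) or acc_to_ensembl.get(acc)
        if gene_id is None:
            # No gene ID of either kind: the document is dropped.
            continue
        yield gene_id, acc
```

With made-up mappings such as `{"ACC_1": "1"}` for Entrez and `{"ACC_2": "ENSG_X"}` for Ensembl, calling this on `["ACC_1", "ACC_2", "A2NXD2"]` yields pairs for the first two accessions only; `A2NXD2` is silently dropped, which is the behaviour this issue is about.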

newgene commented 6 years ago

We need to give this one more thought. Strictly speaking, MyGene.info is all about genes: if something is not a gene, it gets no record in MyGene.info. But I agree that including those uniprot IDs is useful, as genes and proteins are so closely tied together. A protein with no associated gene ID just means the corresponding gene has not been identified yet; there should still be a gene somewhere in the genome encoding this protein.

With this in mind, I am not against the idea of giving such a document a "fake" gene id placeholder and putting the corresponding uniprot ID within the document (so that this uniprot ID will be searchable).

One way of making this "fake" gene id is like this:

"_id": "NO_GENE_ID_FOR_A2NXD2"

This expands the gene _id priority list to three tiers: NCBI Gene ID --> Ensembl Gene ID --> NO_GENE_ID for a Uniprot-only gene.
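
A rough sketch of how that three-tier choice could look, assuming the placeholder format shown above. The function name `choose_doc_id` and its parameters are hypothetical, not existing MyGene.info code.

```python
from typing import Optional

# Prefix taken from the example _id above.
PLACEHOLDER_PREFIX = "NO_GENE_ID_FOR_"


def choose_doc_id(
    uniprot_acc: str,
    entrez_id: Optional[str] = None,
    ensembl_id: Optional[str] = None,
) -> str:
    """Pick a document _id: NCBI Gene ID, then Ensembl Gene ID, then placeholder."""
    if entrez_id:
        return str(entrez_id)
    if ensembl_id:
        return ensembl_id
    # Uniprot-only record: build the "fake" gene id from the accession.
    return PLACEHOLDER_PREFIX + uniprot_acc
```

For instance, `choose_doc_id("A2NXD2")` returns `"NO_GENE_ID_FOR_A2NXD2"`, while any record that has an Entrez or Ensembl ID keeps that ID as its `_id`.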

Your opinions? @stuppie @sirloon @cyrus0824 @andrewsu