Open stuppie opened 6 years ago
@newgene this issue would require adjusting the ID conversion in the uniprot parser. Currently it tries to convert uniprot_acc to an Entrez ID or, if that is not possible, to an Ensembl ID. If neither is available, the document is skipped. The fix would probably go somewhere around here: https://github.com/biothings/mygene.info/blob/master/src/hub/dataload/sources/uniprot/parser.py#L53. What do you think?
We need to give this one more thought. MyGene.info is supposed to be all about genes: if something is not a gene, it gets no record in MyGene.info. But I agree that including those uniprot IDs is useful, as genes and proteins are so often tied together. A protein with no associated gene ID just means the corresponding gene has not been identified yet; there should still be a gene somewhere in the genome encoding this protein.
With this in mind, I am not against the idea of giving such a document a "fake" gene ID placeholder and putting the corresponding uniprot ID within the document (so that this uniprot ID will be searchable).
One way of making this "fake" gene id is like this:
"_id": "NO_GENE_ID_FOR_A2NXD2"
This expands the gene _id priority list to three tiers: NCBI Gene ID --> Ensembl Gene ID --> NO_GENE_ID placeholder for Uniprot-only genes.
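A minimal sketch of that three-tier fallback, just to make the proposal concrete. This is not the current parser code: the function names, the argument names, and the flat `uniprot` field are hypothetical; only the `NO_GENE_ID_FOR_` prefix comes from the example above.

```python
from typing import Optional

# Prefix taken from the example _id above.
PLACEHOLDER_PREFIX = "NO_GENE_ID_FOR_"


def choose_gene_id(
    uniprot_acc: str,
    entrez_id: Optional[str] = None,
    ensembl_id: Optional[str] = None,
) -> str:
    """Pick the document _id by priority (hypothetical helper).

    Tier 1: NCBI (Entrez) Gene ID
    Tier 2: Ensembl Gene ID
    Tier 3: placeholder built from the uniprot accession
    """
    if entrez_id:
        return str(entrez_id)
    if ensembl_id:
        return str(ensembl_id)
    return PLACEHOLDER_PREFIX + uniprot_acc


def build_doc(
    uniprot_acc: str,
    entrez_id: Optional[str] = None,
    ensembl_id: Optional[str] = None,
) -> dict:
    """Build a document instead of skipping it when no gene ID is found.

    The flat "uniprot" field is an assumption made for this sketch.
    """
    return {
        "_id": choose_gene_id(uniprot_acc, entrez_id, ensembl_id),
        "uniprot": uniprot_acc,
    }
```

With no Entrez or Ensembl mapping, `build_doc("A2NXD2")` would produce the placeholder `_id` shown above, so the uniprot ID stays searchable. The prefix also makes these documents easy to filter out or to re-key later if a real gene ID turns up.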
Your opinions? @stuppie @sirloon @cyrus0824 @andrewsu
I think it would be useful for mygene to also store information about proteins with no associated Entrez record. For example: http://www.uniprot.org/uniprot/A2NXD2 http://www.uniprot.org/uniprot/Q5NV61