biothings / mygene.info

MyGene.info: A BioThings API for gene annotations
http://mygene.info

Wrong gene coordinate for UBE2V2 #48

Closed pwwang closed 6 years ago

pwwang commented 6 years ago

http://mygene.info/v3/gene/7336

"genomic_pos_hg19": [
    {
      "chr": "20",
      "end": 48732496,
      "start": 48697661,
      "strand": -1
    },
    {
      "chr": "8",
      "end": 48977268,
      "start": 48920960,
      "strand": 1
    }
  ]

The right coordinate should be the second one. Not sure where the first one comes from.
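For anyone reproducing this, here is a minimal sketch of picking out the entries on a given chromosome. It works on the JSON fragment above copied into a Python dict rather than on a live request; the helper name `positions_on_chr` is made up for illustration and is not part of any MyGene.info client.

```python
# Fragment of the response quoted in this issue, copied as-is.
doc = {
    "genomic_pos_hg19": [
        {"chr": "20", "end": 48732496, "start": 48697661, "strand": -1},
        {"chr": "8", "end": 48977268, "start": 48920960, "strand": 1},
    ]
}


def positions_on_chr(doc, chrom, field="genomic_pos_hg19"):
    """Return the entries of `field` located on chromosome `chrom`.

    The field may hold a single dict or a list of dicts, so it is
    normalized to a list before filtering.
    """
    value = doc.get(field, [])
    if isinstance(value, dict):
        value = [value]
    return [pos for pos in value if pos["chr"] == chrom]


chr8 = positions_on_chr(doc, "8")
print(chr8)
```

Filtering by chromosome only works here because the expected chromosome (8) is already known; it is not a general fix.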

newgene commented 6 years ago

@pwwang this "genomic_pos_hg19" field is based on the last Ensembl release on GRCh37 (hg19). Most likely that release contains a pseudogene that is also mapped to Entrez Gene 7336, which is why you see two positions for gene 7336.

Since the Ensembl release switched to GRCh38, the "genomic_pos_hg19" values have been kept as they were. The "genomic_pos" field (based on GRCh38), by contrast, is always updated.

We will run a round of sanity checks on the "genomic_pos_hg19" field and remove as many incorrect positions as we can. For example, if that additional Ensembl gene is no longer mapped to gene 7336 in the current Ensembl release, we will remove its position from "genomic_pos_hg19" as well.
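The sanity check described above could be sketched roughly as follows. This is only an illustration under assumptions: it supposes each hg19 position carries the Ensembl gene ID it came from under a hypothetical "ensemblgene" key, and that the IDs currently mapped to the Entrez gene are available as a set. The Ensembl IDs below are placeholders, not real identifiers.

```python
def prune_hg19_positions(positions, current_ensembl_ids):
    """Drop positions whose source Ensembl gene is no longer mapped.

    ASSUMPTION: each position dict has an "ensemblgene" key (hypothetical).
    Positions without that key are kept, since they cannot be judged.
    """
    kept = []
    for pos in positions:
        ensembl_id = pos.get("ensemblgene")
        if ensembl_id is None or ensembl_id in current_ensembl_ids:
            kept.append(pos)
    return kept


# Placeholder IDs for illustration only.
positions = [
    {"chr": "20", "start": 48697661, "end": 48732496, "strand": -1,
     "ensemblgene": "ENSG_PLACEHOLDER_A"},
    {"chr": "8", "start": 48920960, "end": 48977268, "strand": 1,
     "ensemblgene": "ENSG_PLACEHOLDER_B"},
]
currently_mapped = {"ENSG_PLACEHOLDER_B"}

kept = prune_hg19_positions(positions, currently_mapped)
print(kept)
```

Keeping entries that lack an ID is a conservative choice: the check only removes positions it can positively identify as stale.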

pwwang commented 6 years ago

@newgene Thanks for the reply. That sounds good. Probably switching to GRCh38 would help.

sirloon commented 6 years ago

1st column: gene ID
2nd column (with "|" as separator): chromosomes found in genomic_pos_hg19 (only when it's a list)
3rd column: chromosomes found in genomic_pos (hg38), if any

poshg19hg38.txt
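A rough sketch of reading one row in the layout described above. Assumptions: columns are separated by whitespace (the actual delimiter in the attachment is not stated here), the 3rd column also uses "|" when it holds several values, and the sample line is made up for illustration rather than copied from the file.

```python
def parse_row(line):
    """Split one row into (gene_id, hg19_chrs, hg38_chrs).

    ASSUMPTION: whitespace separates the columns, and "|" separates
    chromosomes inside the 2nd and 3rd columns. The 3rd column may be
    missing when the gene has no hg38 position.
    """
    fields = line.split()
    gene_id = fields[0]
    hg19_chrs = fields[1].split("|")
    hg38_chrs = fields[2].split("|") if len(fields) > 2 else []
    return gene_id, hg19_chrs, hg38_chrs


# Made-up example line ("GENE_ID" is a placeholder, not a real gene ID).
row = parse_row("GENE_ID\t20|8\t8")
print(row)
```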

newgene commented 6 years ago

After looking at the list @sirloon posted above, we can filter it down to 714 rows that could be "fixed" based on their hg38 position (when the hg38 position comes from a single "chr" value, and that chr appears only once in the hg19 position):

poshg19hg38_fixable_by_hg38.txt
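The "fixable" rule above can be written as a small predicate. This is a sketch of the rule as stated in this comment, not the script actually used to produce the file; the function name `fixable_by_hg38` is made up for illustration.

```python
def fixable_by_hg38(hg19_chrs, hg38_chrs):
    """Return True when the hg19 positions can be resolved from hg38.

    Rule: the hg38 positions all sit on a single chromosome, and that
    chromosome appears exactly once among the hg19 positions.
    """
    unique_hg38 = set(hg38_chrs)
    if len(unique_hg38) != 1:
        return False
    (chrom,) = unique_hg38
    return hg19_chrs.count(chrom) == 1


# hg19 lists chr 20 and chr 8; if hg38 only has chr 8, keep the chr 8 entry.
print(fixable_by_hg38(["20", "8"], ["8"]))
# Two hg19 entries on the same chromosome stay ambiguous.
print(fixable_by_hg38(["8", "8"], ["8"]))
```

Rows that fail the predicate (no hg38 position, several hg38 chromosomes, or a repeated or absent chromosome in hg19) would still need another source to resolve.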

However, I would like to hold off on this fix for now. Ideally, we should get the hg19 genomic position data from NCBI, so that this issue is fixed at the data source.

I'm closing this issue for now, with a reference to the new issue I just created: #50.