HazyResearch / dd-genomics

The Genomics DeepDive project
Apache License 2.0

Many genes are counted multiple times because they have different Ensembl IDs #153

Closed Colossus closed 9 years ago

Colossus commented 9 years ago

E.g., in our genepheno association list, we have:

HP:0001370 Rheumatoid arthritis ENSG00000204490:TNF 770
HP:0001370 Rheumatoid arthritis ENSG00000206439:TNF 770
HP:0001370 Rheumatoid arthritis ENSG00000223952:TNF 770
HP:0001370 Rheumatoid arthritis ENSG00000228321:TNF 770
HP:0001370 Rheumatoid arthritis ENSG00000228849:TNF 770
HP:0001370 Rheumatoid arthritis ENSG00000228978:TNF 770
HP:0001370 Rheumatoid arthritis ENSG00000230108:TNF 770
HP:0001370 Rheumatoid arthritis ENSG00000232810:TNF 770

(This is kinda the worst case though.)

We should get rid of those. It will presumably cut down our GP count a bit, but these different ENSG identifiers are just not useful. Perhaps pick one representative ENSG identifier and link it to all other ENSG IDs for the same gene, if that's possible.

chrismre commented 9 years ago

Agreed. The entity linking here (deciding to link to a single ID) is easy. I'd assume there is some meaning to that field, and I'd hate to lose information. However, you can produce a view that collapses them.
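A minimal sketch of what such a collapsing view could look like, written in Python rather than as an actual database view. The rows are the ones from the original comment; the function name, the dictionary keys, and the tuple layout are hypothetical. The thread does not say what the last column (770) means, so it is carried along as an opaque value. Nothing is lost: every ENSG ID is kept in a list on the collapsed row.

```python
from collections import defaultdict

# Rows copied from the genepheno list in the original comment.
# Layout assumed here: (HPO id, phenotype name, "ENSG:symbol", value).
ROWS = [
    ("HP:0001370", "Rheumatoid arthritis", "ENSG00000204490:TNF", 770),
    ("HP:0001370", "Rheumatoid arthritis", "ENSG00000206439:TNF", 770),
    ("HP:0001370", "Rheumatoid arthritis", "ENSG00000223952:TNF", 770),
    ("HP:0001370", "Rheumatoid arthritis", "ENSG00000228321:TNF", 770),
    ("HP:0001370", "Rheumatoid arthritis", "ENSG00000228849:TNF", 770),
    ("HP:0001370", "Rheumatoid arthritis", "ENSG00000228978:TNF", 770),
    ("HP:0001370", "Rheumatoid arthritis", "ENSG00000230108:TNF", 770),
    ("HP:0001370", "Rheumatoid arthritis", "ENSG00000232810:TNF", 770),
]


def collapse_by_symbol(rows):
    """Group rows that differ only in their ENSG ID into one row per gene symbol."""
    grouped = defaultdict(list)
    for hpo_id, pheno_name, gene, value in rows:
        # "ENSG00000204490:TNF" -> ("ENSG00000204490", "TNF")
        ensembl_id, symbol = gene.split(":", 1)
        grouped[(hpo_id, pheno_name, symbol, value)].append(ensembl_id)
    return [
        {
            "hpo_id": hpo_id,
            "pheno_name": pheno_name,
            "symbol": symbol,
            "value": value,
            "ensembl_ids": sorted(ids),  # all original IDs are retained
        }
        for (hpo_id, pheno_name, symbol, value), ids in grouped.items()
    ]


if __name__ == "__main__":
    for row in collapse_by_symbol(ROWS):
        print(row["hpo_id"], row["symbol"], len(row["ensembl_ids"]), "ENSG IDs")
```

With the rows above this yields a single HP:0001370 / TNF entry carrying all eight ENSG IDs, so the GP count drops while the underlying table stays untouched.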

gbgbg commented 9 years ago

@colossus: it is important for you to understand why this is happening - the crux lies in the genomics. There is one clue in my notes from tonight. Another clue, from a TNF query of the UCSC browser, is pasted below. A third comes from the list of hg19 chromosomes: http://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=13095744&chromInfoPage=

Understanding this will give you the right way to collate duplicate gene entities (at the gene level). Let me know if you cannot figure this out.

TNF (uc011jjy.2) at chr6_ssto_hap7:2875093-2876913 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011jjx.3) at chr6_ssto_hap7:2874145-2876913 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011iol.2) at chr6_qbl_hap6:2837932-2839752 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011iok.3) at chr6_qbl_hap6:2836982-2839752 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011hrb.2) at chr6_mcf_hap5:2923998-2925818 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011hra.3) at chr6_mcf_hap5:2923050-2925818 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011gyl.2) at chr6_mann_hap4:2887176-2888996 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011gyk.3) at chr6_mann_hap4:2886228-2888996 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011gcc.2) at chr6_dbb_hap3:2829833-2831653 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011gcb.3) at chr6_dbb_hap3:2828885-2831653 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011fef.2) at chr6_cox_hap2:3053908-3055728 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011fee.3) at chr6_cox_hap2:3052960-3055728 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011emr.2) at chr6_apd_hap1:2859027-2860830 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc003nuj.3) at chr6:31544292-31546112 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc003nui.4) at chr6:31543344-31546112 - Homo sapiens tumor necrosis factor (TNF), mRNA.

-Gill

On Aug 27, 2015, at 7:48 AM, Colossus notifications@github.com wrote:


amwenger commented 9 years ago

(looks like @gbgbg and I are processing this in parallel)

For the non-genomicists:

The TNF gene is present in the region of chr6 that has multiple representations in the hg19 assembly. You see 8 different ENSG identifiers for TNF, 1 for the version of chr6 proper and 7 for the versions on each of the 7 chr6 haplotype chromosomes.

When you pick a canonical identifier, please pick the version from the proper chromosome, not the haplotype chromosomes.
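A rough sketch of that rule in Python, assuming each ENSG ID can be paired with the hg19 chromosome name it sits on. The thread does not say which ENSG ID belongs to which chromosome, so the IDs below are placeholders; the chromosome names are the ones from the UCSC listing earlier in this thread. The function names and the fallback behaviour are also assumptions, not anything from the project.

```python
import re

# Primary hg19 chromosomes: chr1..chr22, chrX, chrY, chrM.
# Anything else (e.g. chr6_cox_hap2) is treated as non-primary.
PRIMARY_CHROM = re.compile(r"^chr([0-9]+|X|Y|M)$")

# Placeholder IDs: the real ENSG-to-chromosome mapping is not given in the thread.
EXAMPLE = [
    ("ENSG_PLACEHOLDER_HAP1", "chr6_apd_hap1"),
    ("ENSG_PLACEHOLDER_HAP2", "chr6_cox_hap2"),
    ("ENSG_PLACEHOLDER_HAP3", "chr6_dbb_hap3"),
    ("ENSG_PLACEHOLDER_HAP4", "chr6_mann_hap4"),
    ("ENSG_PLACEHOLDER_HAP5", "chr6_mcf_hap5"),
    ("ENSG_PLACEHOLDER_HAP6", "chr6_qbl_hap6"),
    ("ENSG_PLACEHOLDER_HAP7", "chr6_ssto_hap7"),
    ("ENSG_PLACEHOLDER_PRIMARY", "chr6"),
]


def is_primary_chrom(chrom):
    """True for chromosome names like chr6, False for haplotype names like chr6_cox_hap2."""
    return PRIMARY_CHROM.match(chrom) is not None


def pick_canonical(candidates):
    """Pick one ENSG ID from (ensembl_id, chrom) pairs, preferring the proper chromosome.

    If no candidate is on a primary chromosome, fall back to the
    lexicographically smallest ID (an arbitrary but stable choice).
    """
    if not candidates:
        raise ValueError("no candidate IDs given")
    primary = sorted(eid for eid, chrom in candidates if is_primary_chrom(chrom))
    if primary:
        return primary[0]
    return min(eid for eid, _ in candidates)


if __name__ == "__main__":
    print(pick_canonical(EXAMPLE))
```

The other seven IDs can then be linked to the chosen one as aliases rather than dropped.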

gbgbg commented 9 years ago

It is slightly worse than parallel - @collosus, this is in my notes from 2am. Please go over those first if you can. If you find them too cryptic I can dump them all as issues on DDG. I thought some belonged in dashboard git but couldn't find such an entity in either hazyresearch's or @netj's repos. I may have missed it.

Colossus commented 9 years ago

Gill, you should start writing my name like "@Colossus", otherwise this other guy is gonna constantly get messages from our threads