Closed Colossus closed 9 years ago
Agreed: The entity link here (deciding to link to a single one) is easy. I'd assume there is some meaning to that field, and I'd hate to lose information... however, you can produce a view that collapses them
@colossus: it is important for you to understand why this is happening - the crux lies in the genomics. There is one clue in my notes from tonight. Another clue, from a TNF query of the UCSC browser, is pasted below. A third comes from the list of hg19 chromosomes: http://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=13095744&chromInfoPage=
Understanding this will give you the right way to collate duplicate gene entities (at the gene level). Let me know if you cannot figure this out.
TNF (uc011jjy.2) at chr6_ssto_hap7:2875093-2876913 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011jjx.3) at chr6_ssto_hap7:2874145-2876913 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011iol.2) at chr6_qbl_hap6:2837932-2839752 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011iok.3) at chr6_qbl_hap6:2836982-2839752 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011hrb.2) at chr6_mcf_hap5:2923998-2925818 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011hra.3) at chr6_mcf_hap5:2923050-2925818 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011gyl.2) at chr6_mann_hap4:2887176-2888996 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011gyk.3) at chr6_mann_hap4:2886228-2888996 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011gcc.2) at chr6_dbb_hap3:2829833-2831653 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011gcb.3) at chr6_dbb_hap3:2828885-2831653 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011fef.2) at chr6_cox_hap2:3053908-3055728 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011fee.3) at chr6_cox_hap2:3052960-3055728 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc011emr.2) at chr6_apd_hap1:2859027-2860830 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc003nuj.3) at chr6:31544292-31546112 - Homo sapiens tumor necrosis factor (TNF), mRNA.
TNF (uc003nui.4) at chr6:31543344-31546112 - Homo sapiens tumor necrosis factor (TNF), mRNA.
-Gill
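For anyone following along, here is a rough sketch of how the UCSC hits above could be grouped by chromosome name. The chromosome names are copied from the listing. The regular expression is an assumption about how hg19 names its alternate-haplotype sequences (`chr6_<name>_hap<N>`), and `is_haplotype` is a hypothetical helper, not something from our codebase.

```python
import re
from collections import Counter

# Chromosome names copied from the UCSC TNF hits above (one entry per transcript).
HIT_CHROMS = [
    "chr6_ssto_hap7", "chr6_ssto_hap7",
    "chr6_qbl_hap6", "chr6_qbl_hap6",
    "chr6_mcf_hap5", "chr6_mcf_hap5",
    "chr6_mann_hap4", "chr6_mann_hap4",
    "chr6_dbb_hap3", "chr6_dbb_hap3",
    "chr6_cox_hap2", "chr6_cox_hap2",
    "chr6_apd_hap1",
    "chr6", "chr6",
]

# Assumption: alternate-haplotype sequences are named chr<N>_<name>_hap<M>.
HAPLOTYPE_RE = re.compile(r"^chr[0-9XYM]+_\w+_hap\d+$")


def is_haplotype(chrom):
    """Return True if the chromosome name looks like an alternate haplotype."""
    return HAPLOTYPE_RE.match(chrom) is not None


# Count transcripts per chromosome, then split the names into two groups.
counts = Counter(HIT_CHROMS)
primary = sorted(c for c in counts if not is_haplotype(c))
haplotypes = sorted(c for c in counts if is_haplotype(c))

print("primary:", primary)
print("haplotypes:", haplotypes)
```

Under that naming assumption, the same locus shows up on one primary chromosome plus several haplotype sequences, which is where the duplicate gene entries come from.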
On Aug 27, 2015, at 7:48 AM, Colossus notifications@github.com wrote:
E.g., in our genepheno association list, we have:
HP:0001370 Rheumatoid arthritis ENSG00000204490:TNF 770
HP:0001370 Rheumatoid arthritis ENSG00000206439:TNF 770
HP:0001370 Rheumatoid arthritis ENSG00000223952:TNF 770
HP:0001370 Rheumatoid arthritis ENSG00000228321:TNF 770
HP:0001370 Rheumatoid arthritis ENSG00000228849:TNF 770
HP:0001370 Rheumatoid arthritis ENSG00000228978:TNF 770
HP:0001370 Rheumatoid arthritis ENSG00000230108:TNF 770
HP:0001370 Rheumatoid arthritis ENSG00000232810:TNF 770
(This is kinda the worst case though.)
We should get rid of those. It will presumably cut down our GP count a bit, but these different ENSG identifiers are just not useful. Perhaps pick one representative ENSG identifier and link it to all other ENSG IDs for the same gene, if that's possible.
(looks like @gbgbg and I are processing this in parallel)
For the non-genomicists:
The TNF gene is present in the region of chr6 that has multiple representations in the hg19 assembly. You see 8 different ENSG identifiers for TNF, 1 for the version of chr6 proper and 7 for the versions on each of the 7 chr6 haplotype chromosomes.
When you pick a canonical identifier, please pick the version from the proper chromosome, not the haplotype chromosomes.
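A minimal sketch of what that canonical pick might look like, assuming we have a table of (ENSG ID, gene symbol, chromosome) rows. The ENSG IDs below are hypothetical placeholders, not the real TNF identifiers, and `is_primary`, `build_canonical_map`, and `collapse_associations` are names made up for illustration. It also assumes that grouping by gene symbol is good enough, which may not hold for every gene.

```python
from collections import defaultdict


def is_primary(chrom):
    # Assumption: in hg19, primary chromosomes have no underscore in their
    # name (chr6), while haplotype and unplaced sequences do (chr6_cox_hap2).
    return "_" not in chrom


def build_canonical_map(gene_locations):
    """Map every ENSG ID to one representative ID per gene symbol.

    gene_locations: iterable of (ensg_id, symbol, chrom) tuples.
    The representative is the ID on a primary chromosome when there is one;
    otherwise fall back to the lexicographically smallest ID.
    """
    by_symbol = defaultdict(list)
    for ensg_id, symbol, chrom in gene_locations:
        by_symbol[symbol].append((ensg_id, chrom))

    canonical = {}
    for entries in by_symbol.values():
        primaries = sorted(e for e, c in entries if is_primary(c))
        representative = primaries[0] if primaries else min(e for e, _ in entries)
        for ensg_id, _ in entries:
            canonical[ensg_id] = representative
    return canonical


def collapse_associations(associations, canonical):
    """Rewrite (hpo_id, ensg_id) pairs to canonical IDs and drop duplicates."""
    seen = set()
    collapsed = []
    for hpo_id, ensg_id in associations:
        key = (hpo_id, canonical.get(ensg_id, ensg_id))
        if key not in seen:
            seen.add(key)
            collapsed.append(key)
    return collapsed


# Hypothetical placeholder IDs, for illustration only.
EXAMPLE_LOCATIONS = [
    ("ENSG_HAP_A", "TNF", "chr6_cox_hap2"),
    ("ENSG_HAP_B", "TNF", "chr6_qbl_hap6"),
    ("ENSG_PRIMARY", "TNF", "chr6"),
]
EXAMPLE_ASSOCIATIONS = [
    ("HP:0001370", "ENSG_HAP_A"),
    ("HP:0001370", "ENSG_HAP_B"),
    ("HP:0001370", "ENSG_PRIMARY"),
]

canonical = build_canonical_map(EXAMPLE_LOCATIONS)
collapsed = collapse_associations(EXAMPLE_ASSOCIATIONS, canonical)
print(collapsed)
```

Keeping the full `canonical` mapping around, rather than deleting the haplotype IDs, means no information is lost and the collapsed list can be produced as a view on top of the original rows.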
It is slightly worse than parallel - @collosus, this is in my notes from 2am. Please go over those first if you can. If you find them too cryptic I can dump them all as issues on DDG. I thought some belonged in dashboard git but couldn't find such an entity in either hazyresearch's or @netj's repos. I may have missed it.
Gill, you should start writing my name like "@Colossus", otherwise this other guy is gonna constantly get messages from our threads