cognoma / genes

Genes for Project Cognoma

Genes with multiple chromosomes #2

Closed dhimmel closed 7 years ago

dhimmel commented 7 years ago

What does it mean for a gene to have multiple chromosomes? Here are all the genes from genes.tsv that exhibited multiple chromosomes:

| entrez_gene_id | symbol | description | chromosome | gene_type | synonyms |
|---|---|---|---|---|---|
| 263 | AMD1P2 | adenosylmethionine decarboxylase 1 pseudogene 2 | X Y | pseudo | AMD AMD2 AMDP1 AMDPX AMDPY |
| 293 | SLC25A6 | solute carrier family 25 member 6 | X Y | protein-coding | AAC3 ANT ANT 2 ANT 3 ANT3 ANT3Y |
| 438 | ASMT | acetylserotonin O-methyltransferase | X Y | protein-coding | ASMTY HIOMT HIOMTY |
| 1438 | CSF2RA | colony stimulating factor 2 receptor alpha subunit | X Y | protein-coding | CD116 CDw116 CSF2R CSF2RAX CSF2RAY CSF2RX CSF2RY GM-CSF-R-alpha GMCSFR GMR SMDP4 |
| 3563 | IL3RA | interleukin 3 receptor subunit alpha | X Y | protein-coding | CD123 IL3R IL3RAY IL3RX IL3RY hIL-3Ra |
| 3581 | IL9R | interleukin 9 receptor | X Y | protein-coding | CD129 IL-9R |
| 4267 | CD99 | CD99 molecule | X Y | protein-coding | HBA71 MIC2 MIC2X MIC2Y MSK5X |
| 6473 | SHOX | short stature homeobox | X Y | protein-coding | GCFX PHOG SHOXY SS |
| 6845 | VAMP7 | vesicle associated membrane protein 7 | X Y | protein-coding | SYBL1 TI-VAMP TIVAMP VAMP-7 |
| 7501 | XGR | XG and CD99 regulator | X Y | other | YG |
| 8225 | GTPBP6 | GTP binding protein 6 (putative) | X Y | protein-coding | PGPL |
| 8227 | AKAP17A | A-kinase anchoring protein 17A | X Y | protein-coding | 721P AKAP-17A CCDC133 CXYorf3 DXYS155E PRKA17A SFRS17A XE7 XE7Y |
| 8623 | ASMTL | acetylserotonin O-methyltransferase-like | X Y | protein-coding | ASMTLX ASMTLY ASTML |
| 9189 | ZBED1 | zinc finger BED-type containing 1 | X Y | protein-coding | ALTE DREF TRAMP hDREF |
| 10251 | SPRY3 | sprouty RTK signaling antagonist 3 | X Y | protein-coding | spry-3 |
| 28227 | PPP2R3B | protein phosphatase 2 regulatory subunit B''beta | X Y | protein-coding | NYREN8 PPP2R3L PPP2R3LY PR48 |
| 55344 | PLCXD1 | phosphatidylinositol specific phospholipase C X domain containing 1 | X Y | protein-coding | LL0XNC01-136G2.1 |
| 64109 | CRLF2 | cytokine receptor-like factor 2 | X Y | protein-coding | CRL2 CRLF2Y TSLPR |
| 80161 | ASMTL-AS1 | ASMTL antisense RNA 1 | X Y | ncRNA | ASMTL-AS ASMTLAS CXYorf2 NCRNA00105 |
| 207063 | DHRSX | dehydrogenase/reductase X-linked | X Y | protein-coding | CXorf11 DHRS5X DHRS5Y DHRSXY DHRSY SDR46C1 SDR7C6 |
| 283981 | LINC00685 | long intergenic non-protein coding RNA 685 | X Y | ncRNA | CXYorf10 NCRNA00107 PPP2R3B-AS1 |
| 286530 | P2RY8 | purinergic receptor P2Y8 | X Y | protein-coding | P2Y8 |
| 401577 | CD99P1 | CD99 molecule pseudogene 1 | X Y | pseudo | CD99L1 CXYorf12 MIC2R NCRNA00103 |
| 442442 | RPL14P5 | ribosomal protein L14 pseudogene 5 | X Y | pseudo | |
| 619538 | OMS | otitis media, susceptibility to | 10 19 3 | unknown | COME/ROM |
| 644218 | TRPC6P | transient receptor potential cation channel subfamily C member 6, pseudogene | X Y | pseudo | TRPC6L |
| 652608 | LOC652608 | 60S ribosomal protein L6-like | X Y | pseudo | |
| 653440 | WASH6P | WAS protein family homolog 6 pseudogene | X Y | pseudo | CXYorf1 FAM39A WASH |
| 727856 | DDX11L16 | DEAD/H-box helicase 11 like 16 | X Y | pseudo | |
| 751580 | LINC00106 | long intergenic non-protein coding RNA 106 | X Y | ncRNA | CXYorf8 NCRNA00106 |
| 100128260 | WASIR1 | WASH and IL9R antisense RNA 1 | X Y | ncRNA | NCRNA00286B |
| 100287692 | TCEB1P24 | transcription elongation factor B subunit 1 pseudogene 24 | X Y | pseudo | TCEB1P25 |
| 100359394 | LINC00102 | long intergenic non-protein coding RNA 102 | X Y | ncRNA | NCRNA00102 |
| 100418703 | LOC100418703 | repetin pseudogene | X Y | pseudo | |
| 100500894 | MIR3690 | microRNA 3690 | X Y | ncRNA | MIR3690-1 MIR3690-2 hsa-mir-3690-1 hsa-mir-3690-2 mir-3690-1 mir-3690-2 |
| 101928032 | LOC101928032 | uncharacterized LOC101928032 | X Y | ncRNA | |
| 101928055 | LOC101928055 | uncharacterized LOC101928055 | X Y | ncRNA | |
| 101928070 | LOC101928070 | uncharacterized LOC101928070 | X Y | ncRNA | |
| 101928092 | LOC101928092 | uncharacterized LOC101928092 | X Y | ncRNA | |
| 102464837 | MIR6089 | microRNA 6089 | X Y | ncRNA | MIR6089-1 MIR6089-2 hsa-mir-6089-1 hsa-mir-6089-2 |
| 102724521 | LOC102724521 | uncharacterized LOC102724521 | X Y | ncRNA | |
| 102725051 | LOC102725051 | uncharacterized LOC102725051 | 1 Un | ncRNA | |
| 105373102 | LOC105373102 | uncharacterized LOC105373102 | X Y | protein-coding | |
| 105373105 | LOC105373105 | uncharacterized LOC105373105 | X Y | ncRNA | |
| 105379413 | LOC105379413 | uncharacterized LOC105379413 | X Y | ncRNA | |
| 105379414 | LOC105379414 | uncharacterized LOC105379414 | X Y | ncRNA | |
| 105379561 | LOC105379561 | uncharacterized LOC105379561 | 8 Un | protein-coding | |
| 106478924 | DHRSX-IT1 | DHRSX intronic transcript 1 | X Y | ncRNA | DHRSX-IT DHRSXIT1 |
| 106478926 | DPH3P2 | diphthamide biosynthesis 3 pseudogene 2 | X Y | pseudo | |
| 106480712 | FABP5P13 | fatty acid binding protein 5 pseudogene 13 | X Y | pseudo | FABP5L13 |
| 106480770 | RNA5SP498 | RNA, 5S ribosomal pseudogene 498 | X Y | pseudo | RN5S498 |
| 107985637 | LOC107985637 | uncharacterized LOC107985637 | X Y | ncRNA | |
| 107985677 | LOC107985677 | uncharacterized LOC107985677 | X Y | ncRNA | |
| 107985697 | LOC107985697 | uncharacterized LOC107985697 | X Y | ncRNA | |
| 107985706 | LOC107985706 | uncharacterized LOC107985706 | X Y | ncRNA | |

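
For reference, a minimal sketch of how rows like these could be pulled out of a gene table. It assumes the chromosome column joins multiple values with `|` (the X|Y notation used elsewhere in this thread); the function name and the inline sample are illustrative only and are not taken from the repository.

```python
import csv
import io


def multi_chromosome_genes(tsv_text, sep="|"):
    """Return rows whose chromosome field lists more than one chromosome."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [row for row in reader if sep in row["chromosome"]]


# Tiny illustrative sample; the real input would be the contents of genes.tsv.
sample = (
    "entrez_gene_id\tsymbol\tchromosome\tgene_type\n"
    "293\tSLC25A6\tX|Y\tprotein-coding\n"
    "619538\tOMS\t10|19|3\tunknown\n"
    "7157\tTP53\t17\tprotein-coding\n"
)

for row in multi_chromosome_genes(sample):
    print(row["symbol"], row["chromosome"])
```
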
dhimmel commented 7 years ago

Do we want to split these into multiple records when creating chromosome-symbol-mapper.tsv?

cgreene commented 7 years ago

The X|Y ones are in the pseudoautosomal regions of the X and Y chromosomes. I would not be worried about those and would not split them. These should be retained.

OMS looks like a susceptibility "gene." It's not really a molecular entity, just a set of association signal regions: https://www.ncbi.nlm.nih.gov/gene/?term=619538 . This could be dropped.

The others appear to be on unplaced scaffolds: https://www.ncbi.nlm.nih.gov/gene/?term=105379561 For now, I would probably drop these too, though it's not as clear that these should be dropped as it is for something like OMS.

By the way - if someone picks a gene on the X or Y chromosomes other than those in the X|Y set, you may want to automatically detect it and build separate male and female classifiers. This is a strong signal in expression data, even for unsupervised learning.
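
One way the detection step might look, as a rough sketch only: flag a gene when its chromosome field names exactly one sex chromosome, so the X|Y (pseudoautosomal) genes are not flagged. The `|` separator and the function name are assumptions; what to do with the flag (for example, training separate classifiers) is left out.

```python
SEX_CHROMOSOMES = {"X", "Y"}


def on_single_sex_chromosome(chromosome, sep="|"):
    """True for genes on X only or Y only; False for X|Y genes and autosomes."""
    chromosomes = set(chromosome.split(sep))
    return len(chromosomes) == 1 and chromosomes <= SEX_CHROMOSOMES


for value in ["X", "Y", "X|Y", "17"]:
    print(value, on_single_sex_chromosome(value))
```
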

dhimmel commented 7 years ago

> The X|Y ones are in the pseudoautosomal regions of the X and Y chromosomes. I would not be worried about those and would not split them. These should be retained.

@cgreene, we're including this file to map PANCAN_mutation (as has been done by https://github.com/cognoma/cancer-data/pull/12). I therefore looked at how several of the pseudoautosomal genes were coded in that dataset:

sample chr start end reference alt gene effect DNA_VAF RNA_VAF Amino_Acid_Change
TCGA-BH-A18P-01 chrX 1508405 1508405 G A SLC25A6 Silent p.F109
TCGA-06-5416-01 chrX 1746629 1746629 C T ASMT Silent 0.276457883369
TCGA-CD-A4MI-01 chrX 1413254 1413254 G A CSF2RA Missense_Mutation p.R227H

So unless we split chromosomes, these genes will not map. I propose splitting with an optional step to include the unsplit rows. Therefore the top row would yield:

| entrez_gene_id | symbol | description | chromosome | gene_type | synonyms |
|---|---|---|---|---|---|
| 263 | AMD1P2 | adenosylmethionine decarboxylase 1 pseudogene 2 | X | pseudo | AMD AMD2 AMDP1 AMDPX AMDPY |
| 263 | AMD1P2 | adenosylmethionine decarboxylase 1 pseudogene 2 | Y | pseudo | AMD AMD2 AMDP1 AMDPX AMDPY |
| 263 | AMD1P2 | adenosylmethionine decarboxylase 1 pseudogene 2 | X Y | pseudo | AMD AMD2 AMDP1 AMDPX AMDPY |
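
The proposed split could be sketched roughly as follows. This assumes the chromosome field is `|`-delimited; `split_chromosomes` and its `keep_unsplit` flag are hypothetical names standing in for the "optional step to include the unsplit rows", not the code that was committed.

```python
def split_chromosomes(row, sep="|", keep_unsplit=True):
    """Expand one gene record into one record per chromosome.

    When keep_unsplit is True and the gene lists several chromosomes,
    the original combined record is appended as well.
    """
    chromosomes = row["chromosome"].split(sep)
    records = [{**row, "chromosome": chrom} for chrom in chromosomes]
    if keep_unsplit and len(chromosomes) > 1:
        records.append(dict(row))
    return records


gene = {"entrez_gene_id": "263", "symbol": "AMD1P2", "chromosome": "X|Y"}
for record in split_chromosomes(gene):
    print(record["symbol"], record["chromosome"])
```

With `keep_unsplit=False` only the X and Y records would be produced, which corresponds to dropping the last row above.
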

Do you think we should even keep the last row?

dhimmel commented 7 years ago

> By the way - if someone picks a gene on the X or Y chromosomes other than those in the X|Y set, you may want to automatically detect it and build separate male and female classifiers. This is a strong signal in expression data, even for unsupervised learning.

May want to open an issue in machine-learning.

dhimmel commented 7 years ago

> The others appear to be on unplaced scaffolds. For now, I would probably drop these too, though it's not as clear that these should be dropped as it is for something like OMS.

Okay, leaving these in will in effect drop them, because the resource being mapped won't have that symbol-chromosome combination. No need to explicitly filter.
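
To illustrate why no explicit filter is needed, here is a small sketch of a lookup keyed on (symbol, chromosome). The field names follow the tables in this thread; stripping the `chr` prefix, the function names, and the second mutation record are assumptions made for the example.

```python
def strip_chr_prefix(chrom):
    """Turn 'chrX' into 'X'; leave other values unchanged."""
    return chrom[3:] if chrom.startswith("chr") else chrom


def map_mutations(mutations, mapper_rows):
    """Attach entrez_gene_id by (symbol, chromosome); unmatched mutations are dropped."""
    lookup = {
        (row["symbol"], row["chromosome"]): row["entrez_gene_id"]
        for row in mapper_rows
    }
    mapped = []
    for mutation in mutations:
        key = (mutation["gene"], strip_chr_prefix(mutation["chr"]))
        if key in lookup:
            mapped.append({**mutation, "entrez_gene_id": lookup[key]})
    return mapped


mapper_rows = [
    {"entrez_gene_id": "293", "symbol": "SLC25A6", "chromosome": "X"},
    {"entrez_gene_id": "293", "symbol": "SLC25A6", "chromosome": "Y"},
    # A mapper row on an unplaced scaffold: harmless, since nothing refers to it.
    {"entrez_gene_id": "105379561", "symbol": "LOC105379561", "chromosome": "Un"},
]
mutations = [
    {"sample": "TCGA-BH-A18P-01", "chr": "chrX", "gene": "SLC25A6"},
    # Hypothetical record whose symbol is absent from the mapper.
    {"sample": "hypothetical-sample", "chr": "chr1", "gene": "NOT_IN_MAPPER"},
]

for mutation in map_mutations(mutations, mapper_rows):
    print(mutation["sample"], mutation["gene"], mutation["entrez_gene_id"])
```
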

cgreene commented 7 years ago

@dhimmel: for the purposes of having a resource to connect potential symbols with chromosomes, I think that retaining at least the first two lines would make the most sense. Maybe the third too; I don't know how many resources use X|Y for these regions. I don't see the harm in it, so my inclination would be to leave it in as well.

dhimmel commented 7 years ago

@cgreene in b64fcb4261005cf717d5d5ef4e03540e4a1f361e I retained all three lines.

However, there is another issue -- some genes have no chromosome. For example:

These genes all have type unknown, so I'm guessing the inability to map them will not be a big deal. In fact they most likely won't be in our datasets?

cgreene commented 7 years ago

These - to my knowledge - come from the expectation that there exists a gene for the disease but nobody has found it. They aren't really meaningful molecular entities, and I expect that you won't see them in practice.