Genes with multiple chromosomes

dhimmel commented 7 years ago

What does it mean for a gene to have multiple chromosomes? Here are all the genes from genes.tsv that exhibited multiple chromosomes:

entrez_gene_id	symbol	description	chromosome	gene_type	synonyms
263	AMD1P2	adenosylmethionine decarboxylase 1 pseudogene 2	X|Y	pseudo	AMD|AMD2|AMDP1|AMDPX|AMDPY
293	SLC25A6	solute carrier family 25 member 6	X|Y	protein-coding	AAC3|ANT|ANT 2|ANT 3|ANT3|ANT3Y
438	ASMT	acetylserotonin O-methyltransferase	X|Y	protein-coding	ASMTY|HIOMT|HIOMTY
1438	CSF2RA	colony stimulating factor 2 receptor alpha subunit	X|Y	protein-coding	CD116|CDw116|CSF2R|CSF2RAX|CSF2RAY|CSF2RX|CSF2RY|GM-CSF-R-alpha|GMCSFR|GMR|SMDP4
3563	IL3RA	interleukin 3 receptor subunit alpha	X|Y	protein-coding	CD123|IL3R|IL3RAY|IL3RX|IL3RY|hIL-3Ra
3581	IL9R	interleukin 9 receptor	X|Y	protein-coding	CD129|IL-9R
4267	CD99	CD99 molecule	X|Y	protein-coding	HBA71|MIC2|MIC2X|MIC2Y|MSK5X
6473	SHOX	short stature homeobox	X|Y	protein-coding	GCFX|PHOG|SHOXY|SS
6845	VAMP7	vesicle associated membrane protein 7	X|Y	protein-coding	SYBL1|TI-VAMP|TIVAMP|VAMP-7
7501	XGR	XG and CD99 regulator	X|Y	other	YG
8225	GTPBP6	GTP binding protein 6 (putative)	X|Y	protein-coding	PGPL
8227	AKAP17A	A-kinase anchoring protein 17A	X|Y	protein-coding	721P|AKAP-17A|CCDC133|CXYorf3|DXYS155E|PRKA17A|SFRS17A|XE7|XE7Y
8623	ASMTL	acetylserotonin O-methyltransferase-like	X|Y	protein-coding	ASMTLX|ASMTLY|ASTML
9189	ZBED1	zinc finger BED-type containing 1	X|Y	protein-coding	ALTE|DREF|TRAMP|hDREF
10251	SPRY3	sprouty RTK signaling antagonist 3	X|Y	protein-coding	spry-3
28227	PPP2R3B	protein phosphatase 2 regulatory subunit B''beta	X|Y	protein-coding	NYREN8|PPP2R3L|PPP2R3LY|PR48
55344	PLCXD1	phosphatidylinositol specific phospholipase C X domain containing 1	X|Y	protein-coding	LL0XNC01-136G2.1
64109	CRLF2	cytokine receptor-like factor 2	X|Y	protein-coding	CRL2|CRLF2Y|TSLPR
80161	ASMTL-AS1	ASMTL antisense RNA 1	X|Y	ncRNA	ASMTL-AS|ASMTLAS|CXYorf2|NCRNA00105
207063	DHRSX	dehydrogenase/reductase X-linked	X|Y	protein-coding	CXorf11|DHRS5X|DHRS5Y|DHRSXY|DHRSY|SDR46C1|SDR7C6
283981	LINC00685	long intergenic non-protein coding RNA 685	X|Y	ncRNA	CXYorf10|NCRNA00107|PPP2R3B-AS1
286530	P2RY8	purinergic receptor P2Y8	X|Y	protein-coding	P2Y8
401577	CD99P1	CD99 molecule pseudogene 1	X|Y	pseudo	CD99L1|CXYorf12|MIC2R|NCRNA00103
442442	RPL14P5	ribosomal protein L14 pseudogene 5	X|Y	pseudo
619538	OMS	otitis media, susceptibility to	10|19|3	unknown	COME/ROM
644218	TRPC6P	transient receptor potential cation channel subfamily C member 6, pseudogene	X|Y	pseudo	TRPC6L
652608	LOC652608	60S ribosomal protein L6-like	X|Y	pseudo
653440	WASH6P	WAS protein family homolog 6 pseudogene	X|Y	pseudo	CXYorf1|FAM39A|WASH
727856	DDX11L16	DEAD/H-box helicase 11 like 16	X|Y	pseudo
751580	LINC00106	long intergenic non-protein coding RNA 106	X|Y	ncRNA	CXYorf8|NCRNA00106
100128260	WASIR1	WASH and IL9R antisense RNA 1	X|Y	ncRNA	NCRNA00286B
100287692	TCEB1P24	transcription elongation factor B subunit 1 pseudogene 24	X|Y	pseudo	TCEB1P25
100359394	LINC00102	long intergenic non-protein coding RNA 102	X|Y	ncRNA	NCRNA00102
100418703	LOC100418703	repetin pseudogene	X|Y	pseudo
100500894	MIR3690	microRNA 3690	X|Y	ncRNA	MIR3690-1|MIR3690-2|hsa-mir-3690-1|hsa-mir-3690-2|mir-3690-1|mir-3690-2
101928032	LOC101928032	uncharacterized LOC101928032	X|Y	ncRNA
101928055	LOC101928055	uncharacterized LOC101928055	X|Y	ncRNA
101928070	LOC101928070	uncharacterized LOC101928070	X|Y	ncRNA
101928092	LOC101928092	uncharacterized LOC101928092	X|Y	ncRNA
102464837	MIR6089	microRNA 6089	X|Y	ncRNA	MIR6089-1|MIR6089-2|hsa-mir-6089-1|hsa-mir-6089-2
102724521	LOC102724521	uncharacterized LOC102724521	X|Y	ncRNA
102725051	LOC102725051	uncharacterized LOC102725051	1|Un	ncRNA
105373102	LOC105373102	uncharacterized LOC105373102	X|Y	protein-coding
105373105	LOC105373105	uncharacterized LOC105373105	X|Y	ncRNA
105379413	LOC105379413	uncharacterized LOC105379413	X|Y	ncRNA
105379414	LOC105379414	uncharacterized LOC105379414	X|Y	ncRNA
105379561	LOC105379561	uncharacterized LOC105379561	8|Un	protein-coding
106478924	DHRSX-IT1	DHRSX intronic transcript 1	X|Y	ncRNA	DHRSX-IT|DHRSXIT1
106478926	DPH3P2	diphthamide biosynthesis 3 pseudogene 2	X|Y	pseudo
106480712	FABP5P13	fatty acid binding protein 5 pseudogene 13	X|Y	pseudo	FABP5L13
106480770	RNA5SP498	RNA, 5S ribosomal pseudogene 498	X|Y	pseudo	RN5S498
107985637	LOC107985637	uncharacterized LOC107985637	X|Y	ncRNA
107985677	LOC107985677	uncharacterized LOC107985677	X|Y	ncRNA
107985697	LOC107985697	uncharacterized LOC107985697	X|Y	ncRNA
107985706	LOC107985706	uncharacterized LOC107985706	X|Y	ncRNA
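
For reference, a minimal sketch of how such rows could be pulled out. It assumes genes.tsv is tab-delimited and that multiple chromosomes are stored as a pipe-delimited string in the chromosome column (for example X|Y). The inline sample and the helper name `multi_chromosome_genes` are illustrative only and are not code from this repository.

```python
import csv
import io

# Illustrative excerpt (assumption); in practice this would be the text of genes.tsv.
SAMPLE_TSV = (
    "entrez_gene_id\tsymbol\tchromosome\tgene_type\n"
    "293\tSLC25A6\tX|Y\tprotein-coding\n"
    "619538\tOMS\t10|19|3\tunknown\n"
    "7157\tTP53\t17\tprotein-coding\n"
)


def multi_chromosome_genes(tsv_text):
    """Return the rows whose chromosome field lists more than one chromosome."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if "|" in row["chromosome"]]


for row in multi_chromosome_genes(SAMPLE_TSV):
    print(row["symbol"], row["chromosome"])
```

Reading with `csv.DictReader` keeps the sketch dependency-free; the same filter could be written as a one-line string match in a dataframe library.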

dhimmel commented 7 years ago

Do we want to split these into multiple records when creating chromosome-symbol-mapper.tsv?

cgreene commented 7 years ago

The X|Y ones are in the pseudoautosomal regions of the X and Y chromosomes. I would not be worried about those and would not split them. These should be retained.

OMS looks like a susceptibility "gene." It's not really a molecular entity, just a set of association signal regions: https://www.ncbi.nlm.nih.gov/gene/?term=619538 . This could be dropped.

The others appear to be on unplaced scaffolds (for example https://www.ncbi.nlm.nih.gov/gene/?term=105379561). For now, I would probably drop these too, though the case for dropping them is not as clear as it is for something like OMS.

By the way - if someone picks a gene on the X or Y chromosomes other than those in the X|Y set, you may want to automatically detect it and build separate male and female classifiers. This is a strong signal in expression data, even for unsupervised learning.

dhimmel commented 7 years ago

> The X|Y ones are in the pseudoautosomal regions of the X and Y chromosomes. I would not be worried about those and would not split them. These should be retained.

@cgreene, we're including this file to map PANCAN_mutation (as has been done by https://github.com/cognoma/cancer-data/pull/12). Therefore I looked at how several of the pseudoautosomal genes are coded in that dataset:

sample	chr	start	end	reference	alt	gene	effect	DNA_VAF	RNA_VAF	Amino_Acid_Change
TCGA-BH-A18P-01	chrX	1508405	1508405	G	A	SLC25A6	Silent			p.F109
TCGA-06-5416-01	chrX	1746629	1746629	C	T	ASMT	Silent	0.276457883369		
TCGA-CD-A4MI-01	chrX	1413254	1413254	G	A	CSF2RA	Missense_Mutation			p.R227H

So unless we split chromosomes, these genes will not map. I propose splitting with an optional step to include the unsplit rows. Therefore the top row would yield:

entrez_gene_id	symbol	description	chromosome	gene_type	synonyms
263	AMD1P2	adenosylmethionine decarboxylase 1 pseudogene 2	X	pseudo	AMD|AMD2|AMDP1|AMDPX|AMDPY
263	AMD1P2	adenosylmethionine decarboxylase 1 pseudogene 2	Y	pseudo	AMD|AMD2|AMDP1|AMDPX|AMDPY
263	AMD1P2	adenosylmethionine decarboxylase 1 pseudogene 2	X|Y	pseudo	AMD|AMD2|AMDP1|AMDPX|AMDPY

Do you think we should even keep the last row?
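
A rough sketch of the proposed split, assuming each record is a dict whose chromosome value is pipe-delimited. The function name `split_chromosomes` and the `keep_unsplit` flag are hypothetical names chosen for illustration; they stand in for the "optional step to include the unsplit rows" and are not taken from the repository.

```python
def split_chromosomes(row, keep_unsplit=True):
    """Expand one gene record into one record per chromosome.

    `row` is a dict with a pipe-delimited "chromosome" value. When
    `keep_unsplit` is true and the gene lists several chromosomes, a copy of
    the original combined record (e.g. X|Y) is appended after the split ones.
    """
    chromosomes = row["chromosome"].split("|")
    records = [dict(row, chromosome=c) for c in chromosomes]
    if keep_unsplit and len(chromosomes) > 1:
        records.append(dict(row))
    return records


gene = {"entrez_gene_id": "263", "symbol": "AMD1P2", "chromosome": "X|Y"}
for record in split_chromosomes(gene):
    print(record["entrez_gene_id"], record["symbol"], record["chromosome"])
```

With `keep_unsplit=False` the same call would return only the X and Y records, which is the variant the question above is asking about.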

dhimmel commented 7 years ago

> By the way - if someone picks a gene on the X or Y chromosomes other than those in the X|Y set, you may want to automatically detect it and build separate male and female classifiers. This is a strong signal in expression data, even for unsupervised learning.

May want to open an issue in machine-learning.

dhimmel commented 7 years ago

> The others appear to be on unplaced scaffolds. For now, I would probably drop these too, though it's not as clear that these should be dropped as it is for something like OMS.

Okay, leaving these in will in effect drop them, because the resource being mapped won't have that symbol-chromosome combination. No need to explicitly filter.
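
To illustrate why no explicit filter is needed, here is a small sketch of a join on the (symbol, chromosome) pair. It assumes the mutation data uses a "chr" prefix (as in the chrX rows shown earlier) and that the unplaced-scaffold record is kept in its combined form; the helper names and the toy mapper are hypothetical.

```python
def normalize_chromosome(chr_field):
    """Strip a leading 'chr' so that 'chrX' matches the mapper's 'X'."""
    return chr_field[3:] if chr_field.startswith("chr") else chr_field


def map_mutations(mapper, mutations):
    """Inner-join mutation (gene, chr) pairs against the mapper.

    Mapper keys that no mutation refers to are simply never used, so they
    drop out of the result without being filtered beforehand.
    """
    mapped = []
    for gene, chr_field in mutations:
        key = (gene, normalize_chromosome(chr_field))
        if key in mapper:
            mapped.append((gene, chr_field, mapper[key]))
    return mapped


# Toy mapper keyed by (symbol, chromosome) -> entrez_gene_id.
mapper = {
    ("SLC25A6", "X"): "293",
    ("SLC25A6", "Y"): "293",
    ("ASMT", "X"): "438",
    ("LOC105379561", "8|Un"): "105379561",  # assumption: kept in combined form
}
mutations = [("SLC25A6", "chrX"), ("ASMT", "chrX")]
print(map_mutations(mapper, mutations))
```

Because the join is driven by the mutation records, the 8|Un key has no effect on the output.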

cgreene commented 7 years ago

@dhimmel : for the purposes of having a resource to connect potential symbols with chromosomes, I think that retaining at least the first two lines would make the most sense. Maybe the third - I don't know how many resources use X|Y for these regions. I don't see the harm in it, so I guess my inclination would be to leave it as well.

dhimmel commented 7 years ago

@cgreene in b64fcb4261005cf717d5d5ef4e03540e4a1f361e I retained all three lines.

However, there is another issue -- some genes have no chromosome. For example:

These genes all have type unknown, so I'm guessing the inability to map them will not be a big deal. In fact they most likely won't be in our datasets?

cgreene commented 7 years ago

These, to my knowledge, come from the expectation that a gene for the disease exists but nobody has found it yet. They aren't really meaningful molecular entities, and I expect that you won't see them in practice.

cognoma / genes

Genes with multiple chromosomes #2