Clinical-Genomics / scout

VCF visualization interface
https://clinical-genomics.github.io/scout
BSD 3-Clause "New" or "Revised" License

Gene lacking transcript from HGVS #570

Closed by KickiLagerstedt 6 years ago

KickiLagerstedt commented 7 years ago

SRD5A2

The NM transcript is not included in the gene list, i.e. we are missing mutations in this gene (the DSD panel, the OMIM panel, etc.)

SERIOUS ERROR!

How long has this been the case? Are there more genes with this error??

Can we easily reanalyse all samples with respect to DSD???

// Kicki

moonso commented 7 years ago

Hi,

OK, this is a problem that we cannot do much about; I will try to explain here. We use HGNC as the definition of which genes exist and what they are called. The problem is that HGNC is only available for human genome build 38, which we do NOT use in the analysis pipeline. We map and annotate everything in human genome build 37. We use Ensembl to find out which transcripts belong to a gene and which NM transcripts they have. So in this case, if we look at the gene SRD5A2 in Ensembl build 37 we see:

EnsGeneID EnsTransID Transcript Start Transcript End Gene Name RefSeq mRNA
ENSG00000049319 ENST00000405650 31747550 31806136 SRD5A2 -
ENSG00000049319 ENST00000233139 31749656 31805969 SRD5A2 -

In build 38 we see

EnsGeneID EnsTransID Transcript Start Transcript End Gene Name RefSeq mRNA
ENSG00000277893 ENST00000622030 31522480 31581067 SRD5A2 NM_000348

This means that in build 38 there is only one transcript instead of two, and it now has an NM identifier.
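The check described here (does any Ensembl transcript of a gene carry a RefSeq mRNA id in a given build?) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Scout's actual code: the `Transcript` record, the `ROWS` list and the function name `genes_without_refseq` are hypothetical, and the rows are copied from the SRD5A2 listings in this comment.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Transcript(NamedTuple):
    # Hypothetical record type; the fields mirror the columns listed above.
    build: str
    ens_gene_id: str
    ens_trans_id: str
    gene_name: str
    refseq_mrna: Optional[str]  # None where the listing shows "-"


ROWS = [
    Transcript("37", "ENSG00000049319", "ENST00000405650", "SRD5A2", None),
    Transcript("37", "ENSG00000049319", "ENST00000233139", "SRD5A2", None),
    Transcript("38", "ENSG00000277893", "ENST00000622030", "SRD5A2", "NM_000348"),
]


def genes_without_refseq(rows, build):
    """Return gene names in `build` where no transcript has a RefSeq mRNA id."""
    has_refseq = defaultdict(bool)
    for row in rows:
        if row.build == build:
            has_refseq[row.gene_name] |= row.refseq_mrna is not None
    return {gene for gene, found in has_refseq.items() if not found}


print(genes_without_refseq(ROWS, "37"))  # {'SRD5A2'}
print(genes_without_refseq(ROWS, "38"))  # set()
```

With these three rows, SRD5A2 is flagged in build 37 but not in build 38, which matches what the listings show.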

What to do

We cannot manually map all transcripts between builds 38 and 37. That would not work anyway, since tools like VEP that give information on impact (synonymous, non-synonymous, etc.) do not know about the mapping. This means that it is potentially dangerous to use the Clinical filter, because we will miss variants.

The best thing would be to use genome build 38 all the way. At the moment that is not possible, since there are too many annotations that we would lose; they are not ready for 38 yet.

Please ask questions here if anything is unclear. We need to discuss this further in a user meeting soon.

Best,

Måns

dnil commented 7 years ago

:+1: It may give some comfort to know that Alamut has exactly the same issue with SRD5A2 and FGF16. There is no hg19 version available, although these are well-established disease genes and RefSeq cDNA entries are available.

4WGH commented 7 years ago

It would be good to know which of the genes in our panels are affected, so that we can at least apply a local solution if necessary. Is that possible?

moonso commented 6 years ago

When we first looked at all genes we could see the following:

Nr of genes: 33578
Nr without transcripts: 11048

So about 1/3 of all genes lack any RefSeq transcript. This seemed like a high number at first, but on reflection many of these genes are not protein coding, etc.

If we look at disease-causing genes from OMIM, 16 genes are missing a RefSeq id for every transcript in Ensembl build 37, and 7 genes in build 38.

The genes in OMIM without refseq transcripts in build 37 are the following:

TTC25
PTPRQ
SRD5A2
PIGY
FGF16
TRAC
PADI6
GDF1
TUBB3
IGHG2
IGHM
IGKC
FCGR2C
KMT2B
NEFL
NR2E3

In the gene panels there is one more gene outside OMIM without RefSeq transcripts: MAP3K14

We will look at these genes in detail.
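To answer the earlier question about which panels are affected, the gene list above can be intersected with each panel's gene set. The snippet below is a minimal sketch, not an existing Scout feature: the function name `affected_genes_per_panel` is hypothetical, and the panel contents are made up for illustration (only SRD5A2 in the DSD panel comes from the original report; `EXAMPLE_PANEL` and the `EXAMPLE_GENE_*` entries are placeholders).

```python
# Genes listed above as lacking a RefSeq transcript in Ensembl build 37.
OMIM_GENES_WITHOUT_REFSEQ_37 = {
    "TTC25", "PTPRQ", "SRD5A2", "PIGY", "FGF16", "TRAC", "PADI6", "GDF1",
    "TUBB3", "IGHG2", "IGHM", "IGKC", "FCGR2C", "KMT2B", "NEFL", "NR2E3",
}
PANEL_ONLY_GENES_WITHOUT_REFSEQ_37 = {"MAP3K14"}
AFFECTED = OMIM_GENES_WITHOUT_REFSEQ_37 | PANEL_ONLY_GENES_WITHOUT_REFSEQ_37

# Hypothetical panel contents, for illustration only. Real panels would be
# loaded from wherever they are stored.
PANELS = {
    "DSD": {"SRD5A2", "EXAMPLE_GENE_1"},
    "EXAMPLE_PANEL": {"EXAMPLE_GENE_2", "NEFL", "MAP3K14"},
}


def affected_genes_per_panel(panels, affected):
    """Map each panel name to the sorted affected genes it contains."""
    return {name: sorted(genes & affected) for name, genes in panels.items()}


for panel, genes in affected_genes_per_panel(PANELS, AFFECTED).items():
    print(panel, genes)
```

A report like this only says which panels contain a flagged gene. Whether variants are actually missed still has to be judged gene by gene.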

moonso commented 6 years ago

TTC25

Build Chrom EnsGeneID Gene Name EnsTransID RefSeq mRNA
37 HG185_PATCH ENSG00000260703 TTC25 ENST00000569427 -
37 HG185_PATCH ENSG00000260703 TTC25 ENST00000561994 -
37 HG185_PATCH ENSG00000260703 TTC25 ENST00000569541 NM_031421
37 HG185_PATCH ENSG00000260703 TTC25 ENST00000565052 -
37 17 ENSG00000204815 TTC25 ENST00000593239 -
37 17 ENSG00000204815 TTC25 ENST00000591658 -
37 17 ENSG00000204815 TTC25 ENST00000585530 -
37 17 ENSG00000204815 TTC25 ENST00000377540 -
--- --- --- --- --- ---
38 17 ENSG00000204815 TTC25 ENST00000593239 -
38 17 ENSG00000204815 TTC25 ENST00000591658 -
38 17 ENSG00000204815 TTC25 ENST00000377540 NM_031421
38 17 ENSG00000204815 TTC25 ENST00000585530 -

We can see that this gene has a transcript that is mapped to a patch chromosome in build 37. That has been resolved in build 38. Also notice that the transcript that carries the RefSeq id in build 38 (ENST00000377540) exists in both builds, so no variants should be missed here.
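The reasoning used for TTC25 (the transcript that carries the RefSeq id in build 38 is also present on a primary chromosome in build 37) could be automated roughly as follows. This is a sketch under assumptions: the tuple layout and the function names `is_primary` and `refseq_transcripts_in_old_build` are hypothetical, and patch or alternate contigs are recognised purely by the `_PATCH` suffix and `CHR_` prefix seen in this thread.

```python
# (chromosome, Ensembl transcript id, RefSeq mRNA or None), copied from the
# TTC25 listing above.
TTC25_BUILD_37 = [
    ("HG185_PATCH", "ENST00000569427", None),
    ("HG185_PATCH", "ENST00000561994", None),
    ("HG185_PATCH", "ENST00000569541", "NM_031421"),
    ("HG185_PATCH", "ENST00000565052", None),
    ("17", "ENST00000593239", None),
    ("17", "ENST00000591658", None),
    ("17", "ENST00000585530", None),
    ("17", "ENST00000377540", None),
]
TTC25_BUILD_38 = [
    ("17", "ENST00000593239", None),
    ("17", "ENST00000591658", None),
    ("17", "ENST00000377540", "NM_031421"),
    ("17", "ENST00000585530", None),
]


def is_primary(chrom):
    """Assumption: patch/alternate contigs are named *_PATCH or CHR_*."""
    return not (chrom.endswith("_PATCH") or chrom.startswith("CHR_"))


def refseq_transcripts_in_old_build(old_rows, new_rows):
    """RefSeq-annotated transcripts of the new build whose Ensembl id also
    exists on a primary chromosome in the old build."""
    old_primary_ids = {enst for chrom, enst, _ in old_rows if is_primary(chrom)}
    return {
        enst: refseq
        for chrom, enst, refseq in new_rows
        if refseq is not None and enst in old_primary_ids
    }


print(refseq_transcripts_in_old_build(TTC25_BUILD_37, TTC25_BUILD_38))
# {'ENST00000377540': 'NM_031421'}
```

A shared Ensembl id is only a hint. It does not prove that the transcript has the same exons or sequence in both builds, so an empty result means "unclear" rather than "variants are missed".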

moonso commented 6 years ago

PTPRQ

Build Chrom EnsGeneID Gene Name EnsTransID RefSeq mRNA
37 12 ENSG00000139304 PTPRQ ENST00000551042 -
37 12 ENSG00000139304 PTPRQ ENST00000547376 -
37 12 ENSG00000139304 PTPRQ ENST00000551573 -
37 12 ENSG00000139304 PTPRQ ENST00000526956 -
37 12 ENSG00000139304 PTPRQ ENST00000547485 -
37 12 ENSG00000139304 PTPRQ ENST00000551624 -
37 12 ENSG00000139304 PTPRQ ENST00000547881 -
37 12 ENSG00000139304 PTPRQ ENST00000549355 -
37 12 ENSG00000139304 PTPRQ ENST00000266688 -
--- --- --- --- --- ---
38 12 ENSG00000139304 PTPRQ ENST00000623635 -
38 12 ENSG00000139304 PTPRQ ENST00000551042 -
38 12 ENSG00000139304 PTPRQ ENST00000547376 -
38 12 ENSG00000139304 PTPRQ ENST00000551573 -
38 12 ENSG00000139304 PTPRQ ENST00000614701 NM_001145026
38 12 ENSG00000139304 PTPRQ ENST00000547485 -
38 12 ENSG00000139304 PTPRQ ENST00000551624 -
38 12 ENSG00000139304 PTPRQ ENST00000547881 -
38 12 ENSG00000139304 PTPRQ ENST00000549355 -
38 12 ENSG00000139304 PTPRQ ENST00000616559 XM_017019274

Here the RefSeq identifiers have been added in build 38. The NM transcript does not seem to exist in 37, so we cannot be sure whether we are missing variants.

moonso commented 6 years ago

SRD5A2

Build Chrom EnsGeneID Gene Name EnsTransID RefSeq mRNA
37 2 ENSG00000049319 SRD5A2 ENST00000405650 -
37 2 ENSG00000049319 SRD5A2 ENST00000233139 -
--- --- --- --- --- ---
38 2 ENSG00000277893 SRD5A2 ENST00000622030 NM_000348

Here we can see that two transcripts have been merged into one, which has been given a RefSeq identifier. Most probably one of the old transcripts matches the new one. We cannot be sure whether variants are missed here.

DN: reference genome is missing a couple of bp. A different transcript build is needed to resolve the issue completely. GoldenPath is not good enough.

moonso commented 6 years ago

PIGY

Build Chrom EnsGeneID Gene Name EnsTransID RefSeq mRNA
37 4 ENSG00000255072 PIGY ENST00000527353 -
--- --- --- --- --- ---
38 4 ENSG00000255072 PIGY ENST00000527353 -

Here the transcript is identical in both builds, so we dare to guess that we do not miss any variants.

moonso commented 6 years ago

FGF16

Build Chrom EnsGeneID Gene Name EnsTransID RefSeq mRNA
37 X ENSG00000196468 FGF16 ENST00000439435 -
37 HG1426_PATCH ENSG00000268853 FGF16 ENST00000600602 -
--- --- --- --- --- ---
38 X ENSG00000196468 FGF16 ENST00000439435 NM_003868

The refseq transcript exists in build 37 but is missing the refseq identifier. We would NOT miss any variants here.

DN: The actual exon is missing from the reference genome sequence (just a gap with NNNNNN there) so no amount of transcript fixing will solve this. hg38 or a patch-build is needed.

moonso commented 6 years ago

TRAC

Build Chrom EnsGeneID Gene Name EnsTransID RefSeq mRNA
37 14 ENSG00000229164 TRAC ENST00000478163 -
--- --- --- --- --- ---
38 14 ENSG00000277734 TRAC ENST00000611116 -

The transcripts have different Ensembl ids in the two builds but are most probably the same. No RefSeq identifier in either build.

moonso commented 6 years ago

PADI6

Build Chrom EnsGeneID Gene Name EnsTransID RefSeq mRNA
37 1 ENSG00000256049 PADI6 ENST00000434762 -
37 1 ENSG00000256049 PADI6 ENST00000358481 -
--- --- --- --- --- ---
38 1 ENSG00000276747 PADI6 ENST00000619609 NM_207421
38 CHR_HG2095_PATCH ENSG00000280949 PADI6 ENST00000625380 NM_207421

Unclear what happens here...

moonso commented 6 years ago

GDF1

Build Chrom EnsGeneID Gene Name EnsTransID RefSeq mRNA
37 19 ENSG00000130283 GDF1 ENST00000247005 -
--- --- --- --- --- ---
38 19 ENSG00000130283 GDF1 ENST00000247005 -

Same transcript, no RefSeq in either build. No variants will be missed here.

moonso commented 6 years ago

TUBB3

Build Chrom EnsGeneID Gene Name EnsTransID RefSeq mRNA
37 16 ENSG00000198211 TUBB3 ENST00000556922 -
--- --- --- --- --- ---
38 16 ENSG00000258947 TUBB3 ENST00000555810 -
38 16 ENSG00000258947 TUBB3 ENST00000554444 NM_001197181
38 16 ENSG00000258947 TUBB3 ENST00000556565 -
38 16 ENSG00000258947 TUBB3 ENST00000315491 NM_006086
38 16 ENSG00000258947 TUBB3 ENST00000553656 -
38 16 ENSG00000258947 TUBB3 ENST00000556536 -
38 16 ENSG00000258947 TUBB3 ENST00000554116 -
38 16 ENSG00000258947 TUBB3 ENST00000554927 -
38 16 ENSG00000258947 TUBB3 ENST00000557262 -
38 16 ENSG00000258947 TUBB3 ENST00000557490 -
38 16 ENSG00000258947 TUBB3 ENST00000555576 -
38 16 ENSG00000258947 TUBB3 ENST00000554336 -
38 16 ENSG00000258947 TUBB3 ENST00000555609 -
38 16 ENSG00000258947 TUBB3 ENST00000553967 -
38 16 ENSG00000258947 TUBB3 ENST00000625617 -

This is very unclear: it goes from one transcript to many between the builds. Hard to say if variants are missed here.

moonso commented 6 years ago

IGHG2

Build Chrom EnsGeneID Gene Name EnsTransID RefSeq mRNA
37 14 ENSG00000211893 IGHG2 ENST00000390545 -
37 HG1592_PATCH ENSG00000270895 IGHG2 ENST00000603473 -
--- --- --- --- --- ---
38 CHR_HSCHR14_3_CTG1 ENSG00000274497 IGHG2 ENST00000621803 -
38 14 ENSG00000211893 IGHG2 ENST00000641095 -
38 14 ENSG00000211893 IGHG2 ENST00000390545 -

No RefSeq in either build. Same transcripts in both builds except a new one in a patch. No variants would be missed here.

moonso commented 6 years ago

IGHM

Build Chrom EnsGeneID Gene Name EnsTransID RefSeq mRNA
37 14 ENSG00000211899 IGHM ENST00000390559 -
37 HG1592_PATCH ENSG00000271541 IGHM ENST00000605693 -
--- --- --- --- --- ---
38 CHR_HSCHR14_3_CTG1 ENSG00000282657 IGHM ENST00000626472 -
38 14 ENSG00000211899 IGHM ENST00000637539 -
38 14 ENSG00000211899 IGHM ENST00000390559 -

Same as above, no variants missing here

moonso commented 6 years ago

IGKC

Build Chrom EnsGeneID Gene Name EnsTransID RefSeq mRNA
37 2 ENSG00000211592 IGKC ENST00000390237 -
--- --- --- --- --- ---
38 2 ENSG00000211592 IGKC ENST00000390237 -

Same in both builds. No variants would be missing here

moonso commented 6 years ago

FCGR2C

Build Chrom EnsGeneID Gene Name EnsTransID RefSeq mRNA
37 1 ENSG00000244682 FCGR2C ENST00000502411 -
37 1 ENSG00000244682 FCGR2C ENST00000496692 -
37 1 ENSG00000244682 FCGR2C ENST00000466542 -
37 1 ENSG00000244682 FCGR2C ENST00000465075 -
37 1 ENSG00000244682 FCGR2C ENST00000473530 -
37 1 ENSG00000244682 FCGR2C ENST00000473712 -
37 1 ENSG00000244682 FCGR2C ENST00000482226 -
37 1 ENSG00000244682 FCGR2C ENST00000467903 -
37 1 ENSG00000244682 FCGR2C ENST00000507374 -
37 1 ENSG00000244682 FCGR2C ENST00000508651 -
37 1 ENSG00000244682 FCGR2C ENST00000543859 -
--- --- --- --- --- ---
38 1 ENSG00000244682 FCGR2C ENST00000502411 -
38 1 ENSG00000244682 FCGR2C ENST00000496692 -
38 1 ENSG00000244682 FCGR2C ENST00000466542 -
38 1 ENSG00000244682 FCGR2C ENST00000465075 -
38 1 ENSG00000244682 FCGR2C ENST00000473530 -
38 1 ENSG00000244682 FCGR2C ENST00000473712 -
38 1 ENSG00000244682 FCGR2C ENST00000482226 -
38 1 ENSG00000244682 FCGR2C ENST00000467903 -
38 1 ENSG00000244682 FCGR2C ENST00000507374 -
38 1 ENSG00000244682 FCGR2C ENST00000508651 -
38 1 ENSG00000244682 FCGR2C ENST00000611236 -
38 1 ENSG00000244682 FCGR2C ENST00000543859 NR_047648

Most probably no variants would be missed here. All transcripts seem to match between the builds.

DN: Polymorphic gene w pseudogenes. Likely a lot of multimapping reads.

moonso commented 6 years ago

KMT2B

Build Chrom EnsGeneID Gene Name EnsTransID RefSeq mRNA
37 19 ENSG00000105663 KMT2B ENST00000606995 -
37 19 ENSG00000105663 KMT2B ENST00000607650 -
37 19 ENSG00000105663 KMT2B ENST00000592092 -
37 19 ENSG00000105663 KMT2B ENST00000585476 -
37 19 ENSG00000105663 KMT2B ENST00000586308 -
--- --- --- --- --- ---
38 19 ENSG00000272333 KMT2B ENST00000420124 NM_014727
38 19 ENSG00000272333 KMT2B ENST00000606995 -
38 19 ENSG00000272333 KMT2B ENST00000592092 -
38 19 ENSG00000272333 KMT2B ENST00000585476 -
38 19 ENSG00000272333 KMT2B ENST00000586308 -

Here the transcript ENST00000607650 in build 37 has probably changed id to ENST00000420124 and been given a RefSeq identifier.
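The guess about a renamed transcript follows from comparing the two sets of transcript ids: exactly one id disappears and one appears. A minimal sketch of that comparison, with ids copied from the KMT2B listing above (the variable names are illustrative only):

```python
KMT2B_BUILD_37 = {
    "ENST00000606995", "ENST00000607650", "ENST00000592092",
    "ENST00000585476", "ENST00000586308",
}
KMT2B_BUILD_38 = {
    "ENST00000420124", "ENST00000606995", "ENST00000592092",
    "ENST00000585476", "ENST00000586308",
}

# Ids present in only one of the builds are candidates for a rename.
only_in_37 = KMT2B_BUILD_37 - KMT2B_BUILD_38
only_in_38 = KMT2B_BUILD_38 - KMT2B_BUILD_37

print(sorted(only_in_37))  # ['ENST00000607650']
print(sorted(only_in_38))  # ['ENST00000420124']
```

The set difference only shows which ids differ. Confirming that the two transcripts really are the same would require comparing their coordinates or sequences.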

moonso commented 6 years ago

NEFL

Build Chrom EnsGeneID Gene Name EnsTransID RefSeq mRNA
37 8 ENSG00000104725 NEFL ENST00000221169 -
--- --- --- --- --- ---
38 8 ENSG00000277586 NEFL ENST00000610854 NM_006158
38 8 ENSG00000277586 NEFL ENST00000615973 -
38 8 ENSG00000277586 NEFL ENST00000639464 -
38 8 ENSG00000277586 NEFL ENST00000619417 -

This is unclear. Probably ENST00000221169 has been renamed to ENST00000610854 and been given a RefSeq identifier.

DN: looks reasonably ok on UCSC RefSeq GoldenPath NM_006158 - single base intron might be the issue.

moonso commented 6 years ago

NR2E3

Build Chrom EnsGeneID Gene Name EnsTransID RefSeq mRNA
37 15 ENSG00000031544 NR2E3 ENST00000561604 -
37 15 ENSG00000031544 NR2E3 ENST00000567496 -
37 15 ENSG00000031544 NR2E3 ENST00000562839 -
37 15 ENSG00000031544 NR2E3 ENST00000562925 -
37 15 ENSG00000031544 NR2E3 ENST00000563709 -
37 15 ENSG00000031544 NR2E3 ENST00000398840 -
37 15 ENSG00000031544 NR2E3 ENST00000326995 -
--- --- --- --- --- ---
38 15 ENSG00000278570 NR2E3 ENST00000621736 -
38 15 ENSG00000278570 NR2E3 ENST00000617575 NM_014249
38 15 ENSG00000278570 NR2E3 ENST00000621098 NM_016346
38 15 ENSG00000278570 NR2E3 ENST00000563709 -

This is unclear. We might miss variants here

moonso commented 6 years ago

MAP3K14

Build Chrom EnsGeneID Gene Name EnsTransID RefSeq mRNA
37 17 ENSG00000006062 MAP3K14 ENST00000587332 -
37 17 ENSG00000006062 MAP3K14 ENST00000592267 -
37 17 ENSG00000006062 MAP3K14 ENST00000586644 -
37 17 ENSG00000006062 MAP3K14 ENST00000344686 -
37 17 ENSG00000006062 MAP3K14 ENST00000376926 -
--- --- --- --- --- ---
38 17 ENSG00000006062 MAP3K14 ENST00000344686 NM_003954
38 17 ENSG00000006062 MAP3K14 ENST00000640087 -
38 17 ENSG00000006062 MAP3K14 ENST00000592267 -
38 17 ENSG00000006062 MAP3K14 ENST00000586644 -
38 17 ENSG00000006062 MAP3K14 ENST00000617331 -
38 17 ENSG00000006062 MAP3K14 ENST00000376926 -
38 CHR_HSCHR17_2_CTG5 ENSG00000282637 MAP3K14 ENST00000633437 -
38 CHR_HSCHR17_2_CTG5 ENSG00000282637 MAP3K14 ENST00000634016 -

Here one of the transcripts has been given a RefSeq identifier in build 38. That transcript exists in build 37, so no variants would be missed here.