Closed KickiLagerstedt closed 6 years ago
Hi,
OK, this is a problem that we cannot do much about; I will try to explain here. We use HGNC as the definition of which genes exist and what names they have. The problem is that HGNC is only available for human genome build 38, which we do NOT use in the analysis pipeline. We map and annotate everything in human genome build 37. We use Ensembl to find out which transcripts belong to a gene and which NM transcripts they have. So in this case, if we look at the gene SRD5A2 in Ensembl build 37 we see:
EnsGeneID | EnsTransID | Transcript Start | Transcript End | Gene Name | RefSeq mRNA |
---|---|---|---|---|---|
ENSG00000049319 | ENST00000405650 | 31747550 | 31806136 | SRD5A2 | - |
ENSG00000049319 | ENST00000233139 | 31749656 | 31805969 | SRD5A2 | - |
In build 38 we see
EnsGeneID | EnsTransID | Transcript Start | Transcript End | Gene Name | RefSeq mRNA |
---|---|---|---|---|---|
ENSG00000277893 | ENST00000622030 | 31522480 | 31581067 | SRD5A2 | NM_000348 |
This means that in build 38 there is only one transcript instead of two, and it now has an NM identifier.
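As a rough check of which build 37 transcript is most likely to correspond to the build 38 one, the genomic spans from the two tables above can be compared. This is only a sketch: span length is a crude heuristic chosen here for illustration, and a similar length does not prove that two transcripts are equivalent.

```python
# Transcript coordinates copied from the two tables above.
build37 = {
    "ENST00000405650": (31747550, 31806136),
    "ENST00000233139": (31749656, 31805969),
}
build38 = {"ENST00000622030": (31522480, 31581067)}


def span_length(coords):
    """Length of the genomic span, counting both end positions."""
    start, end = coords
    return end - start + 1


target = span_length(build38["ENST00000622030"])

# Pick the build 37 transcript whose span is closest to the build 38 span.
closest = min(build37, key=lambda t: abs(span_length(build37[t]) - target))
diff = abs(span_length(build37[closest]) - target)
print(closest, diff)
```

By this measure ENST00000405650 is the closer candidate, with a span that differs from the build 38 transcript by a single base.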
We cannot manually map all transcripts between builds 38 and 37. That would not work anyway, since tools like VEP that report impact (synonymous, non-synonymous, etc.) do not know about the mapping. This means that it is potentially dangerous to use the Clinical filter, because we will miss variants.
The best thing would be to use genome build 38 all the way. At the moment that is not possible, since there are too many annotations that we would lose; they are not ready for 38 yet.
Please ask questions here if anything is unclear. We need to discuss this further in a user meeting soon.
Best,
Måns
:+1: It may give some comfort to know that Alamut has exactly the same issue with SRD5A2 and FGF16. No hg19 version available, although well established disease gene - and refseq cDNA entry available.
It would be good to know which of the genes in our panels are affected, so that we can at least apply a local solution if necessary. Is that possible?
When we first looked at all genes we saw the following:
Nr of genes: 33578
Nr without RefSeq transcripts: 11048
So about 1/3 of all genes lack a RefSeq transcript. This seemed like a high number at first, but after some thought one realises that many of these genes are not protein coding etc.
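The fraction above, and the kind of count behind it, can be sketched as follows. The function name and the row layout `(gene, transcript, refseq)` are assumptions made for illustration, not the pipeline's actual code; the example rows are a subset of the primary-chromosome rows shown in this thread.

```python
total_genes = 33578
without_refseq = 11048
fraction = without_refseq / total_genes
print(f"{fraction:.1%}")  # roughly one third


def genes_without_refseq(rows):
    """Return genes for which no transcript carries a RefSeq ID.

    rows: iterable of (gene_name, transcript_id, refseq_id_or_None).
    """
    has_refseq = {}
    for gene, _transcript, refseq in rows:
        has_refseq[gene] = has_refseq.get(gene, False) or refseq is not None
    return sorted(gene for gene, ok in has_refseq.items() if not ok)


# Subset of rows from this thread (primary chromosomes only).
rows_b37 = [
    ("SRD5A2", "ENST00000405650", None),
    ("SRD5A2", "ENST00000233139", None),
    ("FGF16", "ENST00000439435", None),
]
rows_b38 = [
    ("SRD5A2", "ENST00000622030", "NM_000348"),
    ("FGF16", "ENST00000439435", "NM_003868"),
]
print(genes_without_refseq(rows_b37))
print(genes_without_refseq(rows_b38))
```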
If we look at disease-causing genes from OMIM, 16 genes are missing a RefSeq ID for every transcript in Ensembl build 37, and 7 genes in build 38.
The genes in OMIM without refseq transcripts in build 37 are the following:
TTC25
PTPRQ
SRD5A2
PIGY
FGF16
TRAC
PADI6
GDF1
TUBB3
IGHG2
IGHM
IGKC
FCGR2C
KMT2B
NEFL
NR2E3
In the gene panels there is one more gene, outside OMIM, without RefSeq transcripts: MAP3K14
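To see which genes in a given panel are affected, the list above can be intersected with the panel's gene list. This is a minimal sketch: `affected_genes` is a hypothetical helper, and the panel contents below are placeholders for illustration, not the contents of any real panel.

```python
# Genes reported above as lacking a RefSeq transcript in Ensembl build 37.
MISSING_REFSEQ_B37 = {
    "TTC25", "PTPRQ", "SRD5A2", "PIGY", "FGF16", "TRAC", "PADI6", "GDF1",
    "TUBB3", "IGHG2", "IGHM", "IGKC", "FCGR2C", "KMT2B", "NEFL", "NR2E3",
    "MAP3K14",  # in the panels but outside OMIM
}


def affected_genes(panel_genes, missing=MISSING_REFSEQ_B37):
    """Return the panel genes that lack a RefSeq transcript in build 37."""
    return sorted(set(panel_genes) & missing)


# Hypothetical panel contents, for illustration only.
example_panel = ["SRD5A2", "EXAMPLE_GENE_1", "FGF16", "EXAMPLE_GENE_2"]
print(affected_genes(example_panel))
```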
We will look at these genes in detail.
Build | Chrom | EnsGeneID | Gene Name | EnsTransID | RefSeq mRNA |
---|---|---|---|---|---|
37 | HG185_PATCH | ENSG00000260703 | TTC25 | ENST00000569427 | - |
37 | HG185_PATCH | ENSG00000260703 | TTC25 | ENST00000561994 | - |
37 | HG185_PATCH | ENSG00000260703 | TTC25 | ENST00000569541 | NM_031421 |
37 | HG185_PATCH | ENSG00000260703 | TTC25 | ENST00000565052 | - |
37 | 17 | ENSG00000204815 | TTC25 | ENST00000593239 | - |
37 | 17 | ENSG00000204815 | TTC25 | ENST00000591658 | - |
37 | 17 | ENSG00000204815 | TTC25 | ENST00000585530 | - |
37 | 17 | ENSG00000204815 | TTC25 | ENST00000377540 | - |
--- | --- | --- | --- | --- | --- |
38 | 17 | ENSG00000204815 | TTC25 | ENST00000593239 | - |
38 | 17 | ENSG00000204815 | TTC25 | ENST00000591658 | - |
38 | 17 | ENSG00000204815 | TTC25 | ENST00000377540 | NM_031421 |
38 | 17 | ENSG00000204815 | TTC25 | ENST00000585530 | - |
We can see that this gene has a transcript that is mapped to a patch chromosome in build 37. That has been solved in build 38. Also notice that the RefSeq transcript (ENST00000377540) exists in both builds, so no variants should be missed here.
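The comparison made by eye above can be sketched in code. `compare_builds` is a hypothetical helper, not part of the pipeline; it reports transcripts present in both builds that gained a RefSeq ID in build 38, and RefSeq-carrying transcripts that exist only in build 38. Only the chromosome 17 rows for TTC25 are used, so the patch rows are left out.

```python
def compare_builds(b37, b38):
    """Compare transcript annotations for one gene between two builds.

    b37, b38: dict mapping Ensembl transcript ID -> RefSeq ID or None.
    Returns (gained, only_in_38_with_refseq), both sorted lists.
    """
    shared = set(b37) & set(b38)
    gained = sorted(
        t for t in shared if b37[t] is None and b38[t] is not None
    )
    only_in_38 = sorted(
        t for t in set(b38) - set(b37) if b38[t] is not None
    )
    return gained, only_in_38


# TTC25, chromosome 17 rows from the table above.
ttc25_b37 = {
    "ENST00000593239": None,
    "ENST00000591658": None,
    "ENST00000585530": None,
    "ENST00000377540": None,
}
ttc25_b38 = {
    "ENST00000593239": None,
    "ENST00000591658": None,
    "ENST00000377540": "NM_031421",
    "ENST00000585530": None,
}

gained, only_in_38 = compare_builds(ttc25_b37, ttc25_b38)
print(gained, only_in_38)
```

An empty second list together with a non-empty first list is the pattern where the RefSeq transcript already exists in build 37 and only the identifier is missing.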
Build | Chrom | EnsGeneID | Gene Name | EnsTransID | RefSeq mRNA |
---|---|---|---|---|---|
37 | 12 | ENSG00000139304 | PTPRQ | ENST00000551042 | - |
37 | 12 | ENSG00000139304 | PTPRQ | ENST00000547376 | - |
37 | 12 | ENSG00000139304 | PTPRQ | ENST00000551573 | - |
37 | 12 | ENSG00000139304 | PTPRQ | ENST00000526956 | - |
37 | 12 | ENSG00000139304 | PTPRQ | ENST00000547485 | - |
37 | 12 | ENSG00000139304 | PTPRQ | ENST00000551624 | - |
37 | 12 | ENSG00000139304 | PTPRQ | ENST00000547881 | - |
37 | 12 | ENSG00000139304 | PTPRQ | ENST00000549355 | - |
37 | 12 | ENSG00000139304 | PTPRQ | ENST00000266688 | - |
--- | --- | --- | --- | --- | --- |
38 | 12 | ENSG00000139304 | PTPRQ | ENST00000623635 | - |
38 | 12 | ENSG00000139304 | PTPRQ | ENST00000551042 | - |
38 | 12 | ENSG00000139304 | PTPRQ | ENST00000547376 | - |
38 | 12 | ENSG00000139304 | PTPRQ | ENST00000551573 | - |
38 | 12 | ENSG00000139304 | PTPRQ | ENST00000614701 | NM_001145026 |
38 | 12 | ENSG00000139304 | PTPRQ | ENST00000547485 | - |
38 | 12 | ENSG00000139304 | PTPRQ | ENST00000551624 | - |
38 | 12 | ENSG00000139304 | PTPRQ | ENST00000547881 | - |
38 | 12 | ENSG00000139304 | PTPRQ | ENST00000549355 | - |
38 | 12 | ENSG00000139304 | PTPRQ | ENST00000616559 | XM_017019274 |
Here the RefSeq identifiers have been added in build 38. The NM transcript does not seem to exist in 37, so we cannot be sure whether we are missing variants.
Build | Chrom | EnsGeneID | Gene Name | EnsTransID | RefSeq mRNA |
---|---|---|---|---|---|
37 | 2 | ENSG00000049319 | SRD5A2 | ENST00000405650 | - |
37 | 2 | ENSG00000049319 | SRD5A2 | ENST00000233139 | - |
--- | --- | --- | --- | --- | --- |
38 | 2 | ENSG00000277893 | SRD5A2 | ENST00000622030 | NM_000348 |
Here we can see that two transcripts have been merged into one, which has been given a RefSeq identifier. Most probably one of the old transcripts matches the new one. We cannot be sure whether variants are missed here.
DN: reference genome is missing a couple of bp. A different transcript build is needed to resolve the issue completely. GoldenPath is not good enough.
Build | Chrom | EnsGeneID | Gene Name | EnsTransID | RefSeq mRNA |
---|---|---|---|---|---|
37 | 4 | ENSG00000255072 | PIGY | ENST00000527353 | - |
--- | --- | --- | --- | --- | --- |
38 | 4 | ENSG00000255072 | PIGY | ENST00000527353 | - |
Here the transcript is identical in both builds, so we dare to guess that we do not miss any variants.
Build | Chrom | EnsGeneID | Gene Name | EnsTransID | RefSeq mRNA |
---|---|---|---|---|---|
37 | X | ENSG00000196468 | FGF16 | ENST00000439435 | - |
37 | HG1426_PATCH | ENSG00000268853 | FGF16 | ENST00000600602 | - |
--- | --- | --- | --- | --- | --- |
38 | X | ENSG00000196468 | FGF16 | ENST00000439435 | NM_003868 |
The refseq transcript exists in build 37 but is missing the refseq identifier. We would NOT miss any variants here.
DN: The actual exon is missing from the reference genome sequence (just a gap with NNNNNN there) so no amount of transcript fixing will solve this. hg38 or a patch-build is needed.
Build | Chrom | EnsGeneID | Gene Name | EnsTransID | RefSeq mRNA |
---|---|---|---|---|---|
37 | 14 | ENSG00000229164 | TRAC | ENST00000478163 | - |
--- | --- | --- | --- | --- | --- |
38 | 14 | ENSG00000277734 | TRAC | ENST00000611116 | - |
The transcripts have different Ensembl IDs in the two builds but are most probably the same. No RefSeq identifier in either build.
Build | Chrom | EnsGeneID | Gene Name | EnsTransID | RefSeq mRNA |
---|---|---|---|---|---|
37 | 1 | ENSG00000256049 | PADI6 | ENST00000434762 | - |
37 | 1 | ENSG00000256049 | PADI6 | ENST00000358481 | - |
--- | --- | --- | --- | --- | --- |
38 | 1 | ENSG00000276747 | PADI6 | ENST00000619609 | NM_207421 |
38 | CHR_HG2095_PATCH | ENSG00000280949 | PADI6 | ENST00000625380 | NM_207421 |
Unclear what happens here...
Build | Chrom | EnsGeneID | Gene Name | EnsTransID | RefSeq mRNA |
---|---|---|---|---|---|
37 | 19 | ENSG00000130283 | GDF1 | ENST00000247005 | - |
--- | --- | --- | --- | --- | --- |
38 | 19 | ENSG00000130283 | GDF1 | ENST00000247005 | - |
Same transcript, no RefSeq in either build. No variants will be missed here.
Build | Chrom | EnsGeneID | Gene Name | EnsTransID | RefSeq mRNA |
---|---|---|---|---|---|
37 | 16 | ENSG00000198211 | TUBB3 | ENST00000556922 | - |
--- | --- | --- | --- | --- | --- |
38 | 16 | ENSG00000258947 | TUBB3 | ENST00000555810 | - |
38 | 16 | ENSG00000258947 | TUBB3 | ENST00000554444 | NM_001197181 |
38 | 16 | ENSG00000258947 | TUBB3 | ENST00000556565 | - |
38 | 16 | ENSG00000258947 | TUBB3 | ENST00000315491 | NM_006086 |
38 | 16 | ENSG00000258947 | TUBB3 | ENST00000553656 | - |
38 | 16 | ENSG00000258947 | TUBB3 | ENST00000556536 | - |
38 | 16 | ENSG00000258947 | TUBB3 | ENST00000554116 | - |
38 | 16 | ENSG00000258947 | TUBB3 | ENST00000554927 | - |
38 | 16 | ENSG00000258947 | TUBB3 | ENST00000557262 | - |
38 | 16 | ENSG00000258947 | TUBB3 | ENST00000557490 | - |
38 | 16 | ENSG00000258947 | TUBB3 | ENST00000555576 | - |
38 | 16 | ENSG00000258947 | TUBB3 | ENST00000554336 | - |
38 | 16 | ENSG00000258947 | TUBB3 | ENST00000555609 | - |
38 | 16 | ENSG00000258947 | TUBB3 | ENST00000553967 | - |
38 | 16 | ENSG00000258947 | TUBB3 | ENST00000625617 | - |
This is very unclear; it goes from one transcript to many between the builds. Hard to say whether variants are missed here.
Build | Chrom | EnsGeneID | Gene Name | EnsTransID | RefSeq mRNA |
---|---|---|---|---|---|
37 | 14 | ENSG00000211893 | IGHG2 | ENST00000390545 | - |
37 | HG1592_PATCH | ENSG00000270895 | IGHG2 | ENST00000603473 | - |
--- | --- | --- | --- | --- | --- |
38 | CHR_HSCHR14_3_CTG1 | ENSG00000274497 | IGHG2 | ENST00000621803 | - |
38 | 14 | ENSG00000211893 | IGHG2 | ENST00000641095 | - |
38 | 14 | ENSG00000211893 | IGHG2 | ENST00000390545 | - |
No refseq in any build. Same transcripts in both builds except a new one in a patch. No variants would be missed here.
Build | Chrom | EnsGeneID | Gene Name | EnsTransID | RefSeq mRNA |
---|---|---|---|---|---|
37 | 14 | ENSG00000211899 | IGHM | ENST00000390559 | - |
37 | HG1592_PATCH | ENSG00000271541 | IGHM | ENST00000605693 | - |
--- | --- | --- | --- | --- | --- |
38 | CHR_HSCHR14_3_CTG1 | ENSG00000282657 | IGHM | ENST00000626472 | - |
38 | 14 | ENSG00000211899 | IGHM | ENST00000637539 | - |
38 | 14 | ENSG00000211899 | IGHM | ENST00000390559 | - |
Same as above, no variants missing here
Build | Chrom | EnsGeneID | Gene Name | EnsTransID | RefSeq mRNA |
---|---|---|---|---|---|
37 | 2 | ENSG00000211592 | IGKC | ENST00000390237 | - |
--- | --- | --- | --- | --- | --- |
38 | 2 | ENSG00000211592 | IGKC | ENST00000390237 | - |
Same in both builds. No variants would be missing here
Build | Chrom | EnsGeneID | Gene Name | EnsTransID | RefSeq mRNA |
---|---|---|---|---|---|
37 | 1 | ENSG00000244682 | FCGR2C | ENST00000502411 | - |
37 | 1 | ENSG00000244682 | FCGR2C | ENST00000496692 | - |
37 | 1 | ENSG00000244682 | FCGR2C | ENST00000466542 | - |
37 | 1 | ENSG00000244682 | FCGR2C | ENST00000465075 | - |
37 | 1 | ENSG00000244682 | FCGR2C | ENST00000473530 | - |
37 | 1 | ENSG00000244682 | FCGR2C | ENST00000473712 | - |
37 | 1 | ENSG00000244682 | FCGR2C | ENST00000482226 | - |
37 | 1 | ENSG00000244682 | FCGR2C | ENST00000467903 | - |
37 | 1 | ENSG00000244682 | FCGR2C | ENST00000507374 | - |
37 | 1 | ENSG00000244682 | FCGR2C | ENST00000508651 | - |
37 | 1 | ENSG00000244682 | FCGR2C | ENST00000543859 | - |
--- | --- | --- | --- | --- | --- |
38 | 1 | ENSG00000244682 | FCGR2C | ENST00000502411 | - |
38 | 1 | ENSG00000244682 | FCGR2C | ENST00000496692 | - |
38 | 1 | ENSG00000244682 | FCGR2C | ENST00000466542 | - |
38 | 1 | ENSG00000244682 | FCGR2C | ENST00000465075 | - |
38 | 1 | ENSG00000244682 | FCGR2C | ENST00000473530 | - |
38 | 1 | ENSG00000244682 | FCGR2C | ENST00000473712 | - |
38 | 1 | ENSG00000244682 | FCGR2C | ENST00000482226 | - |
38 | 1 | ENSG00000244682 | FCGR2C | ENST00000467903 | - |
38 | 1 | ENSG00000244682 | FCGR2C | ENST00000507374 | - |
38 | 1 | ENSG00000244682 | FCGR2C | ENST00000508651 | - |
38 | 1 | ENSG00000244682 | FCGR2C | ENST00000611236 | - |
38 | 1 | ENSG00000244682 | FCGR2C | ENST00000543859 | NR_047648 |
Most probably no variants would be missing here. All transcripts seem to be correct between the builds.
DN: Polymorphic gene w pseudogenes. Likely a lot of multimapping reads.
Build | Chrom | EnsGeneID | Gene Name | EnsTransID | RefSeq mRNA |
---|---|---|---|---|---|
37 | 19 | ENSG00000105663 | KMT2B | ENST00000606995 | - |
37 | 19 | ENSG00000105663 | KMT2B | ENST00000607650 | - |
37 | 19 | ENSG00000105663 | KMT2B | ENST00000592092 | - |
37 | 19 | ENSG00000105663 | KMT2B | ENST00000585476 | - |
37 | 19 | ENSG00000105663 | KMT2B | ENST00000586308 | - |
--- | --- | --- | --- | --- | --- |
38 | 19 | ENSG00000272333 | KMT2B | ENST00000420124 | NM_014727 |
38 | 19 | ENSG00000272333 | KMT2B | ENST00000606995 | - |
38 | 19 | ENSG00000272333 | KMT2B | ENST00000592092 | - |
38 | 19 | ENSG00000272333 | KMT2B | ENST00000585476 | - |
38 | 19 | ENSG00000272333 | KMT2B | ENST00000586308 | - |
Here the transcript ENST00000607650 in build 37 has probably changed ID to ENST00000420124 and been given a RefSeq identifier.
Build | Chrom | EnsGeneID | Gene Name | EnsTransID | RefSeq mRNA |
---|---|---|---|---|---|
37 | 8 | ENSG00000104725 | NEFL | ENST00000221169 | - |
--- | --- | --- | --- | --- | --- |
38 | 8 | ENSG00000277586 | NEFL | ENST00000610854 | NM_006158 |
38 | 8 | ENSG00000277586 | NEFL | ENST00000615973 | - |
38 | 8 | ENSG00000277586 | NEFL | ENST00000639464 | - |
38 | 8 | ENSG00000277586 | NEFL | ENST00000619417 | - |
This is unclear; probably ENST00000221169 has been renamed to ENST00000610854 and been given a RefSeq identifier.
DN: looks reasonably ok on UCSC RefSeq GoldenPath NM_006158 - single base intron might be the issue.
Build | Chrom | EnsGeneID | Gene Name | EnsTransID | RefSeq mRNA |
---|---|---|---|---|---|
37 | 15 | ENSG00000031544 | NR2E3 | ENST00000561604 | - |
37 | 15 | ENSG00000031544 | NR2E3 | ENST00000567496 | - |
37 | 15 | ENSG00000031544 | NR2E3 | ENST00000562839 | - |
37 | 15 | ENSG00000031544 | NR2E3 | ENST00000562925 | - |
37 | 15 | ENSG00000031544 | NR2E3 | ENST00000563709 | - |
37 | 15 | ENSG00000031544 | NR2E3 | ENST00000398840 | - |
37 | 15 | ENSG00000031544 | NR2E3 | ENST00000326995 | - |
--- | --- | --- | --- | --- | --- |
38 | 15 | ENSG00000278570 | NR2E3 | ENST00000621736 | - |
38 | 15 | ENSG00000278570 | NR2E3 | ENST00000617575 | NM_014249 |
38 | 15 | ENSG00000278570 | NR2E3 | ENST00000621098 | NM_016346 |
38 | 15 | ENSG00000278570 | NR2E3 | ENST00000563709 | - |
This is unclear. We might miss variants here
Build | Chrom | EnsGeneID | Gene Name | EnsTransID | RefSeq mRNA |
---|---|---|---|---|---|
37 | 17 | ENSG00000006062 | MAP3K14 | ENST00000587332 | - |
37 | 17 | ENSG00000006062 | MAP3K14 | ENST00000592267 | - |
37 | 17 | ENSG00000006062 | MAP3K14 | ENST00000586644 | - |
37 | 17 | ENSG00000006062 | MAP3K14 | ENST00000344686 | - |
37 | 17 | ENSG00000006062 | MAP3K14 | ENST00000376926 | - |
--- | --- | --- | --- | --- | --- |
38 | 17 | ENSG00000006062 | MAP3K14 | ENST00000344686 | NM_003954 |
38 | 17 | ENSG00000006062 | MAP3K14 | ENST00000640087 | - |
38 | 17 | ENSG00000006062 | MAP3K14 | ENST00000592267 | - |
38 | 17 | ENSG00000006062 | MAP3K14 | ENST00000586644 | - |
38 | 17 | ENSG00000006062 | MAP3K14 | ENST00000617331 | - |
38 | 17 | ENSG00000006062 | MAP3K14 | ENST00000376926 | - |
38 | CHR_HSCHR17_2_CTG5 | ENSG00000282637 | MAP3K14 | ENST00000633437 | - |
38 | CHR_HSCHR17_2_CTG5 | ENSG00000282637 | MAP3K14 | ENST00000634016 | - |
Here one of the transcripts has been given a RefSeq identifier in build 38. That transcript exists in build 37, so no variants would be missed here.
SRD5A2
The NM transcript is not included in the gene list, i.e. we are missing mutations in this gene (the DSD panel, the OMIM panel, etc.)
SERIOUS ERROR!
How long has this been the case? Are there more genes with this error??
Can we easily reanalyse all samples with respect to DSD???
// Kicki