Open jblachly opened 3 years ago
Requires latest dhtslib develop
branch commit, and latest intervaltree generic_basicinterval
branch.
Compiles, probably works minus needing to re-reference some alleles ala VCF etc. Not tested.
@charlesgregory feel free to help with compilation and testing
@Kekananen and @charlesgregory test the latest version which should work with MAF; only missing case(???) is if there are multiple output intervals for a single input row, we punt on this and skip (but should report this metric). This should be rare, may show up with SVs or long-ish indels.
@Kekananen would be interested in knowing in real world liftover how often this is seen
@jblachly After compiling, I think one of the main issues is that start
and end
are actually labeled Start_Position
and End_Position
in MAF files. This holds true for current ones on DNAnexus' as well as for the cancerhotspots' db one. I've fiddled it down to fixing where those are in the script, i.e long start = fields.data[MAF.Start_Position].to!int;
. As well as changing other relating values after and prior.
However, after fixing the fields, auto trimmedLinks = cf.lift(fields.data[MAF.Contig], start, end);
is throwing Segmentation fault (core dumped)
which tracks back to version(commonAPI) auto o = this.chainsByContig[contig].findOverlapsWith(i); // returns Node*(s)
in the chain.d script not working.
Nothing in the code (should) depends on the header line's field names. MAF
is an enum and is just an integer lookup into the fields
array.
If you head the first 10 lines of a MAF file and test that, does it crash, or is it some line somewhere later in the MAF that crashes it? If the former, can you paste in those 10 lines so i can reproduce
Still is crashing. Here is the first 10 lines it crashes on.
#version 2.4
Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_File Sequencer Tumor_Sample_UUID Matched_Norm_Sample_UUID HGVSc HGVSp HGVSp_Short Transcript_ID Exon_Number t_depth t_ref_count t_alt_count n_depth n_ref_count n_alt_count all_effects Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation ALLELE_NUM DISTANCE STRAND_VEP SYMBOL SYMBOL_SOURCE HGNC_ID BIOTYPE CANONICAL CCDS ENSP SWISSPROT TREMBL UNIPARC RefSeq SIFT PolyPhen EXON INTRON DOMAINS AF AFR_AF AMR_AF ASN_AF EAS_AF EUR_AF SAS_AF AA_AF EA_AF CLIN_SIG SOMATIC PUBMED MOTIF_NAME MOTIF_POS HIGH_INF_POS MOTIF_SCORE_CHANGE IMPACT PICK VARIANT_CLASS TSL HGVS_OFFSET PHENO MINIMISED ExAC_AF ExAC_AF_AFR ExAC_AF_AMR ExAC_AF_EAS ExAC_AF_FIN ExAC_AF_NFE ExAC_AF_OTH ExAC_AF_SAS GENE_PHENO FILTER flanking_bps variant_id variant_qual ExAC_AF_Adj ExAC_AC_AN_Adj ExAC_AC_AN ExAC_AC_AN_AFR ExAC_AC_AN_AMR ExAC_AC_AN_EAS ExAC_AC_AN_FIN ExAC_AC_AN_NFE ExAC_AC_AN_OTH ExAC_AC_AN_SAS ExAC_FILTER gnomAD_AF gnomAD_AFR_AF gnomAD_AMR_AF gnomAD_ASJ_AF gnomAD_EAS_AF gnomAD_FIN_AF gnomAD_NFE_AF gnomAD_OTH_AF gnomAD_SAS_AF TUMORTYPE PLATFORM judgement Amino_Acid_Change Amino_Acid_Position Protein_Lenght Reference_Amino_Acid Variant_Amino_Acid allele_freq tm Amino_Acid_Length Ref_Tri oncotree_organtype oncotree_parent oncotree_detailed Master_ID
WARS2 10352 . GRCh37 1 119575617 119575617 + Missense_Mutation SNP C C T novel 000236 NORMAL C C c.1000G>A p.Val334Ile p.V334I ENST00000235521 6/6 0 . . 0 . . WARS2,missense_variant,p.Val334Ile,ENST00000235521,NM_201263.2,NM_015836.3;WARS2,missense_variant,p.Val240Ile,ENST00000537870,;WARS2,3_prime_UTR_variant,,ENST00000369426,;WARS2,downstream_gene_variant,,ENST00000497402,;WARS2,downstream_gene_variant,,ENST00000495746,; T ENSG00000116874 ENST00000235521 Transcript missense_variant 1027/2800 1000/1083 334/360 V/I Gtt/Att 1 -1 WARS2 HGNC 12730 protein_coding YES CCDS900.1 ENSP00000235521 Q9UGM6 B7Z5X7 UPI000004A002 NM_201263.2,NM_015836.3 tolerated(0.31) benign(0.015) 6/6 Gene3D:1.10.240.10,HAMAP:MF_00140_B,hmmpanther:PTHR10055,Low_complexity_(Seg):seg,Superfamily_domains:SSF52374,TIGRFAM_domain:TIGR00233 MODERATE 1 SNV . ACC . . acyc exome RETAIN V334I 334 V I NA WARS2 334 360 ACC headandneck saca acyc 000236
OPN3 23596 . GRCh37 1 241761094 241761094 + Missense_Mutation SNP G G A rs780348058 000236 NORMAL G G c.899C>T p.Ser300Leu p.S300L ENST00000366554 3/4 0 . . 0 . . OPN3,missense_variant,p.Ser300Leu,ENST00000366554,NM_014322.2;OPN3,missense_variant,p.Ser221Leu,ENST00000331838,;KMO,downstream_gene_variant,,ENST00000366559,NM_003679.4;KMO,downstream_gene_variant,,ENST00000366557,;KMO,downstream_gene_variant,,ENST00000366555,;OPN3,non_coding_transcript_exon_variant,,ENST00000469376,;OPN3,non_coding_transcript_exon_variant,,ENST00000490673,;OPN3,non_coding_transcript_exon_variant,,ENST00000478849,;OPN3,non_coding_transcript_exon_variant,,ENST00000463155,;OPN3,non_coding_transcript_exon_variant,,ENST00000462265,; A ENSG00000054277 ENST00000366554 Transcript missense_variant 1006/2620 899/1209 300/402 S/L tCg/tTg rs780348058 1 -1 OPN3 HGNC 14007 protein_coding YES CCDS31072.1 ENSP00000355512 Q9H1Y3 UPI000000165B NM_014322.2 deleterious(0.02) possibly_damaging(0.692) 3/4 Transmembrane_helices:TMhelix,PROSITE_profiles:PS50262,hmmpanther:PTHR24240:SF64,hmmpanther:PTHR24240,PROSITE_patterns:PS00238,Gene3D:1.20.1070.10,Pfam_domain:PF00001,Superfamily_domains:SSF81321,Prints_domain:PR00237 MODERATE 1 SNV 9.415e-06 0 0 0.0001278 0 0 0 0 . CGA . . 9.426e-06 1/106086 1/106208 0/9066 0/11158 1/7822 0/6612 0/54326 0/694 0/16408 PASS acyc exome RETAIN S300L 300 S L NA OPN3 300 402 TCG headandneck saca acyc 000236
NAA16 79612 . GRCh37 13 41894863 41894863 + Missense_Mutation SNP G G A novel 000236 NORMAL G G c.305G>A p.Cys102Tyr p.C102Y ENST00000379406 4/20 0 . . 0 . . NAA16,missense_variant,p.Cys102Tyr,ENST00000379406,NM_024561.4;NAA16,missense_variant,p.Cys102Tyr,ENST00000379367,;NAA16,missense_variant,p.Cys102Tyr,ENST00000403412,NM_018527.3,NM_001110798.1;NAA16,non_coding_transcript_exon_variant,,ENST00000476980,;NAA16,missense_variant,p.Cys102Tyr,ENST00000464857,; A ENSG00000172766 ENST00000379406 Transcript missense_variant 629/4347 305/2595 102/864 C/Y tGt/tAt 1 1 NAA16 HGNC 26164 protein_coding YES CCDS9379.1 ENSP00000368716 Q6N069 A4FU51 UPI00001B559E NM_024561.4 deleterious(0) probably_damaging(0.995) 4/20 Gene3D:1.25.40.10,Pfam_domain:PF13414,PIRSF_domain:PIRSF000422,PROSITE_profiles:PS50005,PROSITE_profiles:PS50293,hmmpanther:PTHR22767,hmmpanther:PTHR22767:SF5,SMART_domains:SM00028,Superfamily_domains:SSF48452 MODERATE 1 SNV 1 . TGT . . acyc exome RETAIN C102Y 102 C Y NA NAA16 102 864 ACA headandneck saca acyc 000236
DHODH 1723 . GRCh37 16 72056354 72056354 + Missense_Mutation SNP A A C novel 000236 NORMAL A A c.799A>C p.Ile267Leu p.I267L ENST00000219240 6/9 0 . . 0 . . DHODH,missense_variant,p.Ile267Leu,ENST00000572887,;DHODH,missense_variant,p.Ile267Leu,ENST00000219240,NM_001361.4;DHODH,intron_variant,,ENST00000574309,;DHODH,intron_variant,,ENST00000573922,;DHODH,intron_variant,,ENST00000571392,;DHODH,downstream_gene_variant,,ENST00000573843,;DHODH,downstream_gene_variant,,ENST00000572003,; C ENSG00000102967 ENST00000219240 Transcript missense_variant 820/2065 799/1188 267/395 I/L Att/Ctt 1 1 DHODH HGNC 2867 protein_coding YES CCDS42192.1 ENSP00000219240 Q02127 J3QRQ3 UPI00001FF5FB NM_001361.4 deleterious(0) probably_damaging(0.99) 6/9 hmmpanther:PTHR11938,Gene3D:3.20.20.70,Pfam_domain:PF01180,TIGRFAM_domain:TIGR01036,PIRSF_domain:PIRSF000164,Superfamily_domains:SSF51395 MODERATE 1 SNV 1 . CAT . . acyc exome RETAIN I267L 267 I L NA DHODH 267 395 ATG headandneck saca acyc 000236
KRTAP21-1 337977 . GRCh37 21 32127473 32127473 + Missense_Mutation SNP C C T novel 000236 NORMAL C C c.224G>A p.Cys75Tyr p.C75Y ENST00000335093 1/1 0 . . 0 . . KRTAP21-1,missense_variant,p.Cys75Tyr,ENST00000335093,NM_181619.1; T ENSG00000187005 ENST00000335093 Transcript missense_variant 274/308 224/240 75/79 C/Y tGc/tAc 1 -1 KRTAP21-1 HGNC 18945 protein_coding YES CCDS13606.1 ENSP00000335566 Q3LI58 UPI00001A9E4A NM_181619.1 deleterious_low_confidence(0) unknown(0) 1/1 hmmpanther:PTHR31294,hmmpanther:PTHR31294:SF3 MODERATE 1 SNV . GCA . . acyc exome RETAIN C75Y 75 C Y NA KRTAP21-1 75 79 GCA headandneck saca acyc 000236
PTPRG 5793 . GRCh37 3 62204575 62204575 + Nonsense_Mutation SNP G G T novel 000380 NORMAL G G c.2206G>T p.Glu736Ter p.E736* ENST00000474889 13/30 0 . . 0 . . PTPRG,stop_gained,p.Glu736Ter,ENST00000474889,NM_002841.3;PTPRG,stop_gained,p.Glu736Ter,ENST00000295874,;PTPRG,non_coding_transcript_exon_variant,,ENST00000475012,; T ENSG00000144724 ENST00000474889 Transcript stop_gained 2583/9021 2206/4338 736/1445 E/* Gag/Tag 1 1 PTPRG HGNC 9671 protein_coding YES CCDS2895.1 ENSP00000418112 P23470 O60420 UPI00001AEBFB NM_002841.3 13/30 hmmpanther:PTHR19134,hmmpanther:PTHR19134:SF189 HIGH 1 SNV 1 . GGA . . acyc exome RETAIN E736* 736 E * NA PTPRG 736 1445 TCC headandneck saca acyc 000380
SIPA1L2 57568 . GRCh37 1 232539886 232539886 + Missense_Mutation SNP G G A rs867423404 000410 NORMAL G G c.4801C>T p.His1601Tyr p.H1601Y ENST00000366630 19/22 0 . . 0 . . SIPA1L2,missense_variant,p.His1601Tyr,ENST00000366630,;SIPA1L2,missense_variant,p.His1601Tyr,ENST00000262861,NM_020808.3;SIPA1L2,intron_variant,,ENST00000308942,;SIPA1L2,intron_variant,,ENST00000495863,;SIPA1L2,upstream_gene_variant,,ENST00000494056,; A ENSG00000116991 ENST00000366630 Transcript missense_variant 5160/6690 4801/5169 1601/1722 H/Y Cac/Tac rs867423404 1 -1 SIPA1L2 HGNC 23800 protein_coding YES CCDS41474.1 ENSP00000355589 Q9P2F8 UPI00001D7D6A deleterious(0.01) possibly_damaging(0.545) 19/22 Pfam_domain:PF11881,hmmpanther:PTHR15711:SF7,hmmpanther:PTHR15711 MODERATE 1 SNV . TGG . . acyc exome RETAIN H1601Y 1601 H Y NA SIPA1L2 1601 1722 CCA headandneck saca acyc 000410
FGD6 55785 . GRCh37 12 95483365 95483365 + Missense_Mutation SNP T T C novel 000410 NORMAL T T c.3958A>G p.Ile1320Val p.I1320V ENST00000343958 18/21 0 . . 0 . . FGD6,missense_variant,p.Ile1320Val,ENST00000343958,NM_018351.3;FGD6,missense_variant,p.Ile1320Val,ENST00000546711,;FGD6,missense_variant,p.Ile64Val,ENST00000548069,;FGD6,downstream_gene_variant,,ENST00000549499,;FGD6,downstream_gene_variant,,ENST00000551521,;FGD6,3_prime_UTR_variant,,ENST00000451107,; C ENSG00000180263 ENST00000343958 Transcript missense_variant 4182/9288 3958/4293 1320/1430 I/V Atc/Gtc 1 -1 FGD6 HGNC 21740 protein_coding YES CCDS31878.1 ENSP00000344446 Q6ZV73 F8VY01 UPI00001FB2F4 NM_018351.3 tolerated(0.15) benign(0.168) 18/21 hmmpanther:PTHR12673:SF12,hmmpanther:PTHR12673,Superfamily_domains:SSF50729 MODERATE 1 SNV . ATT . . acyc exome RETAIN I1320V 1320 I V NA FGD6 1320 1430 ATT headandneck saca acyc 000410
LMK if adding an if
condition to skip the second, un-hash-marked header line fixes it
Apologies in the delay in updating on this; after adding in a few different style of if statements (though not at once), the result either stayed the same. The first case being that it seems to be read in Unexpected 'S' when converting from type char[] to type int
still even with if statements to element the line starting with Hugo
or Hugo_Symbol
as I tried various ways in the conditionals.
In the case of editing the file directly to have a #
for the second header line then Segmentation fault (core dumped)
was thrown. This is something I may have already tracked down prior to line auto o = this.chainsByContig[contig].findOverlapsWith(i); // returns Node*(s)
assuming the previous fixes I did are implemented.
This is with the same dataset as above or with the entire set from cosmic. The only change being a #Hugo_Symbol
on line 2 if trying the core dumped error.
Can you please reduce the failure case to a single row
If you're referring to the dataset:
#version 2.4
Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_File Sequencer Tumor_Sample_UUID Matched_Norm_Sample_UUID HGVSc HGVSp HGVSp_Short Transcript_ID Exon_Number t_depth t_ref_count t_alt_count n_depth n_ref_count n_alt_count all_effects Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation ALLELE_NUM DISTANCE STRAND_VEP SYMBOL SYMBOL_SOURCE HGNC_ID BIOTYPE CANONICAL CCDS ENSP SWISSPROT TREMBL UNIPARC RefSeq SIFT PolyPhen EXON INTRON DOMAINS AF AFR_AF AMR_AF ASN_AF EAS_AF EUR_AF SAS_AF AA_AF EA_AF CLIN_SIG SOMATIC PUBMED MOTIF_NAME MOTIF_POS HIGH_INF_POS MOTIF_SCORE_CHANGE IMPACT PICK VARIANT_CLASS TSL HGVS_OFFSET PHENO MINIMISED ExAC_AF ExAC_AF_AFR ExAC_AF_AMR ExAC_AF_EAS ExAC_AF_FIN ExAC_AF_NFE ExAC_AF_OTH ExAC_AF_SAS GENE_PHENO FILTER flanking_bps variant_id variant_qual ExAC_AF_Adj ExAC_AC_AN_Adj ExAC_AC_AN ExAC_AC_AN_AFR ExAC_AC_AN_AMR ExAC_AC_AN_EAS ExAC_AC_AN_FIN ExAC_AC_AN_NFE ExAC_AC_AN_OTH ExAC_AC_AN_SAS ExAC_FILTER gnomAD_AF gnomAD_AFR_AF gnomAD_AMR_AF gnomAD_ASJ_AF gnomAD_EAS_AF gnomAD_FIN_AF gnomAD_NFE_AF gnomAD_OTH_AF gnomAD_SAS_AF TUMORTYPE PLATFORM judgement Amino_Acid_Change Amino_Acid_Position Protein_Lenght Reference_Amino_Acid Variant_Amino_Acid allele_freq tm Amino_Acid_Length Ref_Tri oncotree_organtype oncotree_parent oncotree_detailed Master_ID
WARS2 10352 . GRCh37 1 119575617 119575617 + Missense_Mutation SNP C C T novel 000236 NORMAL C C c.1000G>A p.Val334Ile p.V334I ENST00000235521 6/6 0 . . 0 . . WARS2,missense_variant,p.Val334Ile,ENST00000235521,NM_201263.2,NM_015836.3;WARS2,missense_variant,p.Val240Ile,ENST00000537870,;WARS2,3_prime_UTR_variant,,ENST00000369426,;WARS2,downstream_gene_variant,,ENST00000497402,;WARS2,downstream_gene_variant,,ENST00000495746,; T ENSG00000116874 ENST00000235521 Transcript missense_variant 1027/2800 1000/1083 334/360 V/I Gtt/Att 1 -1 WARS2 HGNC 12730 protein_coding YES CCDS900.1 ENSP00000235521 Q9UGM6 B7Z5X7 UPI000004A002 NM_201263.2,NM_015836.3 tolerated(0.31) benign(0.015) 6/6 Gene3D:1.10.240.10,HAMAP:MF_00140_B,hmmpanther:PTHR10055,Low_complexity_(Seg):seg,Superfamily_domains:SSF52374,TIGRFAM_domain:TIGR00233 MODERATE 1 SNV . ACC . . acyc exome RETAIN V334I 334 V I NA WARS2 334 360 ACC headandneck saca acyc 000236
Then the above throws: Unexpected 'S' when converting from type char[] to type int
commits up to 9f407db59d0d2351ec3ee4dadefa3c7402299ab4 should fix your issues ; there may be other problems for larger indels spanning many nt, but this should work for all SNVs. Let me know
You'll need latest commit in dhtslib because I discovered coordinate translation bug in the faidx fetch sequence routine
Using htslib up to that commit, it's now throwing:
[I::swiftover.main.main] 🚀 swift liftover (version: splay trees)
[I::swiftover.maf.liftMAF] Reading chainfile
[I::swiftover.maf.liftMAF] Reading MAF
Segmentation fault (core dumped)
This is using any file including the prior one or multi liner test file, or the original.
#version 2.4
Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_File Sequencer Tumor_Sample_UUID Matched_Norm_Sample_UUID HGVSc HGVSp HGVSp_Short Transcript_ID Exon_Number t_depth t_ref_count t_alt_count n_depth n_ref_count n_alt_count all_effects Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation ALLELE_NUM DISTANCE STRAND_VEP SYMBOL SYMBOL_SOURCE HGNC_ID BIOTYPE CANONICAL CCDS ENSP SWISSPROT TREMBL UNIPARC RefSeq SIFT PolyPhen EXON INTRON DOMAINS AF AFR_AF AMR_AF ASN_AF EAS_AF EUR_AF SAS_AF AA_AF EA_AF CLIN_SIG SOMATIC PUBMED MOTIF_NAME MOTIF_POS HIGH_INF_POS MOTIF_SCORE_CHANGE IMPACT PICK VARIANT_CLASS TSL HGVS_OFFSET PHENO MINIMISED ExAC_AF ExAC_AF_AFR ExAC_AF_AMR ExAC_AF_EAS ExAC_AF_FIN ExAC_AF_NFE ExAC_AF_OTH ExAC_AF_SAS GENE_PHENO FILTER flanking_bps variant_id variant_qual ExAC_AF_Adj ExAC_AC_AN_Adj ExAC_AC_AN ExAC_AC_AN_AFR ExAC_AC_AN_AMR ExAC_AC_AN_EAS ExAC_AC_AN_FIN ExAC_AC_AN_NFE ExAC_AC_AN_OTH ExAC_AC_AN_SAS ExAC_FILTER gnomAD_AF gnomAD_AFR_AF gnomAD_AMR_AF gnomAD_ASJ_AF gnomAD_EAS_AF gnomAD_FIN_AF gnomAD_NFE_AF gnomAD_OTH_AF gnomAD_SAS_AF TUMORTYPE PLATFORM judgement Amino_Acid_Change Amino_Acid_Position Protein_Lenght Reference_Amino_Acid Variant_Amino_Acid allele_freq tm Amino_Acid_Length Ref_Tri oncotree_organtype oncotree_parent oncotree_detailed Master_ID
WARS2 10352 . GRCh37 1 119575617 119575617 + Missense_Mutation SNP C C T novel 000236 NORMAL C C c.1000G>A p.Val334Ile p.V334I ENST00000235521 6/6 0 . . 0 . . WARS2,missense_variant,p.Val334Ile,ENST00000235521,NM_201263.2,NM_015836.3;WARS2,missense_variant,p.Val240Ile,ENST00000537870,;WARS2,3_prime_UTR_variant,,ENST00000369426,;WARS2,downstream_gene_variant,,ENST00000497402,;WARS2,downstream_gene_variant,,ENST00000495746,; T ENSG00000116874 ENST00000235521 Transcript missense_variant 1027/2800 1000/1083 334/360 V/I Gtt/Att 1 -1 WARS2 HGNC 12730 protein_coding YES CCDS900.1 ENSP00000235521 Q9UGM6 B7Z5X7 UPI000004A002 NM_201263.2,NM_015836.3 tolerated(0.31) benign(0.015) 6/6 Gene3D:1.10.240.10,HAMAP:MF_00140_B,hmmpanther:PTHR10055,Low_complexity_(Seg):seg,Superfamily_domains:SSF52374,TIGRFAM_domain:TIGR00233 MODERATE 1 SNV . ACC . . acyc exome RETAIN V334I 334 V I NA WARS2 334 360 ACC headandneck saca acyc 000236
Okay, after rechecking everything:
htslib is at the newest commit 238600d2e0401b893ead71f967f0488f5d144e63
dhtslib is also at the newest commit 85493f200541037604820608c5ad041c74a20bf9
path is set at LD_LIBRARY_PATH=/tools/htslib/lib/ LIBRARY_PATH=/tools/htslib/lib/
command as follows:
./swiftover -t maf -c resources/GRCh37_to_GRCh38.chain.gz -g ../../databases/GRCh38.d1.vd1.fa -b GRCh38 -i ../test.maf -u unmatched.hotspots.maf > test.maf
error is then thrown as follows:
[I::swiftover.main.main] 🚀 swift liftover (version: splay trees)
[I::swiftover.maf.liftMAF] Reading chainfile
std.conv.ConvException@/home/OSUMC.EDU/kana18/dlang/ldc-1.24.0/bin/../import/std/conv.d(2382): Unexpected end of input when converting from type char[] to type long
----------------
??:? [0x55d625b29065]
??:? [0x55d625b349e6]
??:? [0x55d625b16b7d]
??:? [0x55d625ad9f1b]
??:? [0x55d625ad9d57]
??:? [0x55d625ac5e15]
??:? [0x55d625abb451]
??:? [0x55d625ac151c]
??:? [0x55d625abee7b]
??:? [0x55d625b1684b]
??:? [0x55d625b16747]
??:? [0x55d625b1659d]
??:? __libc_start_main [0x7fe56474bbf6]
??:? [0x55d625ab9ef9]
Requirements Chain file If you are working with human data, you can quickly grab hg19 to hg38 (UCSC-style contig naming: chr1, chrM) and GRCh37 to GRCh38 (Ensembl-style contig naming: 1, MT) by issuing make chains. Chainfiles will be placed in resources/, and for the time being must be un-gzipped.
Also note that your GRCh38 fasta file uses chr1
style naming, so I had to use a different destination genome. Alternatiively, you can also find chain files that in addition to lifting over coordinates change contig names, so the chain file would be along the lines of GRCh37_to_Hg38.chain
blac96@cancer-f5rdy12:/home/OSUMC.EDU/kana18/tools/swiftover$ ./swiftover -t maf -c ~/source/blachlylab/swiftover/resources/GRCh37_to_GRCh38.chain -g ~/source/blachlylab/swiftover/resources/Homo_sapiens.GRCh38.dna.toplevel.fa.gz -b ncbi_build -i ../test.maf -u /tmp/unmatched.hotspots.maf -o /tmp/test.maf
[I::swiftover.main.main] 🚀 swift liftover (version: splay trees)
[I::swiftover.maf.liftMAF] Reading chainfile
[I::swiftover.maf.liftMAF] Reading MAF
[I::swiftover.maf.liftMAF] Matched 2 records (0 multiply, skipped) and 0 unmatched; 2 REF allele changes
blac96@cancer-f5rdy12:/home/OSUMC.EDU/kana18/tools/swiftover$ wc -l ../test.maf
4 ../test.maf
However there should not be any REF allele changes . output MAF shows dinucleotide "ref" which I fixed in the latest commit for dhtslib. Not sure what is going on
Secondly I noticed that the output MAF file only had one row output, so I am not sure what is going on there yet
OK the first problem (REF changing when it shnouoldn't ; dinucleotides -- which I already fixed) this is an issue where you have updated dub.selections.json to use the ~develop branch (which is fine) , but since it is a remote endpoint it was not updated.
To best deal with this for now, change your dub.json file to read
"dhtslib": "~develop",
"dklib": "~>0.1.1",
"intervaltree": "~generic_basicinterval"
then run dub upgrade
.
Now rebuild
Works fine (except I need to write a header to the unmatched MAF)
blac96@cancer-f5rdy12:~/source/blachlylab/swiftover$ ./swiftover -t maf -c resources/GRCh37_to_GRCh38.chain -g resources/Homo_sapiens.GRCh38.dna.toplevel.fa.gz -b GRCh38 -i resources/test.maf -o /tmp/test.maf -u /tmp/unmatched.maf
[I::swiftover.main.main] 🚀 swift liftover (version: splay trees)
[I::swiftover.maf.liftMAF] Reading chainfile
[I::swiftover.maf.liftMAF] Reading MAF
[I::swiftover.maf.liftMAF] Matched 8 records (0 multiply, skipped) and 0 unmatched; 0 REF allele changes
blac96@cancer-f5rdy12:~/source/blachlylab/swiftover$ wc -l resources/test.maf /tmp/test.maf /tmp/unmatched.maf
10 resources/test.maf
10 /tmp/test.maf
0 /tmp/unmatched.maf
Thanks! This works on the full /databases/cancerhotspots/cancerhotspots.v2.maf
with the following warnings. As you said there shouldn't be any REF allele changes however, especially not 2511300 of them seeing as there were 2511317 matches in total. I've updated as you've said I've changed the dub.json
with the instructed lines and built.
Switching to Homo_sapiens.GRCh38.dna.primary_assembly.fa
for -g
The test runs as you show:
[I::swiftover.main.main] 🚀 swift liftover (version: splay trees)
[I::swiftover.maf.liftMAF] Reading chainfile
[I::swiftover.maf.liftMAF] Reading MAF
[I::swiftover.maf.liftMAF] Matched 1 records (0 multiply, skipped) and 0 unmatched; 0 REF allele changes
The whole file does not however:
[I::swiftover.main.main] 🚀 swift liftover (version: splay trees)
[I::swiftover.maf.liftMAF] Reading chainfile
[I::swiftover.maf.liftMAF] Reading MAF
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
......
......
[I::swiftover.maf.liftMAF] Matched 2511317 records (17 multiply, skipped) and 7731 unmatched; 2511300 REF allele changes
This will induce the warnings. There is also a larger test file called noRefTest.maf in the tools directory that induces 49 of these warnings.
#version 2.4
Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_File Sequencer Tumor_Sample_UUID Matched_Norm_Sample_UUID HGVSc HGVSp HGVSp_Short Transcript_ID Exon_Number t_depth t_ref_count t_alt_count n_depth n_ref_count n_alt_count all_effects Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation ALLELE_NUM DISTANCE STRAND_VEP SYMBOL SYMBOL_SOURCE HGNC_ID BIOTYPE CANONICAL CCDS ENSP SWISSPROT TREMBL UNIPARC RefSeq SIFT PolyPhen EXON INTRON DOMAINS AF AFR_AF AMR_AF ASN_AF EAS_AF EUR_AF SAS_AF AA_AF EA_AF CLIN_SIG SOMATIC PUBMED MOTIF_NAME MOTIF_POS HIGH_INF_POS MOTIF_SCORE_CHANGE IMPACT PICK VARIANT_CLASS TSL HGVS_OFFSET PHENO MINIMISED ExAC_AF ExAC_AF_AFR ExAC_AF_AMR ExAC_AF_EAS ExAC_AF_FIN ExAC_AF_NFE ExAC_AF_OTH ExAC_AF_SAS GENE_PHENO FILTER flanking_bps variant_id variant_qual ExAC_AF_Adj ExAC_AC_AN_Adj ExAC_AC_AN ExAC_AC_AN_AFR ExAC_AC_AN_AMR ExAC_AC_AN_EAS ExAC_AC_AN_FIN ExAC_AC_AN_NFE ExAC_AC_AN_OTH ExAC_AC_AN_SAS ExAC_FILTER gnomAD_AF gnomAD_AFR_AF gnomAD_AMR_AF gnomAD_ASJ_AF gnomAD_EAS_AF gnomAD_FIN_AF gnomAD_NFE_AF gnomAD_OTH_AF gnomAD_SAS_AF TUMORTYPE PLATFORM judgement Amino_Acid_Change Amino_Acid_Position Protein_Lenght Reference_Amino_Acid Variant_Amino_Acid allele_freq tm Amino_Acid_Length Ref_Tri oncotree_organtype oncotree_parent oncotree_detailed Master_ID
C3 718 . GRCh37 19 6697759 6697760 + Frame_Shift_Ins INS - - A novel 01-P034 NA A A c.2486dupT p.Phe830LeufsTer94 p.F830Lfs*94 ENST00000245907 20/41 0 . . 0 . . C3,frameshift_variant,p.Phe830LeufsTer94,ENST00000245907,NM_000064.2;C3,non_coding_transcript_exon_variant,,ENST00000602053,;C3,upstream_gene_variant,,ENST00000598805,;C3,upstream_gene_variant,,ENST00000594005,; A ENSG00000125730 ENST00000245907 Transcript frameshift_variant 2579-2580/5263 2486-2487/4992 829/1663 F/FX ttc/ttTc 1 -1 C3 HGNC 1318 protein_coding YES CCDS32883.1 ENSP00000245907 P01024 Q6LDJ0,M0R1Q1 UPI000013EC9B NM_000064.2 20/41 Pfam_domain:PF00207,hmmpanther:PTHR11412,hmmpanther:PTHR11412:SF81
The first ~200 lines in the original cancer hotspots maf don't cause this warning to appear by the way.
So this is a warning that indicates something called the "tumor" allele in the MAF was not the refrence allele in the original genome (actually I may not be making that check explicitly -- will add), but IS now the reference allele in the new, destination genome. This happens from time to time, but is surprising in the context of what we thought was a somatic mutation hotspot.
That being said, it is not an error which is why it is emitted as a warning. However, I should possibly find some other way to flag these so they can be examined
This is a problem though:
[I::swiftover.maf.liftMAF] Matched 2511317 records (17 multiply, skipped) and 7731 unmatched; 2511300 REF allele changes
Ah apologies, my brain translated it to error as due to the amount of them it was no longer a warning in my eyes. I would assume Swiftover is classifying tumor allele as the ALT.
Also I agree that maybe sometimes a few REF-->ALT changes would be expected, the exact number I'm not sure, but it does seem odd that it changes in a recombinant hotspot location. Due to their rapid evolution in general, I would think those that are stable enough to be cataloged as what I would call "population hotspots" or "sub-population hotspots" (depending on how in-depth the originating population group was) wouldn't have as a dramatic change in a liftover as say if this was a random non-ref individual.
Any-who, I'll dig around about whether some ways/methods to save some of the unmatched if any corrections can be done after.
So make sure you have recompiled with the latest dhtslib and intervaltree dependencies.
Here is what I get:
blac96@cancer-f5rdy12:~/source/blachlylab/swiftover$ time ./swiftover -t maf -c resources/GRCh37_to_GRCh38.chain -g resources/Homo_sapiens
.GRCh38.dna.toplevel.fa.gz -b GRCh38 -i ~kana18/databases/cancerhotspots/cancerhotspots.v2.maf -o /tmp/hotspots_splay.maf -u /tmp/unmapped_
splay.maf
[I::swiftover.main.main] 🚀 swift liftover (version: splay trees)
[I::swiftover.maf.liftMAF] Reading chainfile
[I::swiftover.maf.liftMAF] Reading MAF
[W::swiftover.maf.liftMAF] New REF matches tumor allele
...
[I::swiftover.maf.liftMAF] Matched 2511317 records (17 multiply, skipped) and 7731 unmatched; 56443 REF allele changes
real 13m2.187s
user 12m17.739s
sys 0m33.966s
When I look at your output MAF, it shows tha tyou are still running a binary compiled with a version of dhtslib that misses the latest commit, which fixes the coordinate translation problem. I have no idea why this is.
@Kekananen I recompiled it for you. not sure wh y you had old cached lib functions. It works now
Make sure you are doing conda deactivate
(or preferably just remove the auto activation of your env from your login files) before trying to compile things
kana18@cancer-f5rdy12:~/tools/swiftover$ ./swiftover -t maf -c resources/GRCh37_to_GRCh38.chain -g ~blac96/source/blachlylab/swiftover/resources/Homo_sapiens.GRCh38.dna.toplevel.fa.gz -b GRCh38 -i <(head -100000 ../../databases/cancerhotspots/cancerhotspots.v2.maf) -o /tmp/kana
_hotspots.maf -u /tmp/kana_unmatched.maf
[I::swiftover.main.main] 🚀 swift liftover (version: splay trees)
[I::swiftover.maf.liftMAF] Reading chainfile
[I::swiftover.maf.liftMAF] Reading MAF
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[I::swiftover.maf.liftMAF] Matched 99692 records (0 multiply, skipped) and 306 unmatched; 10905 REF allele changes
Note the reason there are about 10% REF allele changes is actually because of the way MAF represents insertions specifically (should be represented as -
, despite that they include start and end coordinates corresponding to the lenght of the insertion, which seems strange. I made a new issue in the repo for this.
I think I've removed conda once again from my .bashrc
, though I've removed it prior to this so I must be doing something to re add it's auto activation since it crept back in. I'll take some time this weekend to do some house keeping and make sure everything is tidy once again.
I'd be interested in knowing how you recompiled it differently from me so I can also compile later in that way if need be. By the hint of it though, sounds like you might have just partially cleared a cache somewhere.
Thanks for getting this implemented.
Commit e6990e796f7bfafe209c08eb5e8f27da8c68df2e brings MAF support close to completion
With the newest commit after the update using the same example file above the following is thrown:
[I::swiftover.main.main] 🚀 swift liftover (version: splay trees)
[W::swiftover.maf.liftMAF] Only fields build, chr, start, end, ref/tumor1/tumor2 are updated.
[W::swiftover.maf.liftMAF] A reference allele change might invalidate other fields.
[I::swiftover.maf.liftMAF] Reading chainfile
core.exception.AssertError@source/swiftover/chain.d(294): Assertion failure
----------------
??:? [0x557ecbedf6a5]
??:? [0x557ecbeeb026]
??:? [0x557ecbecd1bd]
??:? [0x557ecbec4fcc]
chain.d:294 [0x557ecbe5415f]
chain.d:557 [0x557ecbe3fe6d]
maf.d:132 [0x557ecbe48b70]
main.d:112 [0x557ecbe4642c]
??:? [0x557ecbecce8b]
??:? [0x557ecbeccd87]
??:? [0x557ecbeccbdd]
/home/OSUMC.EDU/kana18/dlang/ldc-1.24.0/bin/../import/core/internal/entrypoint.d:42 [0x557ecbe4a8e4]
??:? __libc_start_main [0x7f33be3c3bf6]
??:? [0x557ecbe3df59]
Debug mode shows this
[I::swiftover.main.main] 🚀 swift liftover (version: splay trees)
[W::swiftover.maf.liftMAF] Only fields build, chr, start, end, ref/tumor1/tumor2 are updated.
[W::swiftover.maf.liftMAF] A reference allele change might invalidate other fields.
[I::swiftover.maf.liftMAF] Reading chainfile
std.conv.ConvException@/home/OSUMC.EDU/kana18/dlang/ldc-1.24.0/bin/../import/std/conv.d(2382): Unexpected end of input when converting from type char[] to type
----------------
Using ./swiftover -t maf -c resources/GRCh37_to_GRCh38.chain.gz -g $HOME/databases/genomes/Homo_sapiens.GRCh38.dna.primary_assembly.fa -b GRCh38 -i $HOME/tools/test.maf -u $HOME/tools/unmatched.maf -o $HOME/tools/test.38.maf
I have checked and have the latest dependencies and this persists on the larger file. Let me know if I can help in any way to fix this!
https://github.com/blachlylab/swiftover/issues/23#issuecomment-737426268
Requirements Chain file If you are working with human data, you can quickly grab hg19 to hg38 (UCSC-style contig naming: chr1, chrM) and GRCh37 to GRCh38 (Ensembl-style contig naming: 1, MT) by issuing make chains. Chainfiles will be placed in resources/, and for the time being must be un-gzipped.
Probably I should add a test to see if the filename ends in .gz, .bz2, etc.
Indeed, this was a quick resolve. The file is lifted over beautifully now.
@Kekananen check out the 'maf' branch