blachlylab / swiftover

Fast liftover
GNU General Public License v3.0
1 stars 0 forks source link

MAF format #23

Open jblachly opened 3 years ago

jblachly commented 3 years ago

@Kekananen check out the 'maf' branch

jblachly commented 3 years ago

Requires latest dhtslib develop branch commit, and latest intervaltree generic_basicinterval branch. Compiles, probably works minus needing to re-reference some alleles ala VCF etc. Not tested.

jblachly commented 3 years ago

@charlesgregory feel free to help with compilation and testing

jblachly commented 3 years ago

@Kekananen and @charlesgregory test the latest version which should work with MAF; only missing case(???) is if there are multiple output intervals for a single input row, we punt on this and skip (but should report this metric). This should be rare, may show up with SVs or long-ish indels.

@Kekananen would be interested in knowing in real world liftover how often this is seen

Kekananen commented 3 years ago

@jblachly After compiling, I think one of the main issues is that start and end are actually labeled Start_Position and End_Position in MAF files. This holds true for current ones on DNAnexus' as well as for the cancerhotspots' db one. I've fiddled it down to fixing where those are in the script, i.e long start = fields.data[MAF.Start_Position].to!int;. As well as changing other relating values after and prior.

However, after fixing the fields, auto trimmedLinks = cf.lift(fields.data[MAF.Contig], start, end); is throwing Segmentation fault (core dumped) which tracks back to version(commonAPI) auto o = this.chainsByContig[contig].findOverlapsWith(i); // returns Node*(s) in the chain.d script not working.

jblachly commented 3 years ago

Nothing in the code (should) depends on the header line's field names. MAF is an enum and is just an integer lookup into the fields array.

If you head the first 10 lines of a MAF file and test that, does it crash, or is it some line somewhere later in the MAF that crashes it? If the former, can you paste in those 10 lines so i can reproduce

Kekananen commented 3 years ago

Still is crashing. Here is the first 10 lines it crashes on.

#version 2.4
Hugo_Symbol     Entrez_Gene_Id  Center  NCBI_Build      Chromosome      Start_Position  End_Position    Strand  Variant_Classification  Variant_Type    Reference_Allele        Tumor_Seq_Allele1       Tumor_Seq_Allele2       dbSNP_RS        dbSNP_Val_Status        Tumor_Sample_Barcode    Matched_Norm_Sample_Barcode     Match_Norm_Seq_Allele1  Match_Norm_Seq_Allele2  Tumor_Validation_Allele1        Tumor_Validation_Allele2        Match_Norm_Validation_Allele1   Match_Norm_Validation_Allele2   Verification_Status     Validation_Status       Mutation_Status Sequencing_Phase        Sequence_Source Validation_Method       Score   BAM_File        Sequencer       Tumor_Sample_UUID       Matched_Norm_Sample_UUID        HGVSc   HGVSp   HGVSp_Short     Transcript_ID   Exon_Number     t_depth t_ref_count     t_alt_count     n_depth n_ref_count     n_alt_count     all_effects     Allele  Gene    Feature Feature_type    Consequence     cDNA_position   CDS_position    Protein_position        Amino_acids     Codons  Existing_variation      ALLELE_NUM      DISTANCE        STRAND_VEP      SYMBOL  SYMBOL_SOURCE   HGNC_ID BIOTYPE CANONICAL       CCDS    ENSP    SWISSPROT       TREMBL  UNIPARC RefSeq  SIFT    PolyPhen        EXON    INTRON  DOMAINS AF      AFR_AF  AMR_AF  ASN_AF  EAS_AF  EUR_AF  SAS_AF  AA_AF   EA_AF   CLIN_SIG        SOMATIC PUBMED  MOTIF_NAME      MOTIF_POS       HIGH_INF_POS    MOTIF_SCORE_CHANGE      IMPACT  PICK    VARIANT_CLASS   TSL     HGVS_OFFSET     PHENO   MINIMISED       ExAC_AF ExAC_AF_AFR     ExAC_AF_AMR     ExAC_AF_EAS     ExAC_AF_FIN     ExAC_AF_NFE     ExAC_AF_OTH     ExAC_AF_SAS     GENE_PHENO      FILTER  flanking_bps    variant_id      variant_qual    ExAC_AF_Adj     ExAC_AC_AN_Adj  ExAC_AC_AN      ExAC_AC_AN_AFR  ExAC_AC_AN_AMR  ExAC_AC_AN_EAS  ExAC_AC_AN_FIN  ExAC_AC_AN_NFE  ExAC_AC_AN_OTH  ExAC_AC_AN_SAS  ExAC_FILTER     gnomAD_AF       gnomAD_AFR_AF   gnomAD_AMR_AF   gnomAD_ASJ_AF   gnomAD_EAS_AF   gnomAD_FIN_AF   gnomAD_NFE_AF   gnomAD_OTH_AF   gnomAD_SAS_AF   TUMORTYPE       PLATFORM        judgement       Amino_Acid_Change       Amino_Acid_Position     Protein_Lenght  Reference_Amino_Acid    Variant_Amino_Acid      allele_freq     tm      Amino_Acid_Length       Ref_Tri oncotree_organtype      oncotree_parent oncotree_detailed       Master_ID
WARS2   10352   .       GRCh37  1       119575617       119575617       +       Missense_Mutation       SNP     C       C       T       novel           000236  NORMAL  C       C                                                                                                                               c.1000G>A       p.Val334Ile     p.V334I ENST00000235521 6/6     0       .       .       0       .       .       WARS2,missense_variant,p.Val334Ile,ENST00000235521,NM_201263.2,NM_015836.3;WARS2,missense_variant,p.Val240Ile,ENST00000537870,;WARS2,3_prime_UTR_variant,,ENST00000369426,;WARS2,downstream_gene_variant,,ENST00000497402,;WARS2,downstream_gene_variant,,ENST00000495746,;     T       ENSG00000116874 ENST00000235521 Transcript      missense_variant        1027/2800       1000/1083       334/360 V/I     Gtt/Att         1               -1      WARS2   HGNC    12730   protein_coding  YES     CCDS900.1       ENSP00000235521 Q9UGM6  B7Z5X7  UPI000004A002   NM_201263.2,NM_015836.3 tolerated(0.31) benign(0.015)   6/6             Gene3D:1.10.240.10,HAMAP:MF_00140_B,hmmpanther:PTHR10055,Low_complexity_(Seg):seg,Superfamily_domains:SSF52374,TIGRFAM_domain:TIGR00233                                                                                                                                 MODERATE        1       SNV                                                                                                             .       ACC     .       .                                                                                                                                                                       acyc    exome   RETAIN  V334I   334             V       I       NA      WARS2 334       360     ACC     headandneck     saca    acyc    000236
OPN3    23596   .       GRCh37  1       241761094       241761094       +       Missense_Mutation       SNP     G       G       A       rs780348058             000236  NORMAL  G       G                                                                                                                               c.899C>T        p.Ser300Leu     p.S300L ENST00000366554 3/4     0       .       .       0       .       .       OPN3,missense_variant,p.Ser300Leu,ENST00000366554,NM_014322.2;OPN3,missense_variant,p.Ser221Leu,ENST00000331838,;KMO,downstream_gene_variant,,ENST00000366559,NM_003679.4;KMO,downstream_gene_variant,,ENST00000366557,;KMO,downstream_gene_variant,,ENST00000366555,;OPN3,non_coding_transcript_exon_variant,,ENST00000469376,;OPN3,non_coding_transcript_exon_variant,,ENST00000490673,;OPN3,non_coding_transcript_exon_variant,,ENST00000478849,;OPN3,non_coding_transcript_exon_variant,,ENST00000463155,;OPN3,non_coding_transcript_exon_variant,,ENST00000462265,;        A       ENSG00000054277 ENST00000366554 Transcript      missense_variant        1006/2620       899/1209        300/402 S/L     tCg/tTg rs780348058     1               -1      OPN3    HGNC    14007   protein_coding  YES     CCDS31072.1     ENSP00000355512 Q9H1Y3          UPI000000165B   NM_014322.2     deleterious(0.02)       possibly_damaging(0.692)        3/4             Transmembrane_helices:TMhelix,PROSITE_profiles:PS50262,hmmpanther:PTHR24240:SF64,hmmpanther:PTHR24240,PROSITE_patterns:PS00238,Gene3D:1.20.1070.10,Pfam_domain:PF00001,Superfamily_domains:SSF81321,Prints_domain:PR00237                                                                                                                                       MODERATE        1       SNV                                     9.415e-06       0       0       0.0001278       0       0       0       0               .       CGA     .       .       9.426e-06       1/106086        1/106208        0/9066  0/11158 1/7822  0/6612  0/54326 0/694   0/16408 PASS                                                                            acyc    exome   RETAIN  S300L   300             S       L       NA      OPN3 300        402     TCG     headandneck     saca    acyc    000236
NAA16   79612   .       GRCh37  13      41894863        41894863        +       Missense_Mutation       SNP     G       G       A       novel           000236  NORMAL  G       G                                                                                                                               c.305G>A        p.Cys102Tyr     p.C102Y ENST00000379406 4/20    0       .       .       0       .       .       NAA16,missense_variant,p.Cys102Tyr,ENST00000379406,NM_024561.4;NAA16,missense_variant,p.Cys102Tyr,ENST00000379367,;NAA16,missense_variant,p.Cys102Tyr,ENST00000403412,NM_018527.3,NM_001110798.1;NAA16,non_coding_transcript_exon_variant,,ENST00000476980,;NAA16,missense_variant,p.Cys102Tyr,ENST00000464857,;        A       ENSG00000172766 ENST00000379406 Transcript      missense_variant        629/4347        305/2595        102/864 C/Y     tGt/tAt         1               1       NAA16   HGNC    26164   protein_coding  YES     CCDS9379.1      ENSP00000368716 Q6N069  A4FU51  UPI00001B559E   NM_024561.4     deleterious(0)  probably_damaging(0.995)        4/20            Gene3D:1.25.40.10,Pfam_domain:PF13414,PIRSF_domain:PIRSF000422,PROSITE_profiles:PS50005,PROSITE_profiles:PS50293,hmmpanther:PTHR22767,hmmpanther:PTHR22767:SF5,SMART_domains:SM00028,Superfamily_domains:SSF48452                                                                                                                                       MODERATE        1       SNV                                                                                                     1       .       TGT     .       .                                                                                                                                                                       acyc    exome   RETAIN  C102Y   102             C       Y       NA      NAA16 102       864     ACA     headandneck     saca    acyc    000236
DHODH   1723    .       GRCh37  16      72056354        72056354        +       Missense_Mutation       SNP     A       A       C       novel           000236  NORMAL  A       A                                                                                                                               c.799A>C        p.Ile267Leu     p.I267L ENST00000219240 6/9     0       .       .       0       .       .       DHODH,missense_variant,p.Ile267Leu,ENST00000572887,;DHODH,missense_variant,p.Ile267Leu,ENST00000219240,NM_001361.4;DHODH,intron_variant,,ENST00000574309,;DHODH,intron_variant,,ENST00000573922,;DHODH,intron_variant,,ENST00000571392,;DHODH,downstream_gene_variant,,ENST00000573843,;DHODH,downstream_gene_variant,,ENST00000572003,;        C       ENSG00000102967 ENST00000219240 Transcript      missense_variant        820/2065        799/1188        267/395 I/L     Att/Ctt         1               1       DHODH   HGNC    2867    protein_coding  YES     CCDS42192.1     ENSP00000219240 Q02127  J3QRQ3  UPI00001FF5FB   NM_001361.4     deleterious(0)  probably_damaging(0.99) 6/9             hmmpanther:PTHR11938,Gene3D:3.20.20.70,Pfam_domain:PF01180,TIGRFAM_domain:TIGR01036,PIRSF_domain:PIRSF000164,Superfamily_domains:SSF51395                                                                                                                                       MODERATE        1       SNV                                                                                                     1       .       CAT     .       .                                                                                                                                                                       acyc    exome   RETAIN  I267L   267             I       L       NA      DHODH 267       395     ATG     headandneck     saca    acyc    000236
KRTAP21-1       337977  .       GRCh37  21      32127473        32127473        +       Missense_Mutation       SNP     C       C       T       novel           000236  NORMAL  C       C                                                                                                                               c.224G>A        p.Cys75Tyr      p.C75Y  ENST00000335093 1/1     0       .       .       0       .       .       KRTAP21-1,missense_variant,p.Cys75Tyr,ENST00000335093,NM_181619.1;      T       ENSG00000187005 ENST00000335093 Transcript      missense_variant        274/308 224/240 75/79   C/Y     tGc/tAc         1               -1      KRTAP21-1       HGNC    18945   protein_coding  YES     CCDS13606.1     ENSP00000335566 Q3LI58          UPI00001A9E4A   NM_181619.1     deleterious_low_confidence(0)   unknown(0)      1/1             hmmpanther:PTHR31294,hmmpanther:PTHR31294:SF3                                                                                                                                   MODERATE        1       SNV                                                                                                             .       GCA     .       .                                                                                                                                                                       acyc    exome   RETAIN  C75Y    75              C       Y       NA      KRTAP21-1 75    79      GCA     headandneck     saca    acyc    000236
PTPRG   5793    .       GRCh37  3       62204575        62204575        +       Nonsense_Mutation       SNP     G       G       T       novel           000380  NORMAL  G       G                                                                                                                               c.2206G>T       p.Glu736Ter     p.E736* ENST00000474889 13/30   0       .       .       0       .       .       PTPRG,stop_gained,p.Glu736Ter,ENST00000474889,NM_002841.3;PTPRG,stop_gained,p.Glu736Ter,ENST00000295874,;PTPRG,non_coding_transcript_exon_variant,,ENST00000475012,;    T       ENSG00000144724 ENST00000474889 Transcript      stop_gained     2583/9021       2206/4338       736/1445        E/*     Gag/Tag         1               1       PTPRG   HGNC    9671    protein_coding  YES     CCDS2895.1      ENSP00000418112 P23470  O60420  UPI00001AEBFB   NM_002841.3                     13/30           hmmpanther:PTHR19134,hmmpanther:PTHR19134:SF189                                                                                                                                 HIGH    1       SNV                                                                                                     1       .       GGA     .       .                                                                                                                                                                       acyc    exome   RETAIN  E736*   736             E       *       NA      PTPRG 736       1445    TCC     headandneck     saca    acyc    000380
SIPA1L2 57568   .       GRCh37  1       232539886       232539886       +       Missense_Mutation       SNP     G       G       A       rs867423404             000410  NORMAL  G       G                                                                                                                               c.4801C>T       p.His1601Tyr    p.H1601Y        ENST00000366630 19/22   0       .       .       0       .       .       SIPA1L2,missense_variant,p.His1601Tyr,ENST00000366630,;SIPA1L2,missense_variant,p.His1601Tyr,ENST00000262861,NM_020808.3;SIPA1L2,intron_variant,,ENST00000308942,;SIPA1L2,intron_variant,,ENST00000495863,;SIPA1L2,upstream_gene_variant,,ENST00000494056,;     A       ENSG00000116991 ENST00000366630 Transcript      missense_variant        5160/6690       4801/5169       1601/1722       H/Y     Cac/Tac rs867423404     1               -1      SIPA1L2 HGNC    23800   protein_coding  YES     CCDS41474.1     ENSP00000355589 Q9P2F8          UPI00001D7D6A           deleterious(0.01)       possibly_damaging(0.545)        19/22           Pfam_domain:PF11881,hmmpanther:PTHR15711:SF7,hmmpanther:PTHR15711                                                                                                                                       MODERATE        1       SNV                                                                                                             .       TGG     .       .                                                                                                                                                                       acyc    exome   RETAIN  H1601Y  1601            H       Y       NA      SIPA1L2 1601    1722    CCA     headandneck     saca    acyc    000410
FGD6    55785   .       GRCh37  12      95483365        95483365        +       Missense_Mutation       SNP     T       T       C       novel           000410  NORMAL  T       T                                                                                                                               c.3958A>G       p.Ile1320Val    p.I1320V        ENST00000343958 18/21   0       .       .       0       .       .       FGD6,missense_variant,p.Ile1320Val,ENST00000343958,NM_018351.3;FGD6,missense_variant,p.Ile1320Val,ENST00000546711,;FGD6,missense_variant,p.Ile64Val,ENST00000548069,;FGD6,downstream_gene_variant,,ENST00000549499,;FGD6,downstream_gene_variant,,ENST00000551521,;FGD6,3_prime_UTR_variant,,ENST00000451107,;  C       ENSG00000180263 ENST00000343958 Transcript      missense_variant        4182/9288       3958/4293       1320/1430       I/V     Atc/Gtc         1               -1      FGD6    HGNC    21740   protein_coding  YES     CCDS31878.1     ENSP00000344446 Q6ZV73  F8VY01  UPI00001FB2F4   NM_018351.3     tolerated(0.15) benign(0.168)   18/21           hmmpanther:PTHR12673:SF12,hmmpanther:PTHR12673,Superfamily_domains:SSF50729                                                                                                                                     MODERATE        1       SNV                                                                                                             .       ATT     .       .                                                                                                                                                                       acyc    exome   RETAIN  I1320V  1320            I       V       NA      FGD6 1320       1430    ATT     headandneck     saca    acyc    000410
jblachly commented 3 years ago

LMK if adding an if condition to skip the second, un-hash-marked header line fixes it

Kekananen commented 3 years ago

Apologies in the delay in updating on this; after adding in a few different style of if statements (though not at once), the result either stayed the same. The first case being that it seems to be read in Unexpected 'S' when converting from type char[] to type int still even with if statements to element the line starting with Hugo or Hugo_Symbol as I tried various ways in the conditionals.

In the case of editing the file directly to have a # for the second header line then Segmentation fault (core dumped) was thrown. This is something I may have already tracked down prior to line auto o = this.chainsByContig[contig].findOverlapsWith(i); // returns Node*(s) assuming the previous fixes I did are implemented.

This is with the same dataset as above or with the entire set from cosmic. The only change being a #Hugo_Symbol on line 2 if trying the core dumped error.

jblachly commented 3 years ago

Can you please reduce the failure case to a single row

Kekananen commented 3 years ago

If you're referring to the dataset:

#version 2.4
Hugo_Symbol     Entrez_Gene_Id  Center  NCBI_Build      Chromosome      Start_Position  End_Position    Strand  Variant_Classification  Variant_Type    Reference_Allele        Tumor_Seq_Allele1       Tumor_Seq_Allele2       dbSNP_RS        dbSNP_Val_Status        Tumor_Sample_Barcode    Matched_Norm_Sample_Barcode     Match_Norm_Seq_Allele1  Match_Norm_Seq_Allele2  Tumor_Validation_Allele1        Tumor_Validation_Allele2        Match_Norm_Validation_Allele1   Match_Norm_Validation_Allele2   Verification_Status     Validation_Status       Mutation_Status Sequencing_Phase        Sequence_Source Validation_Method       Score   BAM_File        Sequencer       Tumor_Sample_UUID       Matched_Norm_Sample_UUID        HGVSc   HGVSp   HGVSp_Short     Transcript_ID   Exon_Number     t_depth t_ref_count     t_alt_count     n_depth n_ref_count     n_alt_count     all_effects     Allele  Gene    Feature Feature_type    Consequence     cDNA_position   CDS_position    Protein_position        Amino_acids     Codons  Existing_variation      ALLELE_NUM      DISTANCE        STRAND_VEP      SYMBOL  SYMBOL_SOURCE   HGNC_ID BIOTYPE CANONICAL       CCDS    ENSP    SWISSPROT       TREMBL  UNIPARC RefSeq  SIFT    PolyPhen        EXON    INTRON  DOMAINS AF      AFR_AF  AMR_AF  ASN_AF  EAS_AF  EUR_AF  SAS_AF  AA_AF   EA_AF   CLIN_SIG        SOMATIC PUBMED  MOTIF_NAME      MOTIF_POS       HIGH_INF_POS    MOTIF_SCORE_CHANGE      IMPACT  PICK    VARIANT_CLASS   TSL     HGVS_OFFSET     PHENO   MINIMISED       ExAC_AF ExAC_AF_AFR     ExAC_AF_AMR     ExAC_AF_EAS     ExAC_AF_FIN     ExAC_AF_NFE     ExAC_AF_OTH     ExAC_AF_SAS     GENE_PHENO      FILTER  flanking_bps    variant_id      variant_qual    ExAC_AF_Adj     ExAC_AC_AN_Adj  ExAC_AC_AN      ExAC_AC_AN_AFR  ExAC_AC_AN_AMR  ExAC_AC_AN_EAS  ExAC_AC_AN_FIN  ExAC_AC_AN_NFE  ExAC_AC_AN_OTH  ExAC_AC_AN_SAS  ExAC_FILTER     gnomAD_AF       gnomAD_AFR_AF   gnomAD_AMR_AF   gnomAD_ASJ_AF   gnomAD_EAS_AF   gnomAD_FIN_AF   gnomAD_NFE_AF   gnomAD_OTH_AF   gnomAD_SAS_AF   TUMORTYPE       PLATFORM        judgement       Amino_Acid_Change       Amino_Acid_Position     Protein_Lenght  Reference_Amino_Acid    Variant_Amino_Acid      allele_freq     tm      Amino_Acid_Length       Ref_Tri oncotree_organtype      oncotree_parent oncotree_detailed       Master_ID
WARS2   10352   .       GRCh37  1       119575617       119575617       +       Missense_Mutation       SNP     C       C       T       novel           000236  NORMAL  C       C                                                                                                                               c.1000G>A       p.Val334Ile     p.V334I ENST00000235521 6/6     0       .       .       0       .       .       WARS2,missense_variant,p.Val334Ile,ENST00000235521,NM_201263.2,NM_015836.3;WARS2,missense_variant,p.Val240Ile,ENST00000537870,;WARS2,3_prime_UTR_variant,,ENST00000369426,;WARS2,downstream_gene_variant,,ENST00000497402,;WARS2,downstream_gene_variant,,ENST00000495746,;     T       ENSG00000116874 ENST00000235521 Transcript      missense_variant        1027/2800       1000/1083       334/360 V/I     Gtt/Att         1               -1      WARS2   HGNC    12730   protein_coding  YES     CCDS900.1       ENSP00000235521 Q9UGM6  B7Z5X7  UPI000004A002   NM_201263.2,NM_015836.3 tolerated(0.31) benign(0.015)   6/6             Gene3D:1.10.240.10,HAMAP:MF_00140_B,hmmpanther:PTHR10055,Low_complexity_(Seg):seg,Superfamily_domains:SSF52374,TIGRFAM_domain:TIGR00233                                                                                                                                 MODERATE        1       SNV                                                                                                             .       ACC     .       .                                                                                                                                                                       acyc    exome   RETAIN  V334I   334             V       I       NA      WARS2 334       360     ACC     headandneck     saca    acyc    000236

Then the above throws: Unexpected 'S' when converting from type char[] to type int

jblachly commented 3 years ago

commits up to 9f407db59d0d2351ec3ee4dadefa3c7402299ab4 should fix your issues ; there may be other problems for larger indels spanning many nt, but this should work for all SNVs. Let me know

You'll need latest commit in dhtslib because I discovered coordinate translation bug in the faidx fetch sequence routine

Kekananen commented 3 years ago

Using htslib up to that commit, it's now throwing:

[I::swiftover.main.main] 🚀 swift liftover (version: splay trees)
[I::swiftover.maf.liftMAF] Reading chainfile
[I::swiftover.maf.liftMAF] Reading MAF
Segmentation fault (core dumped)

This is using any file including the prior one or multi liner test file, or the original.

#version 2.4
Hugo_Symbol     Entrez_Gene_Id  Center  NCBI_Build      Chromosome      Start_Position  End_Position    Strand  Variant_Classification  Variant_Type    Reference_Allele        Tumor_Seq_Allele1       Tumor_Seq_Allele2       dbSNP_RS        dbSNP_Val_Status        Tumor_Sample_Barcode    Matched_Norm_Sample_Barcode     Match_Norm_Seq_Allele1  Match_Norm_Seq_Allele2  Tumor_Validation_Allele1        Tumor_Validation_Allele2        Match_Norm_Validation_Allele1   Match_Norm_Validation_Allele2   Verification_Status     Validation_Status       Mutation_Status Sequencing_Phase        Sequence_Source Validation_Method       Score   BAM_File        Sequencer       Tumor_Sample_UUID       Matched_Norm_Sample_UUID        HGVSc   HGVSp   HGVSp_Short     Transcript_ID   Exon_Number     t_depth t_ref_count     t_alt_count     n_depth n_ref_count     n_alt_count     all_effects     Allele  Gene    Feature Feature_type    Consequence     cDNA_position   CDS_position    Protein_position        Amino_acids     Codons  Existing_variation      ALLELE_NUM      DISTANCE        STRAND_VEP      SYMBOL  SYMBOL_SOURCE   HGNC_ID BIOTYPE CANONICAL       CCDS    ENSP    SWISSPROT       TREMBL  UNIPARC RefSeq  SIFT    PolyPhen        EXON    INTRON  DOMAINS AF      AFR_AF  AMR_AF  ASN_AF  EAS_AF  EUR_AF  SAS_AF  AA_AF   EA_AF   CLIN_SIG        SOMATIC PUBMED  MOTIF_NAME      MOTIF_POS       HIGH_INF_POS    MOTIF_SCORE_CHANGE      IMPACT  PICK    VARIANT_CLASS   TSL     HGVS_OFFSET     PHENO   MINIMISED       ExAC_AF ExAC_AF_AFR     ExAC_AF_AMR     ExAC_AF_EAS     ExAC_AF_FIN     ExAC_AF_NFE     ExAC_AF_OTH     ExAC_AF_SAS     GENE_PHENO      FILTER  flanking_bps    variant_id      variant_qual    ExAC_AF_Adj     ExAC_AC_AN_Adj  ExAC_AC_AN      ExAC_AC_AN_AFR  ExAC_AC_AN_AMR  ExAC_AC_AN_EAS  ExAC_AC_AN_FIN  ExAC_AC_AN_NFE  ExAC_AC_AN_OTH  ExAC_AC_AN_SAS  ExAC_FILTER     gnomAD_AF       gnomAD_AFR_AF   gnomAD_AMR_AF   gnomAD_ASJ_AF   gnomAD_EAS_AF   gnomAD_FIN_AF   gnomAD_NFE_AF   gnomAD_OTH_AF   gnomAD_SAS_AF   TUMORTYPE       PLATFORM        judgement       Amino_Acid_Change       Amino_Acid_Position     Protein_Lenght  Reference_Amino_Acid    Variant_Amino_Acid      allele_freq     tm      Amino_Acid_Length       Ref_Tri oncotree_organtype      oncotree_parent oncotree_detailed       Master_ID
WARS2   10352   .       GRCh37  1       119575617       119575617       +       Missense_Mutation       SNP     C       C       T       novel           000236  NORMAL  C       C                                                                                                                               c.1000G>A       p.Val334Ile     p.V334I ENST00000235521 6/6     0       .       .       0       .       .       WARS2,missense_variant,p.Val334Ile,ENST00000235521,NM_201263.2,NM_015836.3;WARS2,missense_variant,p.Val240Ile,ENST00000537870,;WARS2,3_prime_UTR_variant,,ENST00000369426,;WARS2,downstream_gene_variant,,ENST00000497402,;WARS2,downstream_gene_variant,,ENST00000495746,;     T       ENSG00000116874 ENST00000235521 Transcript      missense_variant        1027/2800       1000/1083       334/360 V/I     Gtt/Att         1               -1      WARS2   HGNC    12730   protein_coding  YES     CCDS900.1       ENSP00000235521 Q9UGM6  B7Z5X7  UPI000004A002   NM_201263.2,NM_015836.3 tolerated(0.31) benign(0.015)   6/6             Gene3D:1.10.240.10,HAMAP:MF_00140_B,hmmpanther:PTHR10055,Low_complexity_(Seg):seg,Superfamily_domains:SSF52374,TIGRFAM_domain:TIGR00233                                                                                                                                 MODERATE        1       SNV                                                                                                             .       ACC     .       .                                                                                                                                                                       acyc    exome   RETAIN  V334I   334             V       I       NA      WARS2 334       360     ACC     headandneck     saca    acyc    000236
Kekananen commented 3 years ago

Okay, after rechecking everything: htslib is at the newest commit 238600d2e0401b893ead71f967f0488f5d144e63 dhtslib is also at the newest commit 85493f200541037604820608c5ad041c74a20bf9 path is set at LD_LIBRARY_PATH=/tools/htslib/lib/ LIBRARY_PATH=/tools/htslib/lib/

command as follows: ./swiftover -t maf -c resources/GRCh37_to_GRCh38.chain.gz -g ../../databases/GRCh38.d1.vd1.fa -b GRCh38 -i ../test.maf -u unmatched.hotspots.maf > test.maf

error is then thrown as follows:

[I::swiftover.main.main] 🚀 swift liftover (version: splay trees)
[I::swiftover.maf.liftMAF] Reading chainfile
std.conv.ConvException@/home/OSUMC.EDU/kana18/dlang/ldc-1.24.0/bin/../import/std/conv.d(2382): Unexpected end of input when converting from type char[] to type long
----------------
??:? [0x55d625b29065]
??:? [0x55d625b349e6]
??:? [0x55d625b16b7d]
??:? [0x55d625ad9f1b]
??:? [0x55d625ad9d57]
??:? [0x55d625ac5e15]
??:? [0x55d625abb451]
??:? [0x55d625ac151c]
??:? [0x55d625abee7b]
??:? [0x55d625b1684b]
??:? [0x55d625b16747]
??:? [0x55d625b1659d]
??:? __libc_start_main [0x7fe56474bbf6]
??:? [0x55d625ab9ef9]
jblachly commented 3 years ago

Requirements Chain file If you are working with human data, you can quickly grab hg19 to hg38 (UCSC-style contig naming: chr1, chrM) and GRCh37 to GRCh38 (Ensembl-style contig naming: 1, MT) by issuing make chains. Chainfiles will be placed in resources/, and for the time being must be un-gzipped.

jblachly commented 3 years ago

Also note that your GRCh38 fasta file uses chr1 style naming, so I had to use a different destination genome. Alternatiively, you can also find chain files that in addition to lifting over coordinates change contig names, so the chain file would be along the lines of GRCh37_to_Hg38.chain

blac96@cancer-f5rdy12:/home/OSUMC.EDU/kana18/tools/swiftover$ ./swiftover -t maf -c ~/source/blachlylab/swiftover/resources/GRCh37_to_GRCh38.chain -g ~/source/blachlylab/swiftover/resources/Homo_sapiens.GRCh38.dna.toplevel.fa.gz -b ncbi_build -i ../test.maf -u /tmp/unmatched.hotspots.maf -o /tmp/test.maf
[I::swiftover.main.main] 🚀 swift liftover (version: splay trees)
[I::swiftover.maf.liftMAF] Reading chainfile
[I::swiftover.maf.liftMAF] Reading MAF
[I::swiftover.maf.liftMAF] Matched 2 records (0 multiply, skipped) and 0 unmatched; 2 REF allele changes
blac96@cancer-f5rdy12:/home/OSUMC.EDU/kana18/tools/swiftover$ wc -l ../test.maf
4 ../test.maf
jblachly commented 3 years ago

However there should not be any REF allele changes . output MAF shows dinucleotide "ref" which I fixed in the latest commit for dhtslib. Not sure what is going on

Secondly I noticed that the output MAF file only had one row output, so I am not sure what is going on there yet

jblachly commented 3 years ago

OK the first problem (REF changing when it shnouoldn't ; dinucleotides -- which I already fixed) this is an issue where you have updated dub.selections.json to use the ~develop branch (which is fine) , but since it is a remote endpoint it was not updated.

To best deal with this for now, change your dub.json file to read

        "dhtslib":      "~develop",
        "dklib":        "~>0.1.1",
        "intervaltree": "~generic_basicinterval"

then run dub upgrade.

Now rebuild

jblachly commented 3 years ago

Works fine (except I need to write a header to the unmatched MAF)

blac96@cancer-f5rdy12:~/source/blachlylab/swiftover$ ./swiftover -t maf -c resources/GRCh37_to_GRCh38.chain -g resources/Homo_sapiens.GRCh38.dna.toplevel.fa.gz -b GRCh38 -i resources/test.maf -o /tmp/test.maf -u /tmp/unmatched.maf
[I::swiftover.main.main] 🚀 swift liftover (version: splay trees)
[I::swiftover.maf.liftMAF] Reading chainfile
[I::swiftover.maf.liftMAF] Reading MAF
[I::swiftover.maf.liftMAF] Matched 8 records (0 multiply, skipped) and 0 unmatched; 0 REF allele changes
blac96@cancer-f5rdy12:~/source/blachlylab/swiftover$ wc -l resources/test.maf /tmp/test.maf /tmp/unmatched.maf
   10 resources/test.maf
   10 /tmp/test.maf
    0 /tmp/unmatched.maf
Kekananen commented 3 years ago

Thanks! This works on the full /databases/cancerhotspots/cancerhotspots.v2.maf with the following warnings. As you said there shouldn't be any REF allele changes however, especially not 2511300 of them seeing as there were 2511317 matches in total. I've updated as you've said I've changed the dub.json with the instructed lines and built.

Switching to Homo_sapiens.GRCh38.dna.primary_assembly.fa for -g

The test runs as you show:

[I::swiftover.main.main] 🚀 swift liftover (version: splay trees)
[I::swiftover.maf.liftMAF] Reading chainfile
[I::swiftover.maf.liftMAF] Reading MAF
[I::swiftover.maf.liftMAF] Matched 1 records (0 multiply, skipped) and 0 unmatched; 0 REF allele changes

The whole file does not however:

[I::swiftover.main.main] 🚀 swift liftover (version: splay trees)
[I::swiftover.maf.liftMAF] Reading chainfile
[I::swiftover.maf.liftMAF] Reading MAF
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
......
......
[I::swiftover.maf.liftMAF] Matched 2511317 records (17 multiply, skipped) and 7731 unmatched; 2511300 REF allele changes

This will induce the warnings. There is also a larger test file called noRefTest.maf in the tools directory that induces 49 of these warnings.

#version 2.4
Hugo_Symbol     Entrez_Gene_Id  Center  NCBI_Build      Chromosome      Start_Position  End_Position    Strand  Variant_Classification  Variant_Type    Reference_Allele        Tumor_Seq_Allele1       Tumor_Seq_Allele2       dbSNP_RS        dbSNP_Val_Status        Tumor_Sample_Barcode    Matched_Norm_Sample_Barcode     Match_Norm_Seq_Allele1  Match_Norm_Seq_Allele2  Tumor_Validation_Allele1        Tumor_Validation_Allele2        Match_Norm_Validation_Allele1   Match_Norm_Validation_Allele2   Verification_Status     Validation_Status       Mutation_Status Sequencing_Phase        Sequence_Source Validation_Method       Score   BAM_File        Sequencer       Tumor_Sample_UUID       Matched_Norm_Sample_UUID        HGVSc   HGVSp   HGVSp_Short     Transcript_ID   Exon_Number     t_depth t_ref_count     t_alt_count     n_depth n_ref_count     n_alt_count     all_effects     Allele  Gene    Feature Feature_type    Consequence     cDNA_position   CDS_position    Protein_position        Amino_acids     Codons  Existing_variation      ALLELE_NUM      DISTANCE        STRAND_VEP      SYMBOL  SYMBOL_SOURCE   HGNC_ID BIOTYPE CANONICAL       CCDS    ENSP    SWISSPROT       TREMBL  UNIPARC RefSeq  SIFT    PolyPhen        EXON    INTRON  DOMAINS AF      AFR_AF  AMR_AF  ASN_AF  EAS_AF  EUR_AF  SAS_AF  AA_AF   EA_AF   CLIN_SIG        SOMATIC PUBMED  MOTIF_NAME      MOTIF_POS       HIGH_INF_POS    MOTIF_SCORE_CHANGE      IMPACT  PICK    VARIANT_CLASS   TSL     HGVS_OFFSET     PHENO   MINIMISED       ExAC_AF ExAC_AF_AFR     ExAC_AF_AMR     ExAC_AF_EAS     ExAC_AF_FIN     ExAC_AF_NFE     ExAC_AF_OTH     ExAC_AF_SAS     GENE_PHENO      FILTER  flanking_bps    variant_id      variant_qual    ExAC_AF_Adj     ExAC_AC_AN_Adj  ExAC_AC_AN      ExAC_AC_AN_AFR  ExAC_AC_AN_AMR  ExAC_AC_AN_EAS  ExAC_AC_AN_FIN  ExAC_AC_AN_NFE  ExAC_AC_AN_OTH  ExAC_AC_AN_SAS  ExAC_FILTER     gnomAD_AF       gnomAD_AFR_AF   gnomAD_AMR_AF   gnomAD_ASJ_AF   gnomAD_EAS_AF   gnomAD_FIN_AF   gnomAD_NFE_AF   gnomAD_OTH_AF   gnomAD_SAS_AF   TUMORTYPE       PLATFORM        judgement       Amino_Acid_Change       Amino_Acid_Position     Protein_Lenght  Reference_Amino_Acid    Variant_Amino_Acid      allele_freq     tm      Amino_Acid_Length       Ref_Tri oncotree_organtype      oncotree_parent oncotree_detailed       Master_ID
C3      718     .       GRCh37  19      6697759 6697760 +       Frame_Shift_Ins INS     -       -       A       novel           01-P034 NA      A       A                                                                                                                               c.2486dupT      p.Phe830LeufsTer94      p.F830Lfs*94    ENST00000245907 20/41   0       .       .       0       .       .       C3,frameshift_variant,p.Phe830LeufsTer94,ENST00000245907,NM_000064.2;C3,non_coding_transcript_exon_variant,,ENST00000602053,;C3,upstream_gene_variant,,ENST00000598805,;C3,upstream_gene_variant,,ENST00000594005,;     A       ENSG00000125730 ENST00000245907 Transcript      frameshift_variant      2579-2580/5263  2486-2487/4992  829/1663        F/FX    ttc/ttTc                1               -1      C3      HGNC    1318    protein_coding  YES     CCDS32883.1     ENSP00000245907 P01024  Q6LDJ0,M0R1Q1   UPI000013EC9B   NM_000064.2                     20/41           Pfam_domain:PF00207,hmmpanther:PTHR11412,hmmpanther:PTHR11412:SF81
Kekananen commented 3 years ago

The first ~200 lines in the original cancer hotspots maf don't cause this warning to appear by the way.

jblachly commented 3 years ago

So this is a warning that indicates something called the "tumor" allele in the MAF was not the refrence allele in the original genome (actually I may not be making that check explicitly -- will add), but IS now the reference allele in the new, destination genome. This happens from time to time, but is surprising in the context of what we thought was a somatic mutation hotspot.

That being said, it is not an error which is why it is emitted as a warning. However, I should possibly find some other way to flag these so they can be examined

jblachly commented 3 years ago

This is a problem though:

[I::swiftover.maf.liftMAF] Matched 2511317 records (17 multiply, skipped) and 7731 unmatched; 2511300 REF allele changes
Kekananen commented 3 years ago

Ah apologies, my brain translated it to error as due to the amount of them it was no longer a warning in my eyes. I would assume Swiftover is classifying tumor allele as the ALT.

Also I agree that maybe sometimes a few REF-->ALT changes would be expected, the exact number I'm not sure, but it does seem odd that it changes in a recombinant hotspot location. Due to their rapid evolution in general, I would think those that are stable enough to be cataloged as what I would call "population hotspots" or "sub-population hotspots" (depending on how in-depth the originating population group was) wouldn't have as a dramatic change in a liftover as say if this was a random non-ref individual.

Any-who, I'll dig around about whether some ways/methods to save some of the unmatched if any corrections can be done after.

jblachly commented 3 years ago

So make sure you have recompiled with the latest dhtslib and intervaltree dependencies.

Here is what I get:

blac96@cancer-f5rdy12:~/source/blachlylab/swiftover$  time ./swiftover -t maf -c resources/GRCh37_to_GRCh38.chain -g resources/Homo_sapiens
.GRCh38.dna.toplevel.fa.gz -b GRCh38 -i ~kana18/databases/cancerhotspots/cancerhotspots.v2.maf -o /tmp/hotspots_splay.maf -u /tmp/unmapped_
splay.maf
[I::swiftover.main.main] 🚀 swift liftover (version: splay trees)
[I::swiftover.maf.liftMAF] Reading chainfile
[I::swiftover.maf.liftMAF] Reading MAF
[W::swiftover.maf.liftMAF] New REF matches tumor allele
...
[I::swiftover.maf.liftMAF] Matched 2511317 records (17 multiply, skipped) and 7731 unmatched; 56443 REF allele changes

real    13m2.187s
user    12m17.739s
sys     0m33.966s

When I look at your output MAF, it shows tha tyou are still running a binary compiled with a version of dhtslib that misses the latest commit, which fixes the coordinate translation problem. I have no idea why this is.

jblachly commented 3 years ago

@Kekananen I recompiled it for you. not sure wh y you had old cached lib functions. It works now Make sure you are doing conda deactivate (or preferably just remove the auto activation of your env from your login files) before trying to compile things

kana18@cancer-f5rdy12:~/tools/swiftover$ ./swiftover -t maf -c resources/GRCh37_to_GRCh38.chain -g ~blac96/source/blachlylab/swiftover/resources/Homo_sapiens.GRCh38.dna.toplevel.fa.gz -b GRCh38 -i <(head -100000 ../../databases/cancerhotspots/cancerhotspots.v2.maf) -o /tmp/kana
_hotspots.maf -u /tmp/kana_unmatched.maf
[I::swiftover.main.main] 🚀 swift liftover (version: splay trees)
[I::swiftover.maf.liftMAF] Reading chainfile
[I::swiftover.maf.liftMAF] Reading MAF
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[W::swiftover.maf.liftMAF] New REF matches tumor allele
[I::swiftover.maf.liftMAF] Matched 99692 records (0 multiply, skipped) and 306 unmatched; 10905 REF allele changes

Note the reason there are about 10% REF allele changes is actually because of the way MAF represents insertions specifically (should be represented as - , despite that they include start and end coordinates corresponding to the lenght of the insertion, which seems strange. I made a new issue in the repo for this.

Kekananen commented 3 years ago

I think I've removed conda once again from my .bashrc, though I've removed it prior to this so I must be doing something to re add it's auto activation since it crept back in. I'll take some time this weekend to do some house keeping and make sure everything is tidy once again.

I'd be interested in knowing how you recompiled it differently from me so I can also compile later in that way if need be. By the hint of it though, sounds like you might have just partially cleared a cache somewhere.

Thanks for getting this implemented.

jblachly commented 3 years ago

Commit e6990e796f7bfafe209c08eb5e8f27da8c68df2e brings MAF support close to completion

Kekananen commented 3 years ago

With the newest commit after the update using the same example file above the following is thrown:

[I::swiftover.main.main] 🚀 swift liftover (version: splay trees)
[W::swiftover.maf.liftMAF] Only fields build, chr, start, end, ref/tumor1/tumor2 are updated.
[W::swiftover.maf.liftMAF] A reference allele change might invalidate other fields.
[I::swiftover.maf.liftMAF] Reading chainfile
core.exception.AssertError@source/swiftover/chain.d(294): Assertion failure
----------------
??:? [0x557ecbedf6a5]
??:? [0x557ecbeeb026]
??:? [0x557ecbecd1bd]
??:? [0x557ecbec4fcc]
chain.d:294 [0x557ecbe5415f]
chain.d:557 [0x557ecbe3fe6d]
maf.d:132 [0x557ecbe48b70]
main.d:112 [0x557ecbe4642c]
??:? [0x557ecbecce8b]
??:? [0x557ecbeccd87]
??:? [0x557ecbeccbdd]
/home/OSUMC.EDU/kana18/dlang/ldc-1.24.0/bin/../import/core/internal/entrypoint.d:42 [0x557ecbe4a8e4]
??:? __libc_start_main [0x7f33be3c3bf6]
??:? [0x557ecbe3df59]

Debug mode shows this

[I::swiftover.main.main] 🚀 swift liftover (version: splay trees)
[W::swiftover.maf.liftMAF] Only fields build, chr, start, end, ref/tumor1/tumor2 are updated.
[W::swiftover.maf.liftMAF] A reference allele change might invalidate other fields.
[I::swiftover.maf.liftMAF] Reading chainfile
std.conv.ConvException@/home/OSUMC.EDU/kana18/dlang/ldc-1.24.0/bin/../import/std/conv.d(2382): Unexpected end of input when converting from type char[] to type
----------------

Using ./swiftover -t maf -c resources/GRCh37_to_GRCh38.chain.gz -g $HOME/databases/genomes/Homo_sapiens.GRCh38.dna.primary_assembly.fa -b GRCh38 -i $HOME/tools/test.maf -u $HOME/tools/unmatched.maf -o $HOME/tools/test.38.maf

I have checked and have the latest dependencies and this persists on the larger file. Let me know if I can help in any way to fix this!

jblachly commented 3 years ago

https://github.com/blachlylab/swiftover/issues/23#issuecomment-737426268

Requirements Chain file If you are working with human data, you can quickly grab hg19 to hg38 (UCSC-style contig naming: chr1, chrM) and GRCh37 to GRCh38 (Ensembl-style contig naming: 1, MT) by issuing make chains. Chainfiles will be placed in resources/, and for the time being must be un-gzipped.

Probably I should add a test to see if the filename ends in .gz, .bz2, etc.

Kekananen commented 3 years ago

Indeed, this was a quick resolve. The file is lifted over beautifully now.