getzlab / SignatureAnalyzer

Updated SignatureAnalyzer-GPU with mutational spectra & RNA expression compatibility.
MIT License

"Unusual context" error #29

julieefeusier closed 3 years ago

julieefeusier commented 3 years ago

Hi,

I converted Strelka VCF files to maf format using annovar and maftools. I removed non-exonic somatic variants from the maf file prior to running SignatureAnalyzer. I'm running into an error with the maf format. Here is my command:

import signatureanalyzer as sa
import pandas as pd

maf_df = pd.read_csv(
    "Input_351_hg38_multianno_exonic_SNP_min.maf", sep='\t'
).loc[:, [
    'Hugo_Symbol', 'Tumor_Sample_Barcode', 'Chromosome', 'Start_Position',
    'Reference_Allele', 'Tumor_Seq_Allele2', 'Variant_Type'
]]

_,spectra_sbs = sa.spectra.get_spectra_from_maf(maf_df, cosmic='cosmic3_exome', hgfile='hg38.2bit')

  * Mapping contexts: 17327 / 17328

Traceback (most recent call last):
  File "/uufs/env/lib/python3.8/site-packages/signatureanalyzer/spectra.py", line 114, in get_spectra_from_maf
    maf['context96.num'] = contig.apply(context96.__getitem__)
  File "/uufs/env/lib/python3.8/site-packages/pandas/core/series.py", line 4108, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2467, in pandas._libs.lib.map_infer
KeyError: '-AAG'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 1, in
  File "/uufs/env/lib/python3.8/site-packages/signatureanalyzer/spectra.py", line 116, in get_spectra_from_maf
    raise KeyError('Unusual context: ' + str(e))
KeyError: "Unusual context: '-AAG'"

I get the same error when I run:

signatureanalyzer Input_351_hg38_multianno_exonic_SNP_min.maf --hg_build hg38.2bit -n 10 --cosmic cosmic3_exome --objective poisson --max_iter 30000 --prior_on_H L1 --prior_on_W L1

I don't have "AAG" in my file. Any thoughts? Thanks!

jcha40 commented 3 years ago

Hello,

Looking at the traceback, the maf file most likely contains an insertion that is marked as a SNP in the Variant_Type column. The context for a single-base substitution is 4 characters long, in the format (ref)(alt)(ref-1)(ref+1), and is built from columns in the maf and the hg reference sequence. In '-AAG' the first character, the reference base, is '-', which denotes an insertion. The remaining 'AAG' is therefore not a string from your file: it is the alternate allele followed by the two flanking bases taken from the reference.
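For anyone hitting the same error, here is a minimal sketch of how the mislabeled rows could be found before calling get_spectra_from_maf. It is not part of SignatureAnalyzer: the function name split_true_snps and the VALID_BASES set are made up for illustration, and it assumes the maf uses the column names shown in the question and marks substitutions with Variant_Type == "SNP". The small table is toy data, not taken from the real file.

```python
import pandas as pd

# Hypothetical helper, not a SignatureAnalyzer function.
VALID_BASES = {"A", "C", "G", "T"}


def split_true_snps(maf_df: pd.DataFrame):
    """Split a maf into (clean, suspect) rows.

    A row is 'suspect' when it is labeled SNP but its reference or
    alternate allele is not a single A/C/G/T base (for example '-').
    """
    ref_ok = maf_df["Reference_Allele"].astype(str).str.upper().isin(VALID_BASES)
    alt_ok = maf_df["Tumor_Seq_Allele2"].astype(str).str.upper().isin(VALID_BASES)
    is_snp = maf_df["Variant_Type"] == "SNP"

    suspect = maf_df[is_snp & ~(ref_ok & alt_ok)]
    clean = maf_df[~is_snp | (ref_ok & alt_ok)]
    return clean, suspect


# Toy data: one real substitution and two indels mislabeled as SNP.
toy = pd.DataFrame({
    "Tumor_Sample_Barcode": ["S1", "S1", "S2"],
    "Reference_Allele": ["C", "-", "G"],
    "Tumor_Seq_Allele2": ["T", "A", "-"],
    "Variant_Type": ["SNP", "SNP", "SNP"],
})

clean, suspect = split_true_snps(toy)
print(f"{len(suspect)} suspect rows, {len(clean)} rows kept")
```

Whether the suspect rows should be dropped or relabeled as INS/DEL depends on the analysis, so the sketch only separates them and leaves that choice open.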

julieefeusier commented 3 years ago

Thanks for helping me, that was it! There were about 200 instances total in both reference/alternative columns. Thanks!