AlexandrovLab / SigProfilerExtractor

SigProfilerExtractor allows de novo extraction of mutational signatures from data generated in a matrix format. The tool identifies the number of operative mutational signatures, their activities in each sample, and the probability for each signature to cause a specific mutation type in a cancer sample. The tool makes use of SigProfilerMatrixGenerator and SigProfilerPlotting.
BSD 2-Clause "Simplified" License

Error generating outputs using WES data #86

Closed AldhairMedico closed 3 years ago

AldhairMedico commented 3 years ago

Dear SigProfilerExtractor developers, I've been using your tool with no problems until now. I'm trying to extract mutational signatures from whole-exome sequencing (WES) data. I used Mutect2, Lancet, and Strelka2 for variant calling. I had no problems with the inputs from the first two callers. However, when I run the tool on the Strelka2 VCFs, it fails.

Input command:
sig.sigProfilerExtractor("vcf", "results_wes_relevant_strelka2", "/home/aldhair/Downloads/rockefeller_projects/d_mutational_signatures/SigProfilerExtractor/wes_relevant_strelka2", reference_genome="GRCh38", opportunity_genome = "GRCh38", exome = True, minimum_signatures=1, maximum_signatures=25, nmf_replicates=100, resample = True, batch_size=1, cpu=-1, gpu=False, nmf_init="nndsvd_min", precision= "single", matrix_normalization= "gmm", min_nmf_iterations= 10000, max_nmf_iterations=1000000, nmf_test_conv= 10000, nmf_tolerance= 1e-15, nnls_add_penalty=0.05, nnls_remove_penalty=0.01, initial_remove_penalty=0.05, de_novo_fit_penalty=0.02, get_all_signature_matrices= False)

Log Error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/aldhair/Downloads/rockefeller_projects/d_mutational_signatures/SigProfilerExtractor/SigProfilerExtractor/sigpro.py", line 609, in sigProfilerExtractor
    data = datadump.SigProfilerMatrixGeneratorFunc(project_name, refgen, project, exome=exome, bed_file=None, chrom_based=False, plot=False, gs=False)
  File "/home/aldhair/.local/lib/python3.8/site-packages/SigProfilerMatrixGenerator/scripts/SigProfilerMatrixGeneratorFunc.py", line 363, in SigProfilerMatrixGeneratorFunc
    snv, indel, skipped, samples = convertIn.convertVCF(project, vcf_path, genome, output_path, ncbi_chrom, log_file)
  File "/home/aldhair/.local/lib/python3.8/site-packages/SigProfilerMatrixGenerator/scripts/convert_input_to_simple_files.py", line 58, in convertVCF
    for lines in f:
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Log output: `THIS FILE CONTAINS THE METADATA ABOUT SYSTEM AND RUNTIME

-------System Info-------
Operating System Name: Linux
Nodename: aldhair-G7-7588
Release: 5.8.0-63-generic
Version: #71~20.04.1-Ubuntu SMP Thu Jul 15 17:46:08 UTC 2021

-------Python and Package Versions-------
Python Version: 3.8.10
SigProfilerMatrixGenerator Version: 1.1.28
SigProfilerPlotting version: 1.1.15
matplotlib version: 3.3.4
statsmodels version: 0.12.2
scipy version: 1.6.3
pandas version: 1.2.4
numpy version: 1.20.3

-------Vital Parameters Used for the execution -------
Project: wes_relevant_strelka2
Genome: GRCh38
Input File Path: /home/aldhair/Downloads/rockefeller_projects/d_mutational_signatures/SigProfilerExtractor/wes_relevant_strelka2/
exome: True
bed_file: None
chrom_based: False
plot: False
tsb_stat: False
seqInfo: True

-------Date and Time Data-------
Date and Clock time when the execution started: 2021-07-29 19:11:36.349995

-------Runtime Checkpoints------- `

I suspect this is due to bcftools concat, which I used to merge the SNV and indel VCFs into a single VCF per comparison (6 in total). Can I use each SNV/indel VCF without merging them, or will that be interpreted as 12 comparisons instead of 6?

I used MergeVcfs from GATK to merge per-chromosome VCFs into a single VCF, and I had no problems running SigProfilerExtractor.

Aldhair

mdbarnesUCSD commented 3 years ago

Hi @AldhairMedico,

It looks like there is an issue with how your input VCF file is being processed by SigProfilerMatrixGenerator. Could you please check the contents of your input VCF files? There is an example of a functional VCF file on SigProfilerMatrixGenerator's wiki.
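One thing worth checking first: the byte 0x8b at position 1 in the traceback is the second byte of the gzip magic number (0x1f 0x8b), which suggests that at least one input file is gzip- or bgzip-compressed (for example a `.vcf.gz`) rather than plain text. The sketch below is not part of SigProfilerMatrixGenerator; `is_gzipped` and `decompress_gzipped_vcfs` are hypothetical helper names. It assumes the inputs are named `*.vcf` or `*.vcf.gz`, ignores everything else (such as index files), and writes plain-text copies to a separate output directory.

```python
import gzip
import shutil
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip/bgzip stream


def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def decompress_gzipped_vcfs(input_dir, output_dir):
    """Copy *.vcf and *.vcf.gz files to output_dir as plain-text VCFs.

    Compression is detected from the file content, not the extension.
    Returns the names of the input files that had to be decompressed.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    decompressed = []
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            continue
        if not (path.name.endswith(".vcf") or path.name.endswith(".vcf.gz")):
            continue  # skip index files and anything that is not a VCF
        # Drop a trailing ".gz" so the copy is named like a plain VCF.
        name = path.name[:-3] if path.name.endswith(".gz") else path.name
        target = output_dir / name
        if is_gzipped(path):
            with gzip.open(path, "rb") as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            decompressed.append(path.name)
        else:
            shutil.copyfile(path, target)
    return decompressed
```

Calling `decompress_gzipped_vcfs` on the project folder with a new, empty output folder and then pointing the extractor at that output folder would show whether compression is the cause. The output folder should differ from the input folder so that no original file is overwritten.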

Each sample will need to be saved as its own VCF file. The Indels/SNVs can be separate or combined, as they are analyzed separately in SigProfilerMatrixGenerator and SigProfilerExtractor.
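If a single combined file per sample is preferred, one way to build it as uncompressed text is sketched below. This is a minimal illustration, not how SigProfilerMatrixGenerator or bcftools does it; `open_vcf` and `combine_vcfs` are hypothetical helper names. It assumes that all input files describe the same sample and that keeping only the first file's header is acceptable to the downstream tool. Records are appended in input order and are neither sorted nor deduplicated.

```python
import gzip


def open_vcf(path):
    """Open a VCF as text, transparently handling gzip/bgzip compression."""
    with open(path, "rb") as handle:
        compressed = handle.read(2) == b"\x1f\x8b"  # gzip magic number
    return gzip.open(path, "rt") if compressed else open(path, "r")


def combine_vcfs(vcf_paths, output_path):
    """Write the header of the first VCF plus the records of all VCFs.

    The output is always plain text. Returns the number of variant
    records written.
    """
    records = 0
    with open(output_path, "w") as out:
        for index, path in enumerate(vcf_paths):
            with open_vcf(path) as vcf:
                for line in vcf:
                    if not line.endswith("\n"):
                        line += "\n"  # guard against a missing final newline
                    if line.startswith("#"):
                        # Keep header lines from the first file only.
                        if index == 0:
                            out.write(line)
                        continue
                    out.write(line)
                    records += 1
    return records
```

For example, with hypothetical file names, `combine_vcfs(["sample1.snvs.vcf.gz", "sample1.indels.vcf.gz"], "sample1.vcf")` would produce one plain-text file for that sample.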

Thanks, Mark

mdbarnesUCSD commented 3 years ago

Please re-open this issue if the problem still exists.