AlexandrovLab / SigProfilerExtractor

SigProfilerExtractor allows de novo extraction of mutational signatures from data generated in a matrix format. The tool identifies the number of operative mutational signatures, their activities in each sample, and the probability for each signature to cause a specific mutation type in a cancer sample. The tool makes use of SigProfilerMatrixGenerator and SigProfilerPlotting.
BSD 2-Clause "Simplified" License

SigProfilerExtractor, no ID signatures, ValueError: operands could not be broadcast together with shapes (78,0) (0,0) #130

Closed oliverartz closed 2 years ago

oliverartz commented 2 years ago

Dear developers,

I have been using SigProfilerExtractor to extract mutational signatures for a number of VCF files. For two of those files, no ID signatures were produced and the script exited with an error:

ValueError: operands could not be broadcast together with shapes (78,0) (0,0)

I do, however, get the deconvolution for SBS and DBS for those samples.

The VCF files contain INDELs, as the first lines of the output indicate:

Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 2.3 seconds.
Starting matrix generation for INDELs...Completed! Elapsed time: 1.8 seconds.
Matrices generated for 1 samples with 0 errors. Total of 395 SNVs, 4 DINUCs, and 109 INDELs were successfully analyzed.

This is the content of JOB_METADATA.txt:

-------System Info-------
Operating System Name: Darwin
Nodename: [REMOVED]
Release: 21.4.0
Version: Darwin Kernel Version 21.4.0: Fri Mar 18 00:45:05 PDT 2022; root:xnu-8020.101.4~15/RELEASE_X86_64

-------Python and Package Versions------- 
Python Version: 3.8.13
SigProfilerExtractor Version: 1.1.7
SigProfilerPlotting Version: 1.2.1
SigProfilerMatrixGenerator Version: 1.2.5
Pandas version: 1.4.2
Numpy version: 1.22.3
Scipy version: 1.8.0
Scikit-learn version: 1.1.0

--------------EXECUTION PARAMETERS--------------
INPUT DATA
    input_type: vcf
    output: [REMOVED]
    input_data: [REMOVED]
    reference_genome: mm10
    context_types: SBS96,DBS78,ID83
    exome: True
NMF REPLICATES
    minimum_signatures: 1
    maximum_signatures: 10
    NMF_replicates: 100
NMF ENGINE
    NMF_init: random
    precision: single
    matrix_normalization: gmm
    resample: True
    seeds: random
    min_NMF_iterations: 10,000
    max_NMF_iterations: 1,000,000
    NMF_test_conv: 10,000
    NMF_tolerance: 1e-15
CLUSTERING
    clustering_distance: cosine
EXECUTION
    cpu: 8; Maximum number of CPU is 8
    gpu: False
Solution Estimation
    stability: 0.8
    min_stability: 0.2
    combined_stability: 1.0
COSMIC MATCH
    opportunity_genome: GRCh37
    cosmic_version: 3.1
    nnls_add_penalty: 0.05
    nnls_remove_penalty: 0.01
    initial_remove_penalty: 0.05
    de_novo_fit_penalty: 0.02
    refit_denovo_signatures: True
    collapse_to_SBS96: True

-------Analysis Progress------- 
[2022-06-07 15:38:25] Analysis started: 

##################################

[2022-06-07 15:38:29] Analysis started for SBS96. Matrix size [96 rows x 1 columns]

[2022-06-07 15:38:29] Normalization GMM with cutoff value set at 9600

[2022-06-07 15:39:20] SBS96 de novo extraction completed for a total of 1 signatures! 
Execution time:0:00:50

##################################

[2022-06-07 15:39:30] Analysis started for DBS78. Matrix size [78 rows x 0 columns]

[2022-06-07 15:39:30] Normalization GMM with cutoff value set at 7800

Thanks for your help!

oliverartz commented 2 years ago

It seems that the DBS78 deconvolution causes the issue. I suspect there are not enough DINUCs in the VCF, so the entire run stops before reaching the ID step. If I try to skip the DINUC step manually by setting context_type="96,ID", I get the following error: Error in py_call_impl(callable, dots$args, dots$keywords) : KeyError: '96A'. I am not sure how to tackle this. Has anyone reported this problem before, and is there a known fix?
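For context on the error message itself: NumPy broadcasting compares two shapes from the trailing dimension backwards, and a pair of dimensions is compatible only if the two are equal or one of them is 1. The sketch below re-implements that rule in plain Python to show why a 78 x 0 matrix (the DBS78 size reported in the log) cannot be combined with a 0 x 0 array. The helper name `broadcast_shape` is hypothetical; it is not part of NumPy or SigProfilerExtractor, and it says nothing about where inside the tool the two arrays come from.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast result of two shapes, or raise ValueError.

    Hypothetical helper that mirrors NumPy's broadcasting rule.
    """
    result = []
    # Walk both shapes from the last dimension backwards;
    # a missing leading dimension is treated as 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x == y or y == 1:
            result.append(x)
        elif x == 1:
            result.append(y)
        else:
            raise ValueError(
                f"operands could not be broadcast together with shapes {a} {b}"
            )
    return tuple(reversed(result))


if __name__ == "__main__":
    # Compatible: every dimension pair is equal or contains a 1.
    print(broadcast_shape((78, 1), (1, 5)))
    # Incompatible: 78 vs 0 in the first dimension, neither is 1.
    try:
        broadcast_shape((78, 0), (0, 0))
    except ValueError as err:
        print(err)
```

The second call fails on the first dimension: 78 and 0 differ and neither is 1, so a matrix with zero sample columns cannot be paired with an empty array, regardless of how many rows it has.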

mdbarnesUCSD commented 2 years ago

Hi @oliverartz,

It looks like there may have been some issues with generating your matrices. Did you install the mm10 reference genome? If you did and are still experiencing the issue, could you please e-mail me some VCF files that reproduce it?

Thanks!

oliverartz commented 2 years ago

Thanks for the help! I double-checked the mm10 reference genome and it is installed. It is interesting that I only get this problem with certain VCFs. Your response prompted me to generate the matrices first using SigProfilerMatrixGenerator and use those as input for SigProfilerExtractor, which worked great for SBS96, DBS78, and ID83. Running SigProfilerExtractor directly on the VCF still does not work, though. I tried the R wrapper first and then the Python version. I am sending you the VCF via email.
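When generating the matrices separately, it can help to check each one for samples with no mutations before handing it to the extractor. The `[78 rows x 0 columns]` line in the log suggests the DBS78 matrix ended up with no usable sample columns, although that reading is an assumption. The sketch below uses only the standard library and assumes a tab-separated layout in which the first column holds the mutation type and each remaining column is one sample. The function name `empty_samples` is hypothetical and not part of either package.

```python
import csv


def empty_samples(lines):
    """Return the names of samples whose mutation counts sum to zero.

    `lines` is any iterable of text lines (an open file works).
    Assumed layout: tab-separated, first column = mutation type,
    every other column = one sample.
    """
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    totals = [0.0] * len(samples)
    for row in reader:
        # Add this mutation type's count to each sample's running total.
        for i, value in enumerate(row[1:]):
            totals[i] += float(value)
    return [name for name, total in zip(samples, totals) if total == 0]


if __name__ == "__main__":
    # Tiny made-up matrix for illustration only.
    demo = [
        "MutationType\tsampleA\tsampleB",
        "AC>CA\t0\t3",
        "AC>CG\t0\t1",
    ]
    print(empty_samples(demo))
```

For a file on disk, the same function can be called as `with open(path) as fh: empty_samples(fh)`. A context whose matrix contains only empty samples could then be left out of the extraction run.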

Thanks again 👍