AlexandrovLab / SigProfilerSingleSample

SigProfilerSingleSample allows attributing a known set of mutational signatures to an individual sample. The tool identifies the activity of each signature in the sample and assigns the probability for each signature to cause a specific mutation type in the sample. The tool makes use of SigProfilerMatrixGenerator and SigProfilerPlotting.

Error when running SigProfilerSingleSample. #17

Closed lincj1994 closed 2 years ago

lincj1994 commented 2 years ago

Hi @marcos-diazg @mishugeb @itsvenu. I want to assign activities of known COSMIC signatures to each sample. I prepared data (with SigProfilerMatrixGenerator) and sig_database (from the COSMIC website) following the example input files, and imported them with the following code.

from sigproSS import spss
import pandas as pd
sig_db = pd.read_csv('COSMIC_v3.2_SBS_GRCh38.csv')
data=pd.read_csv("/home/lcj/lincj/CBCGA/SigProfiler220414/input/CBCGA.SBS96.all.csv")

Below are the first few rows of each data frame.

data
   Mutation type Trinucleotide  ACEJ  ACKR  ACSK  ACYZ  AEFC  AEUJ  AEXF  AGNS  AGRL  AGVN  AILX  AIWT  AJEH  AJYK  AKVJ  ...  ZRET  ZROW  ZSRB  ZSRD  ZTPX  ZUCD  ZUXJ  ZVCO  ZVDR  ZWCY  ZXHO  ZXLK  ZYFK  ZYIG  ZYQH  ZYTB  ZYWC
0            C>A           ACA     1     0     0     0     0     0     0     4     1     1     7     1     1     0     1  ...     0     1     2     3     0     0     1     2     0     0     1     0     0     0     0     0     0
1            C>A           ACC     2     0     0     1     0     3     1     2     1     0     9     6     0     0     0  ...     0     0     3     3     0     2     1     0     1     1     3     0     1     0     0     0     1
2            C>A           ACG     0     0     1     0     0     0     1     1     0     0     2     0     0     1     0  ...     0     0     0     4     0     0     0     0     0     0     0     1     0     0     0     0     0
3            C>A           ACT     0     0     0     0     0     3     0     1     0     0     6     5     0     0     0  ...     0     0     0     4     2     0     1     0     0     0     2     0     0     0     1     0     0
4            C>G           ACA     1     0     0     1     0     0     0     1     0     0     9     5     0     2     0  ...     0     0     0    11     0     2     0     0     0     0     0     0     0     1     1     0     0
..           ...           ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
91           T>C           TTT     1     0     0     0     1     2     1     1     1     0     5     7     1     1     1  ...     0     0     0     2     0     0     1     0     0     0     0     0     0     0     0     0     0
92           T>G           TTA     0     0     0     0     0     0     0     0     1     0     2     2     0     0     0  ...     1     0     0     0     0     0     1     0     0     0     0     0     1     0     0     0     0
93           T>G           TTC     0     0     0     0     0     0     0     0     0     1     1     1     0     0     0  ...     0     1     0     0     0     0     0     0     0     0     1     0     0     0     0     1     0
94           T>G           TTG     0     0     0     0     0     0     2     1     3     0     1     0     0     0     0  ...     0     0     0     2     0     0     0     0     0     0     0     1     0     0     0     0     0
95           T>G           TTT     0     0     1     0     0     1     0     0     0     0     5     1     0     3     0  ...     0     0     0     1     0     0     0     0     0     1     2     0     1     2     1     1     0

sig_db
   Type SubType          SBS1          SBS2      SBS3      SBS5      SBS6          SBS8      SBS9         SBS13        SBS17a    SBS17b     SBS18         SBS20     SBS26     SBS30     SBS37     SBS40     SBS41
0   C>A     ACA  8.760230e-04  5.790000e-07  0.020920  0.012052  0.000425  4.431064e-02  0.000561  1.816879e-03  2.072799e-03  0.000608  0.051688  6.242480e-04  0.000877  0.001811  0.003963  0.028323  0.002120
1   C>A     ACC  2.220120e-03  1.455050e-04  0.016343  0.009337  0.000516  4.729956e-02  0.004047  7.088420e-04  9.052930e-04  0.000127  0.015617  1.380514e-03  0.000522  0.000501  0.001433  0.013254  0.001207
2   C>A     ACG  1.797270e-04  5.360000e-05  0.001808  0.001908  0.000053  4.767276e-03  0.000440  2.706560e-04  4.890000e-05  0.000060  0.002505  2.260000e-05  0.000118  0.000094  0.001092  0.003012  0.000063
3   C>A     ACT  1.265053e-03  9.760000e-05  0.012265  0.006636  0.000180  4.720459e-02  0.003063  3.472570e-04  6.190000e-05  0.000456  0.021469  1.249985e-03  0.000621  0.000559  0.001855  0.014858  0.001336
4   C>G     ACA  1.839055e-03  2.230000e-16  0.019813  0.010144  0.000471  4.350682e-03  0.004863  3.863364e-03  1.011366e-03  0.000146  0.001736  8.844347e-03  0.000429  0.001076  0.034416  0.012253  0.005355
..  ...     ...           ...           ...       ...       ...       ...           ...       ...           ...           ...       ...       ...           ...       ...       ...       ...       ...       ...
91  T>C     TTT  4.274201e-03  3.570000e-05  0.013957  0.018550  0.001738  4.584279e-03  0.038518  5.292180e-04  2.099382e-02  0.000998  0.003377  1.943161e-02  0.057560  0.000447  0.061507  0.010228  0.046241
92  T>G     TTA  2.170000e-16  1.640000e-05  0.007161  0.005149  0.000103  2.190000e-16  0.064829  1.803960e-04  2.180000e-16  0.000012  0.000686  2.200000e-16  0.001411  0.000117  0.018033  0.008345  0.041343
93  T>G     TTC  5.520000e-05  7.120000e-05  0.006401  0.006677  0.000291  1.160874e-03  0.008777  2.250000e-16  1.177210e-04  0.008864  0.002136  2.270000e-16  0.001751  0.000098  0.019830  0.011604  0.015783
94  T>G     TTG  5.776140e-04  9.540000e-05  0.008113  0.006984  0.000325  3.111109e-03  0.010974  3.670000e-05  9.231280e-04  0.004788  0.001458  2.819407e-03  0.002858  0.000819  0.030364  0.008716  0.019531
95  T>G     TTT  2.200000e-16  2.220000e-16  0.010543  0.013536  0.001009  9.991120e-04  0.064097  1.880000e-05  4.578653e-03  0.121753  0.005170  1.520297e-03  0.009476  0.008927  0.029151  0.025068  0.088168

I ran spss on these two data frames, but it did not work.

spss.single_sample(data, "spss_output", ref="GRCh38", exome=True, sig_database=sig_db)
##########################################################
Exacting Profile for Sample 1
>>>

I'm wondering whether the input files are formatted incorrectly, but I prepared them according to the example files, including the column names.

lincj1994 commented 2 years ago

Even when I run the example data, nothing is generated in the spss_output folder.

spss.single_sample(data, "spss_output", ref="GRCh38", exome=True, check_rules=False)

marcos-diazg commented 2 years ago

Hi, @lincj1994. The issue you have with the example data is due to the reference genome. GRCh37 should be used instead. In the case of your data, you need to provide only numbers in the pandas data frames for both the mutational matrix (data) and the signature database (sig_db).
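A quick way to see which columns break the numbers-only requirement is to list the non-numeric columns of each data frame before calling the tool. The snippet below is a minimal sketch using plain pandas on a toy frame built from the first rows posted above; the helper name `non_numeric_columns` is hypothetical and not part of SigProfilerSingleSample.

```python
import pandas as pd

# Toy stand-in for the mutational matrix read from CSV (first rows of the post above).
data = pd.DataFrame({
    "Mutation type": ["C>A", "C>A", "C>G"],
    "Trinucleotide": ["ACA", "ACC", "ACA"],
    "ACEJ": [1, 2, 1],
    "ACKR": [0, 0, 0],
})

def non_numeric_columns(df):
    """Hypothetical helper: return the names of columns whose dtype is not numeric."""
    return list(df.select_dtypes(exclude="number").columns)

# Label columns such as the mutation type and trinucleotide context show up here.
label_cols = non_numeric_columns(data)
print(label_cols)
```

The same check can be applied to sig_db, where the Type and SubType columns would be reported.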

I hope this helps, and thanks for your interest! Please reopen the issue in case you still have problems.

lincj1994 commented 2 years ago

Hi. Does it mean that I should match the mutation type and trinucleotide between data and sig_db and then remove the first two columns of both data frames?

lincj1994 commented 2 years ago

OK. It worked! Thanks.
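The approach confirmed in this thread (match the rows of both tables on mutation type and trinucleotide, then drop the two label columns) can be sketched in pandas as below. This is an illustration under stated assumptions, not the tool's own preprocessing: the helper `align_and_strip` is hypothetical, the toy frames reuse a few values from the post, and the column names are taken from the data frames shown above.

```python
import pandas as pd

# Toy mutational matrix (counts) with the two label columns from the post.
data = pd.DataFrame({
    "Mutation type": ["C>A", "C>A", "C>G"],
    "Trinucleotide": ["ACA", "ACC", "ACA"],
    "ACEJ": [1, 2, 1],
})

# Toy signature table, deliberately in a different row order.
sig_db = pd.DataFrame({
    "Type": ["C>G", "C>A", "C>A"],
    "SubType": ["ACA", "ACC", "ACA"],
    "SBS1": [1.839055e-03, 2.220120e-03, 8.760230e-04],
})

def align_and_strip(data, sig_db):
    """Hypothetical helper: put sig_db rows in the order of data, then drop labels."""
    counts = data.set_index(["Mutation type", "Trinucleotide"])
    sigs = sig_db.set_index(["Type", "SubType"])
    # Both tables must describe exactly the same mutation contexts.
    if set(counts.index) != set(sigs.index):
        raise ValueError("data and sig_db do not cover the same mutation contexts")
    # Reorder the signature rows to follow the row order of the count matrix.
    sigs = sigs.reindex(counts.index)
    # Discard the label index so only numeric columns remain.
    return counts.reset_index(drop=True), sigs.reset_index(drop=True)

counts_only, sigs_only = align_and_strip(data, sig_db)
print(counts_only)
print(sigs_only)
```

The two numeric-only frames would then take the place of data and sig_db in the spss.single_sample call shown earlier in the thread.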