AlexandrovLab / SigProfilerExtractor

SigProfilerExtractor allows de novo extraction of mutational signatures from data generated in a matrix format. The tool identifies the number of operative mutational signatures, their activities in each sample, and the probability for each signature to cause a specific mutation type in a cancer sample. The tool makes use of SigProfilerMatrixGenerator and SigProfilerPlotting.
BSD 2-Clause "Simplified" License

Running tool with custom CNV calls #252

Open kbrar4013 opened 1 month ago

kbrar4013 commented 1 month ago

Hi,

I'm having issues trying to run SigProfiler for CNV data using custom CNV calls. We have an in-house pipeline for CNV calls that is similar to the ASCAT approach, tailored to our WGS data. I tried to match the format of my calls to what ASCAT_NGS includes in their documentation (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6097604/).

The columns I provided are (in this order): 'sample', 'segment_number', 'chromosome', 'start_position', 'end_position', 'major_normal', 'minor_normal', 'major_tumor', 'minor_tumor'. This would correspond to what ASCAT_NGS says is the output format for the 'copynumber.caveman.csv' file.

I'm running SigProfiler as follows:

from SigProfilerExtractor import sigpro as sig

def main_function():
    segment_file = "/home/kbrar/MOCHA_Jun_2024/somatic/cnv/adjcopies_segments/ascat_format_output.csv"

    sig.sigProfilerExtractor("seg:ASCAT_NGS", "/home/kbrar/MOCHA_Jun_2024/somatic/cnv/sigprofiler_CNV_Jul2024", segment_file, reference_genome="GRCh38", opportunity_genome="GRCh38")

if __name__ == "__main__":
    main_function()

However, when I run SigProfiler as above, I get a KeyError. If I add the "Tumour TCN", "Normal BCN", and "Tumour BCN" columns that the errors indicate the tool expects, I instead get a KeyError for 'sample', which is definitely a column in the input CSV file. I've pasted that traceback below.

Would you be able to clarify the exact input format needed for "ASCAT_NGS" type of input, or the specific columns required for any of the input types? I am able to adjust our input to whichever columns SigProfiler expects, but it is just unclear what exact columns should be provided for each respective input. If this could be clarified, I would be able to adjust my input column names to what is expected. Thanks!!

** Reported Current Memory Use: 0.5 GB *****

Traceback (most recent call last):
  File "/home/kbrar/miniforge3/envs/sigprofiler/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3802, in get_loc
    return self._engine.get_loc(casted_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'sample'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/kbrar/python_shell_scripts/sigprofiler_cnv_run.py", line 10, in <module>
    main_function()
  File "/home/kbrar/python_shell_scripts/sigprofiler_cnv_run.py", line 7, in main_function
    sig.sigProfilerExtractor(input_type="seg:ASCAT_NGS", output="/home/kbrar/MOCHA_Jun_2024/somatic/cnv/sigprofiler_CNV_Jul2024", input_data=segment_file, reference_genome="GRCh38", opportunity_genome="GRCh38")
  File "/home/kbrar/miniforge3/envs/sigprofiler/lib/python3.11/site-packages/SigProfilerExtractor/sigpro.py", line 680, in sigProfilerExtractor
    genomes = scna.generateCNVMatrix(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kbrar/miniforge3/envs/sigprofiler/lib/python3.11/site-packages/SigProfilerMatrixGenerator/scripts/CNVMatrixGenerator.py", line 466, in generateCNVMatrix
    nmf_matrix, annotated_df = annotateSegFile(df, file_type, project, output_path)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kbrar/miniforge3/envs/sigprofiler/lib/python3.11/site-packages/SigProfilerMatrixGenerator/scripts/CNVMatrixGenerator.py", line 72, in annotateSegFile
    columns = list(df["sample"].unique())
                   ~~^^^^^^^^^^
  File "/home/kbrar/miniforge3/envs/sigprofiler/lib/python3.11/site-packages/pandas/core/frame.py", line 3807, in __getitem__
    indexer = self.columns.get_loc(key)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kbrar/miniforge3/envs/sigprofiler/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3804, in get_loc
    raise KeyError(key) from err
KeyError: 'sample'
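
A KeyError for a column that is visibly present in the file can sometimes come from how the header is parsed rather than from the column being absent. The sketch below is one way to check this independently of SigProfiler: it reads only the header with a given delimiter and lists which of the columns named in the post pandas does not see. The helper name `missing_columns`, the toy input, and the choice of delimiters to try are all assumptions for illustration; how the tool itself reads the file is not confirmed anywhere in this thread.

```python
import io

import pandas as pd

# Column names copied from the post above. Whether SigProfiler expects
# exactly these names for "seg:ASCAT_NGS" is NOT confirmed here.
EXPECTED_COLUMNS = [
    "sample", "segment_number", "chromosome", "start_position",
    "end_position", "major_normal", "minor_normal", "major_tumor",
    "minor_tumor",
]


def missing_columns(path_or_buffer, sep):
    """Hypothetical helper: expected columns pandas does not see with `sep`."""
    # nrows=0 parses only the header row.
    header = pd.read_csv(path_or_buffer, sep=sep, nrows=0)
    # Strip stray whitespace, which would also make a lookup fail.
    seen = [str(name).strip() for name in header.columns]
    return [name for name in EXPECTED_COLUMNS if name not in seen]


# Toy comma-separated input with made-up values, standing in for the real file.
toy = (
    "sample,segment_number,chromosome,start_position,end_position,"
    "major_normal,minor_normal,major_tumor,minor_tumor\n"
    "S1,1,1,1000,250000,1,1,2.4,0.9\n"
)

print("parsed as comma-separated:", missing_columns(io.StringIO(toy), sep=","))
print("parsed as tab-separated:  ", missing_columns(io.StringIO(toy), sep="\t"))
```

If every expected column is reported missing under one delimiter but none under another, the header is being read as a single field under the first delimiter, which would be worth ruling out before renaming columns.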

kbrar4013 commented 1 month ago

I'll also mention that in some instances I get "KeyError: '1:het:100kb-1Mb'" instead.

kbrar4013 commented 1 month ago

I was able to get it to work by rounding all the copy numbers to whole numbers. Is this required for SigProfiler?
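
For anyone hitting the same error, here is a minimal sketch of that rounding workaround using pandas. It assumes the column names listed in the first comment; the helper name `round_copy_numbers` and the toy values are made up for illustration, and whether integer copy numbers are actually required by the tool is still the open question above.

```python
import pandas as pd

# Copy-number columns as named in the first comment of this thread.
COPY_NUMBER_COLUMNS = ["major_normal", "minor_normal", "major_tumor", "minor_tumor"]


def round_copy_numbers(df, columns=COPY_NUMBER_COLUMNS):
    """Hypothetical helper: return a copy of df with copy numbers as integers."""
    out = df.copy()
    # round() gives the nearest whole number; astype(int) then stores it as an
    # integer. astype(int) raises if a column contains missing values, so
    # those would need handling first.
    out[columns] = out[columns].round().astype(int)
    return out


# Toy segment table with made-up fractional copy numbers.
segments = pd.DataFrame(
    {
        "sample": ["S1", "S1"],
        "segment_number": [1, 2],
        "chromosome": ["1", "1"],
        "start_position": [1000, 250001],
        "end_position": [250000, 900000],
        "major_normal": [1.0, 1.0],
        "minor_normal": [1.0, 1.0],
        "major_tumor": [2.4, 3.2],
        "minor_tumor": [0.9, 0.1],
    }
)

rounded = round_copy_numbers(segments)
print(rounded[COPY_NUMBER_COLUMNS])
```

The rounded table can then be written back out with `DataFrame.to_csv` in whatever delimiter the original file used and passed to SigProfiler as before.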