AlexandrovLab / SigProfilerSingleSample

SigProfilerSingleSample allows attributing a known set of mutational signatures to an individual sample. The tool identifies the activity of each signature in the sample and assigns the probability for each signature to cause a specific mutation type in the sample. The tool makes use of SigProfilerMatrixGenerator and SigProfilerPlotting.

Input differs than output of SigProfilerExtractor #3

Closed anishaluthra closed 4 years ago

anishaluthra commented 4 years ago

Hi, I hope I'm not missing anything, but the paper that came out earlier this year mentioned that there were two steps to extracting the COSMIC signatures: SigProfilerExtractor and SigProfilerSingleSample. None of the outputs of SigProfilerExtractor seem to match the possible input formats of SigProfilerSingleSample. I tried running SigProfilerSingleSample on the vcf files I have. The issues I'm having are:

  1. I am not getting any errors, but it seems like only a fraction of the samples are being extracted.
  2. I am not getting these files in my output folder, which I believe I should be getting: exposure.txt, signature.txt, probabilities.txt, the signature plot pdf, the dendrogram plot, and decomposition profile.csv. I'm only getting decomposition_profile.csv and files that match the output of SigProfilerMatrixGeneratorFunc.

Thank you in advance!

marcos-diazg commented 4 years ago

Hi Anisha,

Thank you for your question. Your understanding is correct with regard to the two steps needed for the complete signature extraction process: the de novo extraction of mutational signatures, and the attribution of the somatic mutations to each previously extracted signature. However, both of them are implemented in SigProfilerExtractor.

SigProfilerSingleSample, on the other hand, is a stand-alone tool designed to be used when only a limited number of samples is available. Its input file type is the same as for SigProfilerExtractor, namely vcf files or a python dataframe. The difference is that SigProfilerSingleSample does not perform any extraction of signatures: it directly uses the reference COSMIC set of mutational signatures (or another provided set of signatures) to attribute the mutations of the samples.

Regarding your issues, can you please share the command you used to run SigProfilerSingleSample?

anishaluthra commented 4 years ago

Hi Marcos,

Okay, that makes a lot of sense - thank you for clearing that up. I'm interested in COSMIC signatures, so I really don't need to be using SigProfilerExtractor then - is that correct?

Yes, I tried using both of these commands:

    spss.single_sample(data, output, ref="GRCh37", sig_database = "default", check_rules = True, exome=False)

and

    spss.single_sample(data, output, ref="GRCh37", sig_database = sig_database, check_rules = True, exome=False)

where data was the folder name with my vcfs, output was the folder name where I wanted the output, and sig_database = 'sigProfiler_SBS_signatures.csv'. Out of my 1677 samples, it only extracted the profile for 577 samples.

anishaluthra commented 4 years ago

I just tried changing data to a folder where I had the python dataframe (that I was able to use for SigProfilerExtractor). When changing data to the file itself, I got this error:

    NotADirectoryError: [Errno 20] Not a directory: 'attribution_table/SBS96_table.all/input/'

When I added the dataframe to an input folder within my project folder, I got this error:

    Traceback (most recent call last):
      File "sig_profiler_attribution_table.py", line 9, in <module>
        spss.single_sample(data, output, ref="GRCh37", sig_database = sig_database, check_rules = True, exome=False)
      File "/home/luthraa/.local/lib/python3.7/site-packages/sigproSS/spss.py", line 557, in single_sample
        data = matGen.SigProfilerMatrixGeneratorFunc(vcf_name, ref, vcf, exome=exome, tsb_stat= True)
      File "/home/luthraa/.local/lib/python3.7/site-packages/SigProfilerMatrixGenerator/scripts/SigProfilerMatrixGeneratorFunc.py", line 366, in SigProfilerMatrixGeneratorFunc
        samples = sorted(samples)
    UnboundLocalError: local variable 'samples' referenced before assignment

It seems like it was expecting a folder with vcf files if I'm not mistaken.

anishaluthra commented 4 years ago

It might be helpful to look at the output. I realized there are some warnings in there.

The two main warnings are these:

    MT is not supported. You will need to download that chromosome and create the required files. Continuing with the matrix generation...

and

    There appears to be a duplicate single base substitution. Skipping this mutation: 1 36552395 C A

(the second is one of many, with varying mutations).

I downloaded the GRCh37 genome though (not sure if that's why the first warning is occurring).

I also uploaded the output log. Thank you again for your help! SigProfilerMatrixGenerator_wes_single_sample_GRCh372020-07-15.txt
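For a log of this size, it can help to count how often each warning type occurs before reading it line by line. The following is a minimal sketch, not part of SigProfilerMatrixGenerator: the function names are hypothetical, and the matched phrases are taken only from the warnings quoted in this thread, so they are an assumption about the log wording in other versions.

```python
import re
from collections import Counter

# Phrases copied from the warnings quoted in this thread. Treat the exact
# wording as an assumption; other tool versions may phrase them differently.
WARNING_PATTERNS = {
    "unsupported_chromosome": re.compile(r"is not supported"),
    "duplicate_mutation": re.compile(r"duplicate single base substitution"),
    "reference_mismatch": re.compile(r"reference base does not match"),
}


def summarize_warnings(lines):
    """Count how many lines match each known warning type."""
    counts = Counter()
    for line in lines:
        for name, pattern in WARNING_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break  # count each line at most once
    return counts


def summarize_log_file(path):
    """Convenience wrapper (hypothetical helper) that reads a log from disk."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return summarize_warnings(handle)
```

Calling `summarize_log_file` on the attached log would return a `Counter` keyed by warning type. A large `reference_mismatch` count relative to the others would point toward a genome build problem rather than the MT or duplicate warnings.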

marcos-diazg commented 4 years ago

Hi Anisha,

Thanks again for your interest. Sorry, but I think your understanding is not correct. When you have such a large number of samples, the first approach would be to extract the signatures present in your own data using SigProfilerExtractor and then compare them with the COSMIC mutational signatures. This latter step is also performed by SigProfilerExtractor, in what is called the decomposed solution.

On the other hand, SigProfilerSingleSample can also be used, but in principle it is intended for those cases where the number of samples is so low that you will not have the statistical power needed to confidently extract signatures from your samples. In that case, the COSMIC signatures are used as an external resource, so that the extraction process can be skipped and only the attribution part is performed.

Regarding your issues, it seems that you also have a lot of errors like this in the SigProfilerMatrixGenerator output:

The reference base does not match the reference chromosome position. Skipping this mutation:

These indicate that the reference genome is not correct. Is it possible that you have a mix of samples processed with different reference genomes in your sample pool?
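One way to look for samples aligned to a different build is to compare each VCF's REF allele with the base at that position in the reference sequence. The sketch below is only an illustration under stated assumptions: `reference_mismatch_fraction` is a hypothetical helper, and `genome` is a plain dict of chromosome name to sequence string that you would have to load yourself. It is not how SigProfilerMatrixGenerator performs its own check.

```python
def reference_mismatch_fraction(vcf_lines, genome):
    """Return the fraction of single-base records whose REF disagrees with `genome`.

    `genome` is a hypothetical dict mapping chromosome name -> sequence string.
    Records on chromosomes missing from `genome`, records with a multi-base
    REF allele, and positions outside the sequence are skipped.
    """
    checked = 0
    mismatched = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete CHROM/POS/ID/REF/ALT record
        chrom, pos, ref = fields[0], int(fields[1]), fields[3].upper()
        seq = genome.get(chrom)
        if seq is None or len(ref) != 1 or not 1 <= pos <= len(seq):
            continue
        checked += 1
        if seq[pos - 1].upper() != ref:  # VCF positions are 1-based
            mismatched += 1
    return mismatched / checked if checked else 0.0
```

A sample whose mismatch fraction is far above the rest of the cohort is a reasonable candidate for having been called against a different reference genome.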

I would suggest running SigProfilerSingleSample on subsets of samples, since the output of this tool will be exactly the same as if you performed the analysis on the whole set of samples. That can help you figure out which samples are problematic and take a closer look at them.
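Splitting the input folder into subsets can be scripted. This is a minimal sketch using only the Python standard library; `split_into_batches`, the `batch_NNN` folder naming, and the `*.vcf` filename pattern are assumptions made for illustration, not part of SigProfilerSingleSample.

```python
import shutil
from pathlib import Path


def split_into_batches(vcf_dir, batch_root, batch_size=100):
    """Copy the VCFs in `vcf_dir` into numbered subfolders of `batch_root`.

    Returns the list of batch folders created, in order. Files are copied
    rather than moved so the original folder stays untouched.
    """
    vcf_files = sorted(Path(vcf_dir).glob("*.vcf"))
    batch_dirs = []
    for start in range(0, len(vcf_files), batch_size):
        batch_dir = Path(batch_root) / f"batch_{start // batch_size + 1:03d}"
        batch_dir.mkdir(parents=True, exist_ok=True)
        for vcf in vcf_files[start:start + batch_size]:
            shutil.copy2(vcf, batch_dir / vcf.name)
        batch_dirs.append(batch_dir)
    return batch_dirs
```

Each returned folder could then be passed as `data` to the `spss.single_sample(...)` call quoted earlier in this thread, with a separate output folder per batch. Comparing the number of samples in each batch's output with the number of VCFs in that batch narrows down where samples are being dropped.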

Hope this helps!

anishaluthra commented 4 years ago

Hi Marcos,

Thank you for the explanation! I understand that SigProfilerSingleSample is intended for when we have fewer samples, but I'm wondering how the output would be different than the decomposed solution, since in the end, they both would have the same mutations attributed to the COSMIC signatures. Am I missing something? I think the reason I wanted to use SigProfilerSingleSample is because by running SigProfilerExtractor, it seemed like the extraction process took quite some time, so I thought I would be able to save time by running SigProfilerSingleSample.

Regarding the issues, thank you for your suggestion - I will try running SigProfilerSingleSample on subsets of samples.

lalexandrov1018 commented 4 years ago

Hi Anisha,

Just to add to Marcos's comments. SigProfilerSingleSample examines each sample individually and it should be used when you want to assign activities (i.e., numbers of mutations) of known COSMIC signatures. In contrast, SigProfilerExtractor examines a set of samples and discovers the signatures operative in these samples. As such, SigProfilerExtractor can identify novel signatures in your dataset that are not found in COSMIC. Note that SigProfilerExtractor will match de novo extracted signatures to COSMIC signatures in order to identify any novel signatures.

Currently, you will get different outputs from SigProfilerExtractor and SigProfilerSingleSample. We are in the process of synchronizing the implementations of these two tools, and they should start yielding very similar results. As for recommendations, it is best to run SigProfilerExtractor for a large number of samples. The code is indeed slower, but it will find any potentially novel signatures as well as discover the baseline set of COSMIC signatures found in your data. You can also run SigProfilerSingleSample on the data; however, by default the tool will examine each sample separately and consider all existing COSMIC signatures.

Hope this helps!

anishaluthra commented 4 years ago

Understood - thank you for the very clear explanation!

I do have one more question - I am running SigProfilerSingleSample on a smaller subset of samples. I am getting this warning message in my output log (which I am also attaching): MT is not supported. You will need to download that chromosome and create the required files. Continuing with the matrix generation...

I've confirmed that these samples are processed with reference genome GRCh37. Should chromosome MT be part of the GRCh37 genome? I'm wondering why it would not be.

SigProfilerMatrixGenerator_210_GRCh372020-07-20.txt

lalexandrov1018 commented 4 years ago

We currently do not support mutations on the mitochondrial genome. This will be fixed in future versions.
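Until then, the warning is harmless, since the MT records are simply skipped. If you would rather not see it, one option is to drop mitochondrial records from a copy of each VCF before running the tool. The sketch below is an assumption-laden illustration: `drop_mitochondrial_records` is a hypothetical helper, and the set of contig names is a guess at common spellings, not a list taken from the SigProfiler tools.

```python
# Common spellings of the mitochondrial contig; this list is an assumption.
MITO_NAMES = {"MT", "M", "chrM", "chrMT"}


def drop_mitochondrial_records(vcf_lines):
    """Yield VCF header lines and all records not on a mitochondrial contig."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep header lines unchanged
            continue
        chrom = line.split("\t", 1)[0]
        if chrom not in MITO_NAMES:
            yield line
```

The function works on any iterable of lines, so it can be applied to an open file handle and the result written to a new file, leaving the original VCF intact.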

anishaluthra commented 4 years ago

Got it - thank you so much for your clear responses!

LawrenceH622 commented 4 months ago

Hi there, I ran into the same issue as anishaluthra when running SigProfilerAssignment. I passed the path to a folder of samples to the SigProfilerAssignment command, but it did not read all the samples in the folder, only a few of them, and I haven't seen any error. Is there a solution to this? Thanks. My script is shown below.

    Analyze.cosmic_fit(
        samples="/staging/biology/U123/convertion_testing/result/O7_hg38/",
        output="/staging/biology/u123/convertion_testing/result/O7_hg38/Signature",
        input_type="vcf",
        context_type="96",
        genome_build="GRCh38",
        cosmic_version=3.4,
        sample_reconstruction_plots="both",
    )