AlexandrovLab / SigProfilerAssignment

Assignment of known mutational signatures to individual samples and individual somatic mutations
BSD 2-Clause "Simplified" License

Extract mutations assigned to each SBS signature #148

Status: Closed (itigupta2429 closed this issue 3 weeks ago)

itigupta2429 commented 1 month ago

Hi Team, I have a question. I would like to extract which mutations are assigned to which SBS signature for each sample. Is there a way to do that? I am using SigProfilerAssignment. Thanks in advance.

mdbarnesUCSD commented 1 month ago

Hi @itigupta2429,

If you are looking to assign COSMIC signatures to a set of samples then you will want to use cosmic_fit. Please reach out if you have any additional questions.

itigupta2429 commented 1 month ago

Hi @mdbarnesUCSD,

I am using the following command:

Analyze.cosmic_fit(
    samples="./test/BRCA_vcf", 
    output="test_vcf", 
    input_type="vcf", 
    genome_build="GRCh37", 
    context_type="96", 
    export_probabilities_per_mutation="TRUE", 
    export_probabilities="TRUE"
)

After running this, I expected to find a file that contains mutation-level assignments, indicating which SBS signature is linked to each mutation from the input VCF. However, after thoroughly reviewing all the output files, I can't locate any file that provides this information.

Could you please clarify where this file should be generated or if there’s an additional step required to obtain mutation-wise signature assignments?

Thanks for your help!

mdbarnesUCSD commented 1 month ago

Thanks for clarifying. It seems that you would like to export the probabilities per mutation file. I suspect the issue is that you are currently passing the string "TRUE". Please try again using the boolean True as shown below.

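from SigProfilerAssignment import Analyzer as Analyze

Analyze.cosmic_fit(
    samples="./test/BRCA_vcf",
    output="test_vcf",
    input_type="vcf",
    genome_build="GRCh37",
    context_type="96",
    # Pass Python booleans here, not the strings "TRUE"/"FALSE".
    export_probabilities_per_mutation=True,
    export_probabilities=True,
)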

Please let us know if this resolves the issue. Thanks!
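As background on why the string and the boolean can behave differently: in Python every non-empty string is truthy, including "FALSE", so the effect of passing "TRUE" depends on how the receiving function tests the flag. The sketch below contrasts the two common styles of check. Both helper functions are hypothetical and are not part of SigProfilerAssignment; the thread does not say which style the library uses.

```python
def passes_truthiness_check(flag):
    """Mimics a check written as `if flag:`; any non-empty string passes."""
    return bool(flag)


def passes_identity_check(flag):
    """Mimics a check written as `if flag is True:`; only the boolean True passes."""
    return flag is True


# Compare strings and real booleans under both styles of check.
for value in ("TRUE", "FALSE", True, False):
    print(repr(value), passes_truthiness_check(value), passes_identity_check(value))
```

Passing real booleans sidesteps the question, because both styles of check agree on True and False.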

itigupta2429 commented 1 month ago

Thanks for your quick response! Even with the earlier code, I was getting the probabilities per mutation file (inside BRCA_vcf/output/vcf_files/SNV). The folder contains chromosome-wise files, and one of them contains rows like these:

PD4120a 10 71718 N:GA[T>A]CA 1
PD4120a 10 115370 N:AC[T>A]AC -1
PD4120a 10 117751 N:CT[C>A]AG -1
PD4120a 10 212461 U:TT[C>T]AG 1
PD4120a 10 247953 U:AC[C>G]TG 1
PD4120a 10 311033 N:AC[C>T]GG -1
PD4120a 10 369240 T:GG[C>G]TC 1
PD4120a 10 387315 T:CT[C>G]AG 1
PD4120a 10 442142 T:TT[C>G]AA 1
PD4120a 10 471214 T:AG[C>T]GG 1
PD4120a 10 484448 U:CT[C>G]AA -1
PD4120a 10 520650 T:CT[C>G]TT 1
PD4120a 10 646938 T:CT[C>T]AT 1
PD4120a 10 657996 U:CT[C>T]CC -1

I have a couple of questions regarding this:

  1. How can I tell which SBS signature this mutation (chr10 71718) is assigned to?
  2. What do +1 and -1 mean here?

mdbarnesUCSD commented 3 weeks ago

Hi @itigupta2429,

That is an output file from SigProfilerMatrixGenerator when generating the mutational count matrix and is not the probabilities file. The "+1" denotes the reference strand and the "-1" denotes the reverse complement of the reference strand.
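To make the layout of those rows concrete, here is a minimal parsing sketch. It assumes each row has five whitespace-separated fields in the order seen in the rows quoted earlier in this thread: sample, chromosome, position, a one-letter prefix joined by a colon to the mutation context, and the strand flag. The field names and the `parse_row` helper are illustrative choices, not SigProfilerMatrixGenerator API, and the meaning of the one-letter prefix is not covered in this thread.

```python
import re
from typing import NamedTuple


class MutationRow(NamedTuple):
    sample: str
    chrom: str
    pos: int
    prefix: str   # one-letter code before the colon; not explained in this thread
    context: str  # e.g. "GA[T>A]CA"
    strand: int   # 1 = reference strand, -1 = reverse complement (per the reply above)


# sample, chromosome, position, PREFIX:CONTEXT, strand flag (1, +1 or -1)
ROW_RE = re.compile(r"^(\S+)\s+(\S+)\s+(\d+)\s+([A-Z]):(\S+)\s+([+-]?1)$")


def parse_row(line):
    """Split one row of the per-chromosome file into named fields."""
    match = ROW_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised row: {line!r}")
    sample, chrom, pos, prefix, context, strand = match.groups()
    return MutationRow(sample, chrom, int(pos), prefix, context, int(strand))


print(parse_row("PD4120a 10 71718 N:GA[T>A]CA 1"))
```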

The probabilities files produced by export_probabilities_per_mutation=True are written per sample and are located in Assignment_Solution/Activities/Decomposed_Mutation_Probabilities/.
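A quick way to check whether those files were produced is to list the folder. This sketch assumes the Assignment_Solution folder sits directly under the directory passed as output (test_vcf in the command above). The thread does not state the file naming pattern, so the hypothetical helper `list_probability_files` simply returns every file it finds there.

```python
from pathlib import Path

# Sub-path quoted in the reply above; its location under `output` is an assumption.
PROB_SUBDIR = Path("Assignment_Solution") / "Activities" / "Decomposed_Mutation_Probabilities"


def list_probability_files(output_dir):
    """Return the files in the per-mutation probabilities folder, sorted by name."""
    prob_dir = Path(output_dir) / PROB_SUBDIR
    if not prob_dir.is_dir():
        return []  # folder missing: the export probably did not run
    return sorted(p for p in prob_dir.iterdir() if p.is_file())


for path in list_probability_files("test_vcf"):
    print(path.name)
```

An empty result suggests the export flag did not take effect or that a different output directory was used.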

Please re-open this issue if you have any questions or are unable to produce the probabilities file.

Edit: Corrected path to the output files for when export_probabilities_per_mutation=True.

lntran26 commented 1 week ago

Hi @mdbarnesUCSD, I want to follow up on this issue since I'm trying to do the same thing. I was able to generate the Decomposed_Mutation_Probabilities.txt file and have a clarifying question about its interpretation. For example, below are one variant's probabilities across all COSMIC SBS signatures. Does SigProfiler then simply assign this variant to the SBS signature with the highest probability (e.g. 0.5122639502012157 for SBS4 in this case), or are the probabilities processed in some more sophisticated way before one SBS signature is assigned to this variant?

0.0003005887660689532 0.0 0.0 0.5122639502012157 0.11459261568340642 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.026471443734749123 0.0 0.0 0.0 0.04683144618898701 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2995399554255728 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

If it simply assigns the variant to the SBS signature with the highest probability, are there any concerns about cases like the one below, where two or more of the signatures have much more similar probabilities (0.3289042831025004 and 0.3476818137313792)?

0.0008258735882063193 0.0 0.0 0.3289042831025004 0.3476818137313792 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00930894348972198 0.0 0.0 0.0 0.0524782082800048 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2608008778081873 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Thanks in advance!
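The thread does not confirm whether SigProfiler makes a hard call at all, so the following is only a sketch of how a user might post-process one probability row themselves: take the highest-probability column, and flag the call as ambiguous when the runner-up is close. The `call_signature` helper, the `colN` labels, and the 0.1 margin threshold are all assumptions made for illustration. The two example rows use only the first five columns of the rows quoted above, for brevity.

```python
def call_signature(probs, labels=None, min_margin=0.1):
    """Pick the top-probability column and report how far ahead of the runner-up it is."""
    if len(probs) < 2:
        raise ValueError("need at least two probabilities to compare")
    if labels is None:
        labels = [f"col{i}" for i in range(len(probs))]  # placeholder labels
    # Column indices ordered from highest to lowest probability.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top, runner_up = ranked[0], ranked[1]
    margin = probs[top] - probs[runner_up]
    return {
        "label": labels[top],
        "probability": probs[top],
        "runner_up": labels[runner_up],
        "margin": margin,
        "ambiguous": margin < min_margin,  # arbitrary illustrative threshold
    }


clear_row = [0.0003005887660689532, 0.0, 0.0, 0.5122639502012157, 0.11459261568340642]
close_row = [0.0008258735882063193, 0.0, 0.0, 0.3289042831025004, 0.3476818137313792]

print(call_signature(clear_row))
print(call_signature(close_row))
```

With real files, the labels would come from the header of the probabilities file rather than from column positions.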

mdbarnesUCSD commented 1 week ago

Hi @lntran26,

Apologies for any confusion. I previously provided the incorrect path to the results produced when export_probabilities_per_mutation=True. The comment has now been updated with the correct path.

lntran26 commented 1 week ago

@mdbarnesUCSD Thanks for the updated path. I took a look at the output files there but still have the same questions as above. Could you confirm whether the SigProfiler algorithm simply calls the SBS signature with the highest probability, or is it more complicated? I do notice that the results here tend to skew much more toward a dominant SBS signature compared to the previous path:

6.040989045139073e-05 0.0 0.0 0.7239713074945375 0.14000906911576955 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.013243712486085144 0.0 0.0 0.0 0.03891348241870776 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.08380201859444865 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

but there are still variants where it can be quite close, e.g. between two signatures:

0.0001331437736342248 0.0 0.0 0.015046617411096077 0.4462584121397326 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.04714382614493876 0.0 0.0 0.0 0.114888454105007 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.3765295464255912 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

or even three signatures:

0.0004209922073199189 0.0 0.0 0.09195513113684133 0.2659303843753061 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.08492596007440775 0.0 0.0 0.0 0.23650530304941902 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.32026222915670594 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
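
When several signatures have similar probabilities, as in the rows above, one option that avoids forcing a single call per variant is to keep the probabilities as they are and sum them column by column. Each column total is then the expected number of mutations attributable to that signature. This is a generic soft-assignment sketch and not something the thread describes SigProfiler doing. The `expected_counts` helper and the three-column toy rows are made up for illustration.

```python
def expected_counts(rows):
    """Sum per-mutation probability rows column by column."""
    if not rows:
        return []
    width = len(rows[0])
    totals = [0.0] * width
    for row in rows:
        if len(row) != width:
            raise ValueError("all rows must have the same number of columns")
        for i, p in enumerate(row):
            totals[i] += p
    return totals


# Toy data: three mutations, three signatures; each row sums to 1.
toy_rows = [
    [0.7, 0.3, 0.0],
    [0.4, 0.6, 0.0],
    [0.5, 0.1, 0.4],
]

print(expected_counts(toy_rows))
```

Because every row sums to 1, the column totals sum to the number of mutations, so nothing is lost or double counted.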