AlexandrovLab / SigProfilerMatrixGenerator

SigProfilerMatrixGenerator creates mutational matrices for all types of somatic mutations. It allows downsizing the generated mutations to only parts of the genome (e.g., exome or a custom BED file). The tool seamlessly integrates with other SigProfiler tools.
BSD 2-Clause "Simplified" License

For simulating mutations in chrY, I need to check whether the number of mutations in chrY is greater than zero using SPMG. #201

Closed burcakotlu closed 3 weeks ago

burcakotlu commented 1 month ago

Dear all,

Normally, SigProfilerSimulator simulates with gender='female' by default. But I want to check whether there are mutations on chrY and, if so, run SigProfilerSimulator with gender='male'.

For this purpose, I called SPMG with chrom_based=True. However, I couldn't see any chr-based key in the returned matrices even though I called SPMG with chrom_based=True.

Please have a look below:

matrices = matGen.SigProfilerMatrixGeneratorFunc(jobname, genome, inputDir, plot=False, seqInfo=seqInfo)

matrices.keys(): dict_keys(['6144', '384', '1536', '96', '6', '24', '4608', '288', '18', 'DINUC', 'ID'])

chrom_based_matrices = matGen.SigProfilerMatrixGeneratorFunc(jobname, genome, inputDir, chrom_based=True, plot=False, seqInfo=seqInfo)

chrom_based_matrices.keys(): dict_keys(['6144', '384', '1536', '96', '6', '24', '4608', '288', '18'])

My question is whether there is a way to learn the number of mutations on chrY by calling SPMG.

Thanks, Burcak

mdbarnesUCSD commented 1 month ago

Dear @burcakotlu,

When chrom_based is True, the output directory will contain a file for the Y chromosome, named similarly to _exampleproject.SBS6.all.chrY. Each column represents a sample, so by checking whether a sample's column contains any non-zero count, you can determine whether that sample has mutations on chrY.
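As a rough illustration of the check described above, the following sketch sums each sample column of a chrY matrix file. It assumes the file is tab-separated, with a header row, the mutation type in the first column, and one column per sample; the function name count_mutations_per_sample is hypothetical and not part of SPMG.

```python
import csv


def count_mutations_per_sample(matrix_path):
    """Sum every sample column of a tab-separated mutational matrix file.

    Assumed layout (not confirmed by the SPMG docs): a header row whose
    first cell labels the mutation type and whose remaining cells are
    sample names, followed by one row of counts per mutation type.
    """
    with open(matrix_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        samples = header[1:]
        totals = dict.fromkeys(samples, 0)
        for row in reader:
            if not row:  # tolerate trailing blank lines
                continue
            for sample, value in zip(samples, row[1:]):
                totals[sample] += int(float(value))
    return totals
```

A sample has chrY mutations when its total is greater than zero, e.g. `[s for s, n in count_mutations_per_sample(path).items() if n > 0]`, where `path` points at the chrY file written by a chrom_based=True run.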

burcakotlu commented 1 month ago

Dear @mdbarnesUCSD,

Thanks for the explanation. I was checking those files. I have the latest versions of the tools: SigProfilerMatrixGenerator 1.2.30 and SigProfilerSimulator 1.1.6

Chrom-based files were created without "chrom_based=True" with the following call: matrices = matGen.SigProfilerMatrixGeneratorFunc(jobname, genome, inputDir, plot=False, seqInfo=True)

For testing purposes, I used the following 2 VCF files:
PD39500a.caveman_strelka2_filtered.consensus_snv.vcf
PD39500a.pindel_strelka2_filtered.consensus_indel.vcf
Both can be found under /tscc/lustre/restricted/alexandrov-ddn/users/burcak/SigProfilerTopographyRuns/Mutographs_ESCC_552/test_samples

I called SPMG for these 2 VCF files. The returned matrices didn't have any key for indels. The keys are as follows: 6144, 384, 1536, 96, 6, 24, 4608, 288. However, one of the VCF files contains indels, and indel output was written under /tscc/lustre/restricted/alexandrov-ddn/users/burcak/SigProfilerTopographyRuns/Mutographs_ESCC_552/test_samples/output/ID as a result of the SPMG call.

Minor: Why are chrom_based files generated even if chrom_based isn't set to True? Is it due to seqInfo=True?
Major: Why can't I get any key for indels in the returned matrices after the SPMG call?

Thanks, Burcak

mdbarnesUCSD commented 1 month ago

Thanks for sharing the command and files for reproducing the issue. I tested with chrom_based=False and a standard matrix was returned and no chrom_based files were generated. I suspect the chrom_based files exist in your environment because they were generated during a previous run where chrom_based=True.

There is some inconsistent behavior with how the indel matrices are returned, though the matrices are written to file. I observed that when chrom_based=False the ID matrix is returned, but when chrom_based=True there is no ID matrix returned.

The issue originates from this line for indels (and also doublet base substitutions): SigProfilerMatrixGeneratorFunc.py - line 2838

SigProfilerMatrixGeneratorFunc.py - line 2078

We will include a patch for these in the next release.

burcakotlu commented 1 month ago

Dear all,

I deleted all the directories and made a clean run with the following call: matrices = matGen.SigProfilerMatrixGeneratorFunc(jobname, genome, inputDir, plot=False, seqInfo=True, chrom_based=False)

DEBUG mutation_types: ['SBS', 'DBS', 'ID']
DEBUG filepath: /tscc/lustre/restricted/alexandrov-ddn/users/burcak/SigProfilerTopographyRuns/Mutographs_ESCC_552/test_samples/output/SBS/test_samples.SBS96.all.chrY
DEBUG filepath: /tscc/lustre/restricted/alexandrov-ddn/users/burcak/SigProfilerTopographyRuns/Mutographs_ESCC_552/test_samples/output/DBS/test_samples.DBS78.all.chrY
DEBUG filepath: /tscc/lustre/restricted/alexandrov-ddn/users/burcak/SigProfilerTopographyRuns/Mutographs_ESCC_552/test_samples/output/ID/test_samples.ID83.all.chrY
DEBUG chrY_num_of_mutations: 0

The matrices have keys for the SBS, DBS and ID mutation types. The chrom-based files are written at some point, but only later, so when I check for the number of mutations on chrY, the chrY files cannot be reached yet. The chrom-based files may be written because of seqInfo=True.

I deleted all the directories and made a clean run with the following call: matrices = matGen.SigProfilerMatrixGeneratorFunc(jobname, genome, inputDir, plot=False, seqInfo=True, chrom_based=True)

DEBUG mutation_types: ['SBS']
DEBUG filepath: /tscc/lustre/restricted/alexandrov-ddn/users/burcak/SigProfilerTopographyRuns/Mutographs_ESCC_552/test_samples/output/SBS/test_samples.SBS96.all.chrY
DEBUG filepath: /tscc/lustre/restricted/alexandrov-ddn/users/burcak/SigProfilerTopographyRuns/Mutographs_ESCC_552/test_samples/output/SBS/test_samples.SBS96.all.chrY exists
DEBUG chrY_num_of_mutations: 110

The matrices have keys only for the SBS mutation types; there is no key for the DBS or ID mutation types. The chrom-based files are written, so when I check for the number of mutations on chrY, the chrY files can be reached, but for SBS only.

I need the mutation types for the existing mutations, and I need to reach the chr-based files for all mutation types so that I can determine whether there are mutations on chrY.

Thanks, Burcak

mdbarnesUCSD commented 1 month ago

It seems that there are two things that may be happening.

If you want a matrix file for the mutations on chromosome Y, then you will need to run with chrom_based=True. In this case, the chromosome-based matrix will not be returned in memory, so you will need to read it from the file output/SBS/project_name.SBS96.all.chrY.

If you want information on the mutation context for each mutation on chromosome Y, you will need to run seqInfo=True. This will produce output/vcf_files/SNV/Y_seqinfo.txt. This file will be generated regardless of whether there are mutations on chromosome Y or not.

The parameters chrom_based and seqInfo are independent of each other.
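A minimal sketch of how the chrY files could be read back after a chrom_based=True run and used to pick the simulator gender. The file names follow the pattern seen in this thread (project.SBS96.all.chrY, project.DBS78.all.chrY, project.ID83.all.chrY); the tab-separated layout with one column per sample, and the helper names chry_counts_by_type and pick_gender, are assumptions made for illustration, not SPMG functions.

```python
import csv
from pathlib import Path

# One context per mutation type, taken from the file names shown in this
# thread; using a single context avoids counting the same mutation twice.
CHRY_CONTEXTS = {"SBS": "SBS96", "DBS": "DBS78", "ID": "ID83"}


def chry_counts_by_type(output_dir, project):
    """Return {mutation type: total chrY count} for the chrY files that exist."""
    counts = {}
    for mut_type, context in CHRY_CONTEXTS.items():
        path = Path(output_dir) / mut_type / f"{project}.{context}.all.chrY"
        if not path.is_file():
            continue  # this mutation type has no chrY file in this run
        with path.open(newline="") as handle:
            reader = csv.reader(handle, delimiter="\t")
            next(reader, None)  # skip the header row of sample names
            counts[mut_type] = sum(
                int(float(value)) for row in reader for value in row[1:]
            )
    return counts


def pick_gender(counts):
    """Choose 'male' if any chrY mutation was counted, otherwise 'female'."""
    return "male" if any(n > 0 for n in counts.values()) else "female"
```

For example, `pick_gender(chry_counts_by_type("test_samples/output", "test_samples"))` would give the value to pass as the simulator's gender argument. Missing files are skipped rather than treated as errors, since not every run produces every mutation type.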

mdbarnesUCSD commented 3 weeks ago

Please re-open if you are still encountering issues.

burcakotlu commented 3 weeks ago

If SPMG is run with seqInfo=True and chrom_based=True, do the matrices returned by SPMG have keys for all given mutation types (SBS, DBS, and ID) in this specific case?

Also, can we get the number of mutations on ChrY by reading the corresponding files immediately after the SPMG call?

Thanks

mdbarnesUCSD commented 3 weeks ago

The v1.2.31 release resolves the issue of the matrices not being returned. The number of mutations on ChrY can be determined by reading the corresponding files immediately after the SPMG call. Thanks!