Closed by burcakotlu 3 weeks ago
Dear @burcakotlu,
When chrom_based is True, the output directory will contain a file for the Y chromosome with a name similar to _exampleproject.SBS6.all.chrY. Each column represents a sample, so by checking whether a sample's column is non-zero, you can determine whether that sample has mutations on chrY.
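As a rough illustration of that check, the sketch below sums each sample column of one chromosome-based matrix file using only the standard library. It assumes the file is tab-separated, with a header row whose first column holds the mutation type and whose remaining columns are sample names. The function name chr_y_counts_per_sample is hypothetical and is not part of SigProfilerMatrixGenerator.

```python
import csv


def chr_y_counts_per_sample(matrix_path):
    """Return {sample_name: total mutation count} for one matrix file.

    Assumed layout (not confirmed in this thread): tab-separated, one header
    row, first column = mutation type, remaining columns = one per sample.
    """
    with open(matrix_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        samples = header[1:]
        totals = {sample: 0 for sample in samples}
        for row in reader:
            if not row:
                continue  # tolerate blank lines
            for sample, value in zip(samples, row[1:]):
                totals[sample] += int(float(value))
    return totals
```

A sample whose total is greater than zero has at least one mutation on chrY in that matrix.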
Dear @mdbarnesUCSD,
Thanks for the explanation. I was checking those files. I have the latest versions of the tools: SigProfilerMatrixGenerator 1.2.30 and SigProfilerSimulator 1.1.6.
Chrom-based files were created even without chrom_based=True, with the following call: matrices = matGen.SigProfilerMatrixGeneratorFunc(jobname, genome, inputDir, plot=False, seqInfo=True)
For testing purposes, I used the following 2 vcf files:
PD39500a.caveman_strelka2_filtered.consensus_snv.vcf
PD39500a.pindel_strelka2_filtered.consensus_indel.vcf
which can be found under
/tscc/lustre/restricted/alexandrov-ddn/users/burcak/SigProfilerTopographyRuns/Mutographs_ESCC_552/test_samples
I called SPMG on these 2 vcf files. The returned matrices didn't have any key for indels. The keys are as follows: 6144, 384, 1536, 96, 6, 24, 4608, 288. But one of the vcf files contains indels, and the SPMG call did write indel output under /tscc/lustre/restricted/alexandrov-ddn/users/burcak/SigProfilerTopographyRuns/Mutographs_ESCC_552/test_samples/output/ID.
Minor: Why are chrom_based files generated even if chrom_based isn't set to True? Is it due to seqInfo=True? Major: Why can't I get any key for indels in the returned matrices after the SPMG call?
Thanks, Burcak
Thanks for sharing the command and files for reproducing the issue. I tested with chrom_based=False: a standard matrix was returned and no chrom_based files were generated. I suspect the chrom_based files exist in your environment because they were generated during a previous run where chrom_based=True.
There is some inconsistent behavior in how the indel matrices are returned, though the matrices are written to file. I observed that when chrom_based=False the ID matrix is returned, but when chrom_based=True no ID matrix is returned.
The issue originates from the following lines for indels (and also doublet base substitutions):
SigProfilerMatrixGeneratorFunc.py - line 2838
SigProfilerMatrixGeneratorFunc.py - line 2078
We will include a patch for these in the next release.
Dear all,
I deleted all the directories and made a clean run with the following call: matrices = matGen.SigProfilerMatrixGeneratorFunc(jobname, genome, inputDir, plot=False, seqInfo=True, chrom_based=False)
DEBUG mutation_types: ['SBS', 'DBS', 'ID']
DEBUG filepath: /tscc/lustre/restricted/alexandrov-ddn/users/burcak/SigProfilerTopographyRuns/Mutographs_ESCC_552/test_samples/output/SBS/test_samples.SBS96.all.chrY
DEBUG filepath: /tscc/lustre/restricted/alexandrov-ddn/users/burcak/SigProfilerTopographyRuns/Mutographs_ESCC_552/test_samples/output/DBS/test_samples.DBS78.all.chrY
DEBUG filepath: /tscc/lustre/restricted/alexandrov-ddn/users/burcak/SigProfilerTopographyRuns/Mutographs_ESCC_552/test_samples/output/ID/test_samples.ID83.all.chrY
DEBUG chrY_num_of_mutations: 0
The matrices have keys for the SBS, DBS and ID mutation types. Chrom-based files are written at some point, but only later, so when I check for the number of mutations on chrY, my check cannot find the chrY files. The chrom-based files are maybe written because of seqInfo=True.
I deleted all the directories and made a clean run with the following call: matrices = matGen.SigProfilerMatrixGeneratorFunc(jobname, genome, inputDir, plot=False, seqInfo=True, chrom_based=True)
DEBUG mutation_types: ['SBS']
DEBUG filepath: /tscc/lustre/restricted/alexandrov-ddn/users/burcak/SigProfilerTopographyRuns/Mutographs_ESCC_552/test_samples/output/SBS/test_samples.SBS96.all.chrY
DEBUG filepath: /tscc/lustre/restricted/alexandrov-ddn/users/burcak/SigProfilerTopographyRuns/Mutographs_ESCC_552/test_samples/output/SBS/test_samples.SBS96.all.chrY exists
DEBUG chrY_num_of_mutations: 110
The matrices have keys only for SBS mutation types. There is no key for the DBS or ID mutation types. Chrom-based files are written, so when I check for the number of mutations on chrY, my check can find the chrY files, but for SBS only.
I need the mutation types of the mutations that are actually present. I need access to the chrom-based files for all mutation types so that I can tell whether there are mutations on chrY.
Thanks, Burcak
It seems that there are two things that may be happening.
If you want a matrix file for the mutations on chromosome Y, then you will need to run with chrom_based=True. In this case, the chromosome-based matrix will not be returned in memory, so you will need to read it from the file output/SBS/project_name.SBS96.all.chrY.
If you want information on the mutation context for each mutation on chromosome Y, you will need to run with seqInfo=True. This will produce output/vcf_files/SNV/Y_seqinfo.txt. This file will be generated regardless of whether there are mutations on chromosome Y or not.
The parameters chrom_based and seqInfo are independent of each other.
Please re-open if you are still encountering issues.
If SPMG is run with seqInfo=True and chrom_based=True, do the matrices returned by SPMG have keys for all of the given mutation types (SBS, DBS, and ID) in this specific case?
Also, can we get the number of mutations on ChrY by reading the corresponding files immediately after the SPMG call?
Thanks
The v1.2.31 release resolves the issue of the matrices not being returned. The number of mutations on chrY can be determined by reading the corresponding files immediately after the SPMG call. Thanks!
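For reference, here is a minimal sketch of reading those files right after a run with chrom_based=True. The directory and file-name pattern (output/SBS/<project>.SBS96.all.chrY, and likewise DBS78 and ID83) are taken from the paths quoted earlier in this thread. The tab-separated layout with a header row and a leading mutation-type column is an assumption, and count_chr_y_mutations is a hypothetical helper, not an SPMG function.

```python
import csv
import os

# File-name suffixes taken from the paths quoted in this thread.
CHR_Y_FILES = {
    "SBS": "SBS96.all.chrY",
    "DBS": "DBS78.all.chrY",
    "ID": "ID83.all.chrY",
}


def count_chr_y_mutations(output_dir, project):
    """Return {mutation_type: total chrY count, or None if the file is absent}."""
    counts = {}
    for mutation_type, suffix in CHR_Y_FILES.items():
        path = os.path.join(output_dir, mutation_type, f"{project}.{suffix}")
        if not os.path.exists(path):
            counts[mutation_type] = None  # no chrY file for this type
            continue
        total = 0
        with open(path, newline="") as handle:
            reader = csv.reader(handle, delimiter="\t")
            next(reader, None)  # skip the assumed header row of sample names
            for row in reader:
                # row[0] is assumed to be the mutation type label
                total += sum(int(float(value)) for value in row[1:])
        counts[mutation_type] = total
    return counts
```

Returning None for a missing file keeps "no file was written" distinct from "file exists but holds zero mutations", which matters when a mutation type is absent from the input.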
Dear all,
Normally, SigProfilerSimulator simulates with gender='female' by default. But I want to check whether there are mutations on chrY and, if there are, run SigProfilerSimulator with gender='male'.
For this purpose, I called SPMG with chrom_based=True. However, I couldn't see any chrom-based key even though I called SPMG with chrom_based=True.
Please have a look below: matrices = matGen.SigProfilerMatrixGeneratorFunc(jobname, genome, inputDir, plot=False, seqInfo=seqInfo)
matrices.keys(): dict_keys(['6144', '384', '1536', '96', '6', '24', '4608', '288', '18', 'DINUC', 'ID'])
chrom_based_matrices = matGen.SigProfilerMatrixGeneratorFunc(jobname, genome, inputDir, chrom_based=True, plot=False, seqInfo=seqInfo)
chrom_based_matrices.keys(): dict_keys(['6144', '384', '1536', '96', '6', '24', '4608', '288', '18'])
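To make the difference between the two results explicit, the two key sets can be compared directly. This is only a sketch that copies the keys from the outputs quoted above; it does not call SPMG.

```python
# Keys copied from the two outputs quoted above.
standard_keys = {'6144', '384', '1536', '96', '6', '24', '4608', '288', '18', 'DINUC', 'ID'}
chrom_based_keys = {'6144', '384', '1536', '96', '6', '24', '4608', '288', '18'}

# Keys present in the standard run but absent from the chrom_based=True run.
missing = sorted(standard_keys - chrom_based_keys)
print(missing)  # ['DINUC', 'ID']
```

In other words, the chrom_based=True call returns the same SBS keys but drops the DINUC and ID entries.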
My question is whether there is a way to learn the number of mutations on chrY by calling SPMG.
Thanks, Burcak
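A minimal sketch of the decision described in this comment, assuming the chrY counts per mutation type have already been collected into a dictionary (for example, by reading the chrom-based files). The helper name choose_gender is hypothetical. The value 110 in the usage line is the SBS chrY count reported elsewhere in this thread, while the other two entries are placeholders.

```python
def choose_gender(chr_y_counts):
    """Return 'male' if any mutation type has a positive chrY count, else 'female'.

    chr_y_counts maps mutation type -> count; None marks a missing file.
    """
    has_chr_y = any(count for count in chr_y_counts.values() if count is not None)
    return "male" if has_chr_y else "female"


# Placeholder input: SBS count from the thread, DBS and ID values illustrative.
gender = choose_gender({"SBS": 110, "DBS": 0, "ID": None})
print(gender)  # male
```

The resulting string could then be passed as the gender argument mentioned above when calling SigProfilerSimulator.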