AlexandrovLab / SigProfilerMatrixGenerator

SigProfilerMatrixGenerator creates mutational matrices for all types of somatic mutations. It allows downsizing the generated mutations to only parts of the genome (e.g., exome or a custom BED file). The tool seamlessly integrates with other SigProfiler tools.
BSD 2-Clause "Simplified" License

SigProfilerMatrixGenerator only generates SBS/DBS/ID matrices for 3400+ samples? #94

Closed by VictorZheng1010 2 years ago

VictorZheng1010 commented 2 years ago

Hello,

I've downloaded 8000+ TCGA samples to analyze their mutational signatures. I converted the MAF files to the ICGC simple somatic mutation format and merged them into a single file:

Project  Sample           ID  Genome  mut_type  chrom  pos_start  pos_end   ref  alt  Type
TCGA     TCGA-G3-AAV3-01  .   GRCh37  INS       10     32740800   32740801  -    A    SOMATIC
TCGA     TCGA-G3-AAV3-01  .   GRCh37  SNP       10     43292088   43292088  G    A    SOMATIC
TCGA     TCGA-G3-AAV3-01  .   GRCh37  SNP       10     48370493   48370493  G    A    SOMATIC
TCGA     TCGA-G3-AAV3-01  .   GRCh37  SNP       10     6504265    6504265   C    A    SOMATIC

When I run SigProfilerMatrixGenerator, it reports: "The given input files do not appear to be in the correct simple text format. Skipping this file: ......". After the run finished, SigProfilerMatrixGenerator had generated SBS/DBS/ID matrices for only about 3400 samples; the other 4000+ samples were omitted. I then used only those remaining 4000+ samples as input, and it again generated matrices for about 3400 samples.

I don't know what is causing this problem. I hope you can look into it.

BR, WSZ

mdbarnesUCSD commented 2 years ago

Hi @VictorZheng1010,

Please check whether any of the files have formatting issues that are preventing them (or the remaining samples) from being processed. For now, please generate matrices for subsets of samples and then combine the columns at the end to create an 8000+ column matrix.
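One way to look for formatting issues is to scan the input for rows whose field count differs from the header. The sketch below is only an illustration: it assumes the file is tab-delimited and has the 11 columns quoted in the question, and the helper names (find_malformed_lines, count_rows_per_sample) are hypothetical, not part of SigProfilerMatrixGenerator.

```python
from collections import Counter

# Assumption: 11 tab-separated columns, matching the header quoted in the question.
EXPECTED_COLUMNS = 11


def find_malformed_lines(lines, expected_columns=EXPECTED_COLUMNS, delimiter="\t"):
    """Return (line_number, field_count) for each non-empty line with the wrong field count."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\r\n")
        if not stripped:
            continue  # ignore blank lines
        field_count = len(stripped.split(delimiter))
        if field_count != expected_columns:
            problems.append((line_number, field_count))
    return problems


def count_rows_per_sample(lines, sample_column=1, delimiter="\t", has_header=True):
    """Count rows per sample ID (second column by default), skipping the header row."""
    counts = Counter()
    for index, line in enumerate(lines):
        if has_header and index == 0:
            continue
        fields = line.rstrip("\r\n").split(delimiter)
        # Rows that cannot be split far enough to reach the sample column are skipped.
        if len(fields) > sample_column and fields[sample_column]:
            counts[fields[sample_column]] += 1
    return counts
```

Both helpers accept any iterable of lines, so an open file handle can be passed directly. Comparing the number of keys returned by count_rows_per_sample with the number of columns in the generated matrices shows which sample IDs went missing.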

Thanks!
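The suggested workaround of combining per-subset matrices column-wise could be sketched as follows. This is a minimal illustration, assuming each matrix is a tab-separated text file whose first column holds the mutation type and whose remaining columns are samples; the function names are hypothetical and not part of the tool.

```python
import csv
import io


def read_matrix(handle, delimiter="\t"):
    """Parse a matrix into (row_label_header, sample_names, {mutation_type: [counts]})."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    rows = {row[0]: row[1:] for row in reader if row}
    return header[0], header[1:], rows


def combine_matrices(handles, delimiter="\t"):
    """Join matrices column-wise, requiring identical mutation types in every input."""
    label = None
    samples = []
    combined = {}
    for handle in handles:
        current_label, current_samples, rows = read_matrix(handle, delimiter)
        if label is None:
            # The first matrix fixes the row order of the combined result.
            label = current_label
            combined = {mutation_type: [] for mutation_type in rows}
        elif set(rows) != set(combined):
            raise ValueError("matrices do not share the same mutation types")
        duplicates = set(samples) & set(current_samples)
        if duplicates:
            raise ValueError(f"duplicate sample columns: {sorted(duplicates)}")
        samples.extend(current_samples)
        for mutation_type in combined:
            combined[mutation_type].extend(rows[mutation_type])
    return label, samples, combined


def write_matrix(label, samples, combined, handle, delimiter="\t"):
    """Write the combined matrix back out in the same tab-separated layout."""
    writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
    writer.writerow([label] + samples)
    for mutation_type, counts in combined.items():
        writer.writerow([mutation_type] + counts)
```

Rows are matched by mutation type rather than by position, so subsets whose rows are ordered differently still line up. Counts are kept as strings and written back unchanged.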