AlexandrovLab / SigProfilerMatrixGenerator

SigProfilerMatrixGenerator creates mutational matrices for all types of somatic mutations. It allows downsizing the generated mutations to only parts of the genome (e.g., exome or a custom BED file). The tool seamlessly integrates with other SigProfiler tools.
BSD 2-Clause "Simplified" License

mysterious hyphens when processing INDELs from ICGC data #159

Open mattiyeh opened 11 months ago

mattiyeh commented 11 months ago

https://github.com/AlexandrovLab/SigProfilerMatrixGenerator/blob/f945199230a4fc0671d90a7873b079930a84d227/SigProfilerMatrixGenerator/scripts/convert_input_to_simple_files.py#L332C10-L332C10

Hello, why are hyphens added to ref and mut here when the other functions don't do anything similar? This breaks downstream: the hyphens are added again in MutationMatrixGenerator.py (lines 1176-1179), and you can then get a KeyError at line 1617 in revcompl(type_sequence), because the '-' character is not in the revcompl map.
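A minimal sketch of the failure mode described above, not the library's actual code. The complement map and the names revcompl_sketch and strip_placeholder are hypothetical stand-ins; the real revcompl in MutationMatrixGenerator.py may be implemented differently. It only illustrates how a '-' in the sequence leads to a KeyError when the lookup table has no entry for it.

```python
# Hypothetical complement table; the real map in MutationMatrixGenerator.py may differ.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def revcompl_sketch(seq):
    """Reverse-complement seq; raises KeyError for any character not in COMPLEMENT."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def strip_placeholder(allele):
    """Remove '-' placeholder characters, leaving only the real bases (illustrative only)."""
    return allele.replace("-", "")


# A sequence that still carries a '-' placeholder fails the dictionary lookup.
try:
    revcompl_sketch("-AC")
    raised = False
except KeyError:
    raised = True

print(raised)                                        # True
print(revcompl_sketch(strip_placeholder("-AC")))     # GT
```

Stripping the placeholder here is only meant to show where the error comes from. Whether the hyphens should be removed, or never added in the first place, depends on what the downstream code expects.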

I fixed this by commenting out the lines in convert_input_to_simple_files.py, but can someone explain whether this will have unintended consequences?

Thanks, Marc

mdbarnesUCSD commented 10 months ago

Hi @mattiyeh,

Thanks for reaching out again about the issue you encountered with ICGC input files. It would be a great help if you could please provide an input file to reproduce the issue you identified. Thanks!

mattiyeh commented 10 months ago

Hi Mark,

Sure, here is a sample input file.

stomach_indel_mutations.txt

Thanks, Marc