AlexandrovLab / SigProfilerExtractor

SigProfilerExtractor allows de novo extraction of mutational signatures from data generated in a matrix format. The tool identifies the number of operative mutational signatures, their activities in each sample, and the probability for each signature to cause a specific mutation type in a cancer sample. The tool makes use of SigProfilerMatrixGenerator and SigProfilerPlotting.
BSD 2-Clause "Simplified" License

Strategies to improve cosine similarity per sample #222

Closed sainadfensi closed 11 months ago

sainadfensi commented 11 months ago

Hi,

Thank you for the development and maintenance of this great tool! I've been using the tool on a dataset of selected variants. Perhaps because the variants were selected, the cosine similarities for some patients are low. The lowest is about 0.13, which is too low, right?

Could you give me some advice on how to increase the cosine similarity? Thanks very much!

marcos-diazg commented 11 months ago

Hi @sainadfensi,

Thanks for reaching out! I'm not sure what you mean by selected variants, but cosine similarities at that level are extremely low, and are potentially related to having low numbers of mutations in your input files.

As described previously, the average cosine similarity between two random nonnegative vectors is 0.75 (Bergstrom et al. 2020). So, if you are getting values below that, it is probably better to exclude those samples, combine them into groups, or avoid unnecessary filtering. For example, although quality-control filtering of variant calls is strongly advised, it is not recommended to filter for pathogenic/driver mutations before extracting mutational signatures.
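To make the 0.75 reference point and the "exclude or combine" advice concrete, here is a minimal sketch in plain NumPy. It is not part of SigProfilerExtractor: the function names, the toy matrices, and the 0.75 cutoff used as a flagging threshold are all illustrative assumptions. It also assumes matrices laid out with mutation channels in rows and samples in columns, and it draws the random vectors from a uniform distribution, which is only one way to arrive at a baseline near 0.75.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity of two 1-D vectors; returns 0.0 if either is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def random_baseline(n_channels=96, n_pairs=10_000, seed=0):
    """Mean cosine similarity between pairs of random nonnegative vectors.

    Assumption: entries are drawn independently from Uniform(0, 1).
    """
    rng = np.random.default_rng(seed)
    x = rng.random((n_pairs, n_channels))
    y = rng.random((n_pairs, n_channels))
    sims = np.sum(x * y, axis=1) / (
        np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    )
    return float(sims.mean())


def flag_low_similarity(original, reconstructed, sample_names, threshold=0.75):
    """Return (name, similarity, mutation_count) for samples below the threshold.

    Both matrices are assumed to be channels x samples.
    """
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    flagged = []
    for j, name in enumerate(sample_names):
        sim = cosine_similarity(original[:, j], reconstructed[:, j])
        if sim < threshold:
            flagged.append((name, sim, int(original[:, j].sum())))
    return flagged


# Toy data (hypothetical): 4 channels x 3 samples.
# S1 has 100 mutations; S2 and S3 have only 1 and 2.
original = np.array([
    [10, 0, 1],
    [20, 1, 0],
    [30, 0, 0],
    [40, 0, 1],
])
reconstructed = np.array([
    [11, 1, 0],
    [19, 0, 1],
    [31, 0, 1],
    [39, 0, 0],
])
names = ["S1", "S2", "S3"]

baseline = random_baseline()
flagged = flag_low_similarity(original, reconstructed, names)

print(f"random baseline: {baseline:.3f}")
for name, sim, total in flagged:
    print(f"{name}: cosine similarity {sim:.2f}, {total} mutations")
```

In this toy example the two flagged samples are also the ones with very few mutations, which is the pattern described above. Samples flagged this way are candidates for exclusion or for pooling with related samples before rerunning the extraction.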

I hope this helps, and please feel free to reach out by email (mdiazgay@ucsd.edu) if you have further questions. Thanks again for your interest!

Best wishes,

Marcos