cortes-ciriano-lab / SComatic

A tool for detecting somatic variants in single cell data

How to estimate single-cell mutational burdens #65

Open xyzheng123 opened 2 months ago

xyzheng123 commented 2 months ago

Hi,

Thank you for developing this tool. I would like to use it to estimate mutational burdens at both the cell-type and single-cell levels. While there is some discussion regarding mutational burden at the cell-type level, I am wondering if you could explain how single-cell mutational burdens can be estimated using SComatic.

In the original article, you mentioned: "To estimate single-cell mutational burdens, we divided the number of mutations detected in each unique cell by the number of sites with a sequencing depth of at least one read, within the set of callable sites across all cells of the same type."
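The quoted definition can be written compactly. The symbols here are notation introduced for illustration only, not taken from the article: $m_c$ is the number of mutations detected in cell $c$, $S_t$ is the set of callable sites for the cell type $t$ that the cell belongs to, and $d_c(s)$ is the sequencing depth of cell $c$ at site $s$.

```latex
\[
  b_c \;=\; \frac{m_c}{\bigl|\{\, s \in S_t : d_c(s) \ge 1 \,\}\bigr|}
\]
```

So the denominator is specific to each cell: it counts only the callable sites of the cell type that this particular cell covers with at least one read.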

I'm not entirely sure which outputs I should use or what steps are required to transform SComatic's outputs in order to obtain "the number of mutations detected in each unique cell" and "the number of sites with a sequencing depth of at least one read, within the set of callable sites across all cells of the same type."

Thanks, Xiang

ArthurDondi commented 2 months ago

Hi,

Have a look at https://github.com/cortes-ciriano-lab/SComatic/blob/main/docs/OtherFunctionalities.md

The scripts you need there are SingleCellGenotype.py for the mutational burden and SitesPerCell.py for the callable sites.

You can find how to run it here: https://github.com/cortes-ciriano-lab/SComatic/blob/main/docs/SComaticExample.md

Before running SingleCellGenotype.py, first filter your BaseCellCalling.step2.tsv file so that it keeps only the mutations marked PASS in the FILTER column.

Let me know if it worked for you! Arthur

xyzheng123 commented 1 month ago

Hi @ArthurDondi,

Sorry for the late response, and thank you so much for your help. I was able to follow your steps and estimate the mutational burden at the single-cell level for the dataset I'm working with. I assumed that each unique cell barcode corresponds to a cell, and that each row in the SingleCellGenotype.py output that passed the filter represents a mutation. After filtering, I counted the mutations for each cell barcode. When I ranked the cells from the highest to the lowest count, I found that the highest number of mutations was only 6 (out of 445,222 callable sites). I'm not sure whether these numbers are too low. Do you have any insight into the typical number of mutations (and callable sites) per cell?
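The counting and ranking described above can be sketched in Python as follows. This is only an illustration under stated assumptions: it takes a plain list of cell barcodes (one entry per filtered mutation row) and a hypothetical dictionary of covered callable sites per barcode, rather than reading the actual SingleCellGenotype.py or SitesPerCell.py output files, whose column layout is not described in this thread. The barcodes and numbers are toy values.

```python
from collections import Counter


def mutations_per_cell(barcodes):
    """Count how many mutation rows each cell barcode appears in."""
    return Counter(barcodes)


def burden_per_cell(mutation_counts, sites_per_cell):
    """Divide each cell's mutation count by its number of covered callable sites.

    Cells with zero covered sites are skipped to avoid division by zero.
    """
    burdens = {}
    for barcode, n_sites in sites_per_cell.items():
        if n_sites > 0:
            burdens[barcode] = mutation_counts.get(barcode, 0) / n_sites
    return burdens


# Toy input (hypothetical): one barcode per mutation row that passed the filter.
barcodes = ["AAAC", "AAAC", "GGTT", "AAAC", "CCGA"]
# Toy input (hypothetical): callable sites covered by >= 1 read in each cell.
sites = {"AAAC": 30000, "GGTT": 10000, "CCGA": 20000, "TTTT": 5000}

counts = mutations_per_cell(barcodes)
ranked = counts.most_common()          # highest mutation count first
burdens = burden_per_cell(counts, sites)

print(ranked)
print(burdens)
```

Note that the denominator here is per cell, so the ranking by raw mutation count and the ranking by burden can differ when cells have very different coverage.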

Best, Xiang

ArthurDondi commented 1 month ago

You mean that the highest number of mutations present in a single cell was 6? Out of how many PASS mutations in total in BaseCellCalling.step2.tsv? You can check by running awk -F'\t' '{if ($6 == "PASS") {print $0}}' output.step4.2.tsv > only_PASS_output.step4.2.tsv on the command line (where output.step4.2.tsv is your BaseCellCalling.step2.tsv output file), and then counting the lines with wc -l only_PASS_output.step4.2.tsv
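For anyone who prefers Python to awk, the same filter can be sketched as below. The only thing taken from the command above is that the FILTER value is the 6th tab-separated field; the other columns and the toy rows are hypothetical placeholders, and the function works on a list of lines rather than on a real SComatic file.

```python
def keep_pass_rows(lines, filter_col=5):
    """Return the lines whose FILTER field (0-based index filter_col) is exactly 'PASS'.

    Like the awk command, this drops header lines and rows carrying any
    other FILTER value, because only an exact match is kept.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > filter_col and fields[filter_col] == "PASS":
            kept.append(line)
    return kept


# Toy rows (hypothetical columns; only the 6th field matters here).
rows = [
    "#CHROM\tStart\tEnd\tREF\tALT\tFILTER\n",
    "chr1\t100\t100\tA\tG\tPASS\n",
    "chr1\t200\t200\tC\tT\tLow_quality\n",
    "chr2\t300\t300\tG\tA\tPASS\n",
]

pass_rows = keep_pass_rows(rows)
print(len(pass_rows))  # same role as `wc -l` on the filtered file
```

To apply it to a real file, one could pass an open file handle instead of the toy list, since iterating over a text file yields its lines.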

In a recent analysis I had up to 85 mutations in a cell with 265011 reads, from a total of 341 PASS mutations, but the median was around 40 mutations per cell.

If I'm correct, your 445,222 callable sites are for the whole cell type, not for individual cells. For this cell type I had 99,771,226 callable sites, far more than you have, so 6 mutations in a single cell does not seem too bad. It depends on how many PASS mutations you have to start with.