How to handle 10x multiplexed scRNA-seq runs

Thapeachydude commented 1 year ago

Hello,

I've read the preprint with great interest and wanted to give the tool a try. We have 10x 5' scRNA-seq data where multiple individuals were pooled together and an individual never occurred twice in a pool. I was wondering how best to run the tool?

Can I directly supply the bam file for the full multiplexed pool or would I have to split off the barcodes for each individual?
If multiple samples were analysed (in different sample pools) I guess the tool would have to be run separately for every pool?
How detailed does the cell type annotation have to be? Is it sufficient to stratify into "tumor cells" and "non tumor cells", or is a more fine-grained annotation better?

Many thanks!

Francesc-Muyas commented 1 year ago

Dear user, First of all, thanks for using SComatic. Please find below the answers to all your questions:

Can I directly supply the bam file for the full multiplexed pool or would I have to split off the barcodes for each individual?

Nice question. So far we do not work with the full multiplexed pool bams. Ideally, you should split the original bam file for each pool (or individual) to avoid problems with cell barcodes and take maximum advantage of SComatic.
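
As a rough sketch of the first step of such a split, assuming your demultiplexing tool produced a table assigning each cell barcode to an individual, the barcodes can be grouped per individual and written to one list per individual. The function names, the (barcode, individual) input format, and the "doublet" / "unassigned" labels are assumptions made for illustration; they are not part of SComatic.

```python
from collections import defaultdict


def group_barcodes_by_individual(assignments, skip_labels=("doublet", "unassigned")):
    """Group cell barcodes by the individual they were assigned to.

    `assignments` is an iterable of (barcode, individual) pairs, e.g. parsed
    from the output of a demultiplexing tool. The labels in `skip_labels`
    are hypothetical names for cells that could not be assigned.
    """
    groups = defaultdict(list)
    for barcode, individual in assignments:
        if individual in skip_labels:
            continue  # ambiguous cells should not end up in any per-individual bam
        groups[individual].append(barcode)
    return dict(groups)


def write_barcode_lists(groups, prefix="barcodes"):
    """Write one barcode list per individual and return the file names."""
    paths = {}
    for individual, barcodes in groups.items():
        path = f"{prefix}.{individual}.txt"
        with open(path, "w") as handle:
            handle.write("\n".join(barcodes) + "\n")
        paths[individual] = path
    return paths
```

Each list can then be used to subset the pooled bam by its CB tag. Recent samtools versions offer a tag-file filter for this (`samtools view -D CB:<file>`), but please check this against the version you have installed.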

If multiple samples were analysed (in different sample pools) I guess the tool would have to be run separately for every pool?

Ideally yes, especially to avoid problems with different cells (and different cell types) having the same cell barcode. However, if you have different samples from the same individual (same biopsy of interest) and their corresponding cell barcode annotations, you could (1) merge these bam files into a single one, (2) remove the duplicated cell barcodes from the metadata file passed via the --meta parameter of the Python script scripts/SplitBam/SplitBamCellTypes.py, and (3) run SComatic as a single sample.
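
Step (2) could look roughly like the sketch below. It assumes the metadata file is tab-separated and that the barcode column is called "Index"; both are assumptions here and should be checked against your own file. Note that every row of a duplicated barcode is dropped, since after merging it is no longer clear which cell such a barcode refers to.

```python
import csv
from collections import Counter


def drop_duplicated_barcodes(rows, barcode_column="Index"):
    """Remove every row whose cell barcode occurs more than once.

    After merging bam files, a barcode seen in several samples is ambiguous,
    so all of its rows are dropped rather than keeping the first one.
    """
    rows = list(rows)
    counts = Counter(row[barcode_column] for row in rows)
    return [row for row in rows if counts[row[barcode_column]] == 1]


def clean_metadata(in_path, out_path, barcode_column="Index"):
    """Read a tab-separated metadata file and write a deduplicated copy."""
    with open(in_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        fieldnames = reader.fieldnames
        rows = drop_duplicated_barcodes(reader, barcode_column)
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)  # number of cells kept
```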

How detailed does the cell type annotation have to be? Is it sufficient to stratify into "tumor cells" and "non tumor cells", or is a more fine-grained annotation better?

Although the tool can be run with only two cell types (e.g. "tumor cells" and "non tumor cells", as you mentioned), I would recommend running it with more detailed cell type annotations. Comparing more cell types helps to better recognise systematic errors (mapping and/or sequencing errors).

Cheers, Fran

Thapeachydude commented 1 year ago

Hi Fran,

many thanks for the quick reply. I now ran the tool for multiple sample pools. Each pool was run independently, so barcode collisions were impossible. While some of the reported mutations make sense, I'm struggling with a (likely) high false-positive rate. E.g. a gene is reported to be mutated in almost 100% of tumor samples, but reports from literature suggest a ≈ 10% mutation frequency is more adequate. Likewise, we observe mutations not called in WES data from the same samples.

I'll try using a more detailed cell type annotation next, e.g. non tumor type A, non tumor type B, tumor cell type... Are there additional parameters you would recommend to tweak?

I ran the tool using the genome.fa file from Cellranger as reference. I also used the PoN and RNA-editing files provided with this tool.

Cheers, M.

Francesc-Muyas commented 1 year ago

Dear user, Please find below the answers to each of your points:

While some of the reported mutations make sense, I'm struggling with a (likely) high false-positive rate. E.g. a gene is reported to be mutated in almost 100% of tumor samples, but reports from literature suggest a ≈ 10% mutation frequency is more adequate.

When you say that a gene is mutated in ~100% of samples, do you mean that all samples carry the same mutation in that gene, or different mutations in the same gene across samples? Do you consider all mutations or only high-impact ones (cancer analyses usually report frequencies of high-impact mutations)? To help me understand this better, could you also show me a few examples of these recurrent "false" variants across samples (header + lines from the output file obtained by SComatic/scripts/BaseCellCalling/BaseCellCalling.step2.py)?

Likewise, we observe mutations not called in WES data from the same samples.

These types of variants (scRNA-seq-specific variants) are not necessarily false-positive calls, as many other variables such as cancer heterogeneity, purity, and coverage in WES play an important role (check Figures 2 and 3 from the bioRxiv manuscript). Nevertheless, we can investigate what is going on in this case. Could you tell me the mean coverage achieved in the WES data? One important thing to consider here is that scRNA-seq (like 10x) usually spans regions not covered by WES approaches, so could you check (for instance by opening the WES bam in IGV) whether you can see reads supporting the alternative alleles for a subset of the variants detected by scRNA-seq and not by your WES pipeline?

I'll try using a more detailed cell type annotation next, e.g. non tumor type A, non tumor type B, tumor cell type... Are there additional parameters you would recommend to tweak?

Increasing the number of cell types helps to better detect systematic errors and germline variants. If you can increase this number, that would be awesome. Regarding parameters, we generated all our results with the parameters described in the Example of how to run SComatic section. Please use these parameters whenever possible.

I ran the tool using the genome.fa file from Cellranger as reference. I also used the PoN and RNA-editing files provided with this tool.

Could you confirm that this reference genome is the Hg38 version of the human genome with chromosome prefixes (chr1, chr2...)? This is crucial to be able to use both the PoN and RNA-editing files that we provided.
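
A quick way to check the prefixes is to look at the sequence names in the FASTA headers. The sketch below is a minimal example with hypothetical function names; it only checks for the "chr" naming and cannot tell you whether the assembly itself is Hg38.

```python
def fasta_contig_names(fasta_path):
    """Return the sequence names found in the header lines of a FASTA file."""
    names = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # The name is the first whitespace-separated token after '>'
                fields = line[1:].split()
                if fields:
                    names.append(fields[0])
    return names


def uses_chr_prefix(names, expected=("chr1", "chr2", "chrX")):
    """True if all of the expected 'chr'-prefixed names are present."""
    present = set(names)
    return all(name in present for name in expected)
```

For example, `uses_chr_prefix(fasta_contig_names("genome.fa"))` should return True for a reference with chr1, chr2... names. Scanning a whole genome FASTA takes a while; if a `.fai` index exists next to it, its first column lists the same names and is much faster to read.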

Additional filters that might be helpful:

Create your own Panel of Normals. This will help you to remove recurrent artefacts across samples or biases specific to your dataset. Once you have run SComatic/scripts/BaseCellCalling/BaseCellCalling.step2.py with our PoN, you can re-run this script with your own PoN.
Estimate new parameters for the Beta-Binomial distribution. Although our Beta-Binomial parameters were estimated with many samples, other datasets might have different background noise, so you could try to estimate new alpha and beta parameters with your data. However, this computation requires a high number of tumour-free samples.
Ignore variants with > 1% population frequency in the gnomAD database to remove undetected germline variants.
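
The gnomAD filter could be sketched as follows, assuming you have already annotated your calls with gnomAD allele frequencies using a tool of your choice. The data structures and function names are hypothetical: the tuples stand in for the SComatic calls and the dictionary for an annotation lookup. Variants absent from gnomAD are kept.

```python
def passes_gnomad_filter(population_af, max_af=0.01):
    """Keep a variant unless its gnomAD allele frequency exceeds `max_af`.

    `population_af` is None when the variant is absent from gnomAD; such
    variants are kept.
    """
    return population_af is None or population_af <= max_af


def filter_common_variants(variants, gnomad_af, max_af=0.01):
    """Drop variants with a population frequency above the threshold.

    `variants` is an iterable of (chrom, pos, ref, alt) tuples and
    `gnomad_af` maps the same tuples to allele frequencies (hypothetical
    structures, used here only for illustration).
    """
    return [v for v in variants if passes_gnomad_filter(gnomad_af.get(v), max_af)]
```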

Of course, feel free to contact me if you want to discuss your project in a more private chat (fmuyas@ebi.ac.uk).

Cheers, Fran

cortes-ciriano-lab / SComatic

How to handle 10x multiplexed scRNA-seq runs #1