Closed Thapeachydude closed 1 year ago
Dear user, First of all, thanks for using SComatic. Please find below the answers to all your questions:
Nice question. So far we do not work with full multiplexed pool BAM files. Ideally, you should split the original BAM file for each pool by individual, to avoid problems with cell barcodes and take maximum advantage of SComatic.
Ideally yes, especially to avoid problems with different cells (and different cell types) having the same cell barcode. However, if you have different samples from the same individual (same biopsy of interest) and their corresponding cell barcode annotations, you could (1) merge these BAM files into a single one, (2) remove the duplicated cell barcodes from the metadata file passed via the --meta parameter of the Python script scripts/SplitBam/SplitBamCellTypes.py, and (3) run SComatic as a single sample.
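To illustrate step (2), here is a minimal sketch of the barcode clean-up. It is not part of SComatic. It assumes the metadata file is tab-separated with a header row and that the barcode column is named `Index` (adjust `barcode_col` if yours differs). The function name `drop_duplicated_barcodes` is made up for this example. It drops every copy of a barcode that appears more than once, on the assumption that such barcodes can no longer be assigned to a single cell after merging.

```python
import csv
from collections import Counter


def drop_duplicated_barcodes(meta_in, meta_out, barcode_col="Index"):
    """Copy a tab-separated metadata table, skipping barcodes seen more than once.

    Assumption: `barcode_col` names the column holding the cell barcodes.
    Returns the number of rows that were dropped.
    """
    with open(meta_in, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        fieldnames = reader.fieldnames
        rows = list(reader)

    # Count how often each barcode occurs across the merged annotations.
    counts = Counter(row[barcode_col] for row in rows)

    # Keep only barcodes that are unique; ambiguous ones are removed entirely.
    kept = [row for row in rows if counts[row[barcode_col]] == 1]

    with open(meta_out, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(kept)

    return len(rows) - len(kept)
```

The cleaned file would then be the one given to --meta when splitting the merged BAM file.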
Although the tool can be run with only two cell types (e.g. "tumor cells" and "non tumor cells", as you mentioned), I would recommend running it with more detailed cell type annotations. Comparing more cell types helps to better recognise systematic errors (mapping and/or sequencing errors).
Cheers, Fran
Hi Fran,
many thanks for the quick reply. I have now run the tool on multiple sample pools. Each pool was run independently, so barcode collisions were impossible. While some of the reported mutations make sense, I'm struggling with a (likely) high false-positive rate. E.g. a gene is reported to be mutated in almost 100% of tumor samples, but reports from the literature suggest that a mutation frequency of ≈ 10% is more realistic. Likewise, we observe mutations that were not called in WES data from the same samples.
I'll try using a more detailed cell type annotation next, e.g. non tumor type A, non tumor type B, tumor cell type... Are there additional parameters you would recommend tweaking?
I ran the tool using the genome.fa file from Cell Ranger as the reference. I included the PoN and RNA-editing files provided with this tool.
Cheers, M.
Dear user, Please find below the answers to each of your points:
When you say that a gene is mutated in ~100% of samples, do you mean that all samples carry the same mutation in the same gene, or different mutations in the same gene across samples? Do you consider all mutations or only high-impact ones (cancer analyses usually report frequencies of high-impact mutations)? To help me understand this question, could you also show me a few examples of these recurrent "false" variants across samples (header + lines from the output file obtained by SComatic/scripts/BaseCellCalling/BaseCellCalling.step2.py)?
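To make the first distinction concrete, the sketch below counts in how many samples the exact same variant (same chromosome, position, reference and alternative allele) was called. It is only an illustration: the helper names and the input structure (a dictionary from sample name to a collection of variant tuples) are hypothetical, and parsing the step2 output into that structure is left out. A variant shared by nearly all samples would point to one recurrent site, whereas a gene hit at many different positions would not show up here.

```python
from collections import defaultdict


def variant_recurrence(calls_by_sample):
    """Map each (chrom, pos, ref, alt) tuple to the set of samples carrying it."""
    samples_by_variant = defaultdict(set)
    for sample, variants in calls_by_sample.items():
        for variant in variants:
            samples_by_variant[variant].add(sample)
    return samples_by_variant


def recurrent_variants(calls_by_sample, min_fraction=0.9):
    """Return variants called in at least `min_fraction` of the samples."""
    n_samples = len(calls_by_sample)
    if n_samples == 0:
        return []
    recurrence = variant_recurrence(calls_by_sample)
    return sorted(
        variant
        for variant, samples in recurrence.items()
        if len(samples) / n_samples >= min_fraction
    )


# Toy input with made-up coordinates, only to show the expected structure.
toy_calls = {
    "pool1": {("chr1", 100, "A", "G"), ("chr2", 500, "C", "T")},
    "pool2": {("chr1", 100, "A", "G")},
    "pool3": {("chr1", 100, "A", "G"), ("chr2", 777, "G", "A")},
}
shared = recurrent_variants(toy_calls, min_fraction=0.9)
```

Here `shared` contains only the chr1 variant, since it is the only one present in all three toy samples.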
These types of variants (scRNA-seq-specific variants) are not necessarily false positive calls, as many other variables such as cancer heterogeneity, purity, and coverage in WES play an important role (check Figures 2 and 3 from the bioRxiv manuscript). Nevertheless, we can investigate what is going on in this case. Could you tell me the mean coverage achieved in the WES data? One important thing to consider here is that scRNA-seq (like 10x) usually extends into regions not covered by WES approaches. Could you check (for instance by opening the WES BAM file in IGV) whether you can see reads supporting the alternative alleles for a subset of the variants detected by scRNA-seq but not by your WES pipeline?
Increasing the number of cell types helps to better detect systematic errors and germline variants. If you can increase this number, that would be awesome. Regarding parameters, we generated all our results with the parameters described in the "Example of how to run SComatic" section. Please use these parameters whenever possible.
Could you confirm that this reference genome is the Hg38 version of the human genome with chromosome prefixes (chr1, chr2...)? This is crucial to be able to use both the PoN and RNA-editing files that we provided.
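A quick way to check the chromosome naming is to read the sequence headers of the FASTA file. The sketch below is not part of SComatic and its function names are made up for this example. It only looks at whether the contig names use the `chr1` style or the `1` style; it cannot confirm the genome build itself.

```python
def contig_names(fasta_path):
    """Return the sequence names from the '>' header lines of a FASTA file."""
    names = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    # The name is the first whitespace-separated token.
                    names.append(fields[0])
    return names


def chr_prefix_style(names):
    """Classify naming by looking for chromosome 1: 'chr', 'no_prefix' or 'unknown'."""
    if "chr1" in names:
        return "chr"
    if "1" in names:
        return "no_prefix"
    return "unknown"
```

Running `chr_prefix_style(contig_names("genome.fa"))` on the reference used for alignment should give `"chr"` if the names match the prefixed style mentioned above. Only chromosome 1 is inspected, because unplaced scaffolds often have names without the prefix even in a prefixed reference.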
Of course, feel free to contact me if you want to discuss your project in a more private chat (fmuyas@ebi.ac.uk).
Cheers, Fran
Hello,
I've read the preprint with great interest and wanted to give the tool a try. We have 10x 5' scRNA-seq data where multiple individuals were pooled together and an individual never occurred twice in a pool. I was wondering how best to run the tool?
Many thanks!