High (False)-Discoveries

Dear user, please find below my answers to each of your points:

- While some of the reported mutations make sense, I'm struggling with a (likely) high false-positive rate. E.g. a gene is reported to be mutated in almost 100% of tumor samples, but reports from literature suggest a ≈ 10% mutation frequency is more adequate.

When you say that a gene is mutated in ~100% of samples, do you mean that all samples carry the same mutation in that gene, or different mutations in the same gene across samples? Do you consider all mutations or only high-impact ones (cancer analyses usually report frequencies of high-impact mutations)? To help me understand this better, could you also show me a few examples of these recurrent "false" variants across samples (header + lines from the output file produced by SComatic/scripts/BaseCellCalling/BaseCellCalling.step2.py)?
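To make the distinction above concrete, here is a minimal Python sketch that separates gene-level recurrence (any mutation in the gene) from site-level recurrence (the exact same mutation). The record layout (sample, gene, chrom, pos, ref, alt) is hypothetical and is not the BaseCellCalling.step2.py output format; gene names would have to come from a separate annotation step.

```python
from collections import defaultdict

def recurrence(calls, n_samples):
    """Return (gene_freq, site_freq).

    gene_freq: fraction of samples with at least one mutation in each gene.
    site_freq: fraction of samples sharing each exact (chrom, pos, ref, alt).
    """
    samples_per_gene = defaultdict(set)
    samples_per_site = defaultdict(set)
    for sample, gene, chrom, pos, ref, alt in calls:
        samples_per_gene[gene].add(sample)
        samples_per_site[(chrom, pos, ref, alt)].add(sample)
    gene_freq = {g: len(s) / n_samples for g, s in samples_per_gene.items()}
    site_freq = {k: len(s) / n_samples for k, s in samples_per_site.items()}
    return gene_freq, site_freq

# Toy, made-up calls (hypothetical layout, not real data)
calls = [
    ("S1", "GENE_A", "chr1", 100, "C", "T"),
    ("S2", "GENE_A", "chr1", 100, "C", "T"),
    ("S3", "GENE_A", "chr1", 250, "G", "A"),
    ("S1", "GENE_B", "chr2", 500, "A", "G"),
]
gene_freq, site_freq = recurrence(calls, n_samples=4)
```

A gene can look highly recurrent at the gene level while each individual site is seen in only a few samples; conversely, one site shared by nearly every sample is worth inspecting as a possible germline variant or systematic artefact.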

- Likewise, we observe mutations not called in WES data from the same samples.

These types of variants (scRNA-seq-specific variants) are not necessarily false-positive calls, as many other variables, such as cancer heterogeneity, purity, and WES coverage, play an important role (check Figures 2 and 3 of the bioRxiv manuscript). Nevertheless, we can investigate what is going on in this case. Could you tell me the mean coverage achieved in the WES data? One important thing to consider here is that scRNA-seq (such as 10x) usually covers regions that WES approaches do not capture, so could you check (for instance, by opening the WES BAM in IGV) whether you can see reads supporting the alternative alleles for a subset of the variants detected by scRNA-seq but not by your WES pipeline?

- I'll try using a more detailed cell type annotation next, e.g. non tumor type A, non tumor type B, tumor cell type... Are there additional parameters you would recommend to tweak?

Increasing the number of cell types helps to better detect systematic errors and germline variants. If you can increase this number, that would be great. Regarding parameters, we generated all our results with the parameters described in the "Example of how to run SComatic" section. Please use these parameters whenever possible.

- I ran the tool using the genome.fa file from Cellranger as reference. I included the PoN and RNA-editing files, included in this tool.

Could you confirm that this reference genome is the Hg38 version of the human genome with chromosome prefixes (chr1, chr2...)? This is crucial to be able to use both the PoN and RNA-editing files that we provided.
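As a quick way to check the chromosome naming, the following Python sketch lists which primary chromosomes are missing in chr-prefixed form from a FASTA file. The function names are hypothetical, and the check only looks at contig names: it does not confirm the assembly version itself.

```python
import io

# Primary human chromosomes in chr-prefixed form (chr1..chr22, chrX, chrY)
PRIMARY = [f"chr{c}" for c in list(range(1, 23)) + ["X", "Y"]]

def contig_names(fasta_handle):
    """Yield the contig name (first word after '>') of each FASTA record."""
    for line in fasta_handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]

def missing_primary_contigs(names):
    """Return the chr-prefixed primary chromosomes not found among names."""
    present = set(names)
    return [c for c in PRIMARY if c not in present]

# Toy in-memory FASTA: first contig lacks the prefix, second has it
toy = io.StringIO(">1 some description\nACGT\n>chr2\nACGT\n")
missing = missing_primary_contigs(contig_names(toy))
```

On a real reference this could be run as `with open("genome.fa") as fh: print(missing_primary_contigs(contig_names(fh)))`; an empty list means all primary chromosomes carry the chr prefix.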

Additional filters that might be helpful:

Create your own Panel of Normals (PoN). This will help you remove recurrent artefacts across samples or biases specific to your dataset. After running SComatic/scripts/BaseCellCalling/BaseCellCalling.step2.py with our PoN, you can re-run this script with your own PoN.
Estimate new parameters for the Beta-Binomial distribution. Although our Beta-Binomial parameters were estimated with many samples, other datasets might have different background noise levels, so you could try to estimate new alpha and beta parameters from your data. However, this computation requires a large number of tumour-free samples.
Ignore variants with a population frequency > 1% in the gnomAD database to remove undetected germline variants.
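The gnomAD filter can be sketched in Python as follows. This is only an illustration under assumptions: the variant dictionaries and the `gnomad_af` lookup (keyed by chrom, pos, ref, alt) are hypothetical, and in practice the allele frequencies would come from annotating your calls against a gnomAD release with the tool of your choice.

```python
def filter_common_variants(variants, gnomad_af, max_af=0.01):
    """Keep variants whose gnomAD allele frequency is at most max_af.

    Variants absent from gnomad_af are treated as AF = 0 and kept.
    """
    kept = []
    for v in variants:
        key = (v["chrom"], v["pos"], v["ref"], v["alt"])
        if gnomad_af.get(key, 0.0) <= max_af:
            kept.append(v)
    return kept

# Toy, made-up inputs (hypothetical layout, not real data)
variants = [
    {"chrom": "chr1", "pos": 100, "ref": "C", "alt": "T"},
    {"chrom": "chr1", "pos": 250, "ref": "G", "alt": "A"},
    {"chrom": "chr2", "pos": 500, "ref": "A", "alt": "G"},
]
gnomad_af = {
    ("chr1", 100, "C", "T"): 0.15,    # common in the population: dropped
    ("chr1", 250, "G", "A"): 0.0004,  # rare: kept
}
kept = filter_common_variants(variants, gnomad_af)
```

Treating unannotated variants as absent from the population is a design choice; if your annotation can fail silently, you may prefer to flag those variants instead of keeping them.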

Of course, feel free to contact me if you want to discuss your project in a more private chat (fmuyas@ebi.ac.uk).

Cheers, Fran
