cortes-ciriano-lab / SComatic

A tool for detecting somatic variants in single cell data
Other
173 stars 28 forks source link

pysam.pileup low max depth leads to FP #63

Closed ArthurDondi closed 2 months ago

ArthurDondi commented 3 months ago

pysam.pileup max_depth argument is by default 8000, however for RNA-seq there are edge cases as below, where a germline intronic SNP in a strongly expressed gene will appear as a false-positive somatic SNV:

Capture d’écran (170)

Top row is the cancer cells sample, mutated with 14% VAF and 19% CCF (max depth in exon ~17000) and the bottom row is the non-cancer cells sample with 4% VAF and 6% CCF (max depth in exon ~70000). This looks like a germline SNP.

With the current max_depth=8000 I got a PASS mutation because only a fraction of the 475 reads are processed : chr11 61965525 61965525 G A PASS Cancer AAGGG AACAC 1 25 15 4 4 0.16 0.2667 0.0001 0.0 2 2 1;115;0.1465 1;92;0.1055 . PASS DP|NC|CC|BC|BQ|BCf|BCr 94|81|1:0:0:79:0:0|1:0:0:92:0:0|93:0:0:8556:0:0|0:0:0:14:0:0|1:0:0:78:0:0 25|15|4:0:0:13:0:0|4:0:0:21:0:0|372:0:0:1953:0:0|0:0:0:2:0:0|4:0:0:19:0:0

If I raise it to, say, max_depth=200000 it's a Cell_type_noise, as it should: chr11 61965525 61965525 G A,A Cell_type_noise NonCancer,Cancer AAGGG AACAC 1 479,44 239,26 19,6 15,5 0.0397,0.1364 0.0628,0.1923 0.0017,0.0 0.0,0.0 2 2 1;498;0.3121 1;245;0.2041 . BetaBin_problem,PASS DP|NC|CC|BC|BQ|BCf|BCr 479|239|15:1:0:233:0:0|19:1:0:453:0:0|1767:93:0:42129:0:0|0:0:0:48:0:0|19:1:0:405:0:0 44|26|5:0:0:25:0:0|6:0:0:37:0:0|558:0:0:3441:0:0|0:0:0:3:0:0|6:0:0:34:0:00

pysam.pileup occurs in BaseCellCounter.py and SingleCellGenotype.py. Let me know if you're interested in a PR. There is also a filter bug in basecallingstep1 as signaled in #16.

velten-lab commented 3 months ago

Thanks for spotting this! I agree that it's important to document that SComatic uses a default max_depth of 8000, and to add command line options to BaseCellCounter.pyand SingleCellGenotype.py that allow to modify max_depth. We're working with (very deeply sequenced) mission bio data, and increasing max depth made a day-and-night difference for us.

ArthurDondi commented 3 months ago

For their defense, it's a samtools/pysam default behaviour that is quite unexpected (at least to me).

It can have a big impact on very deep sequencing but also in low-depth + high-cell count when using SingleCellGenotype.py

isidroc commented 3 months ago

Dear ArthurDondi , Thanks for spotting this issue. Yes, a PR would be great, thanks a lot

Francesc-Muyas commented 2 months ago

Dear users, Thanks for this interesting finding. We will accept the PR in the coming days.

Thanks again for your feedback, Fran

isidroc commented 2 months ago

We have now accepted the PR. Thanks!