ChristofferFlensburg / superFreq

Analysis pipeline for cancer sequencing data
MIT License

Some variants are missing in river.tsv #114

Open pigyun906 opened 10 months ago

pigyun906 commented 10 months ago

Dear Christoffer,

I have noticed that some mutations, which are present in my VCF file, are not reflected in the river.tsv(xls) file.

I wonder why this happened and how I can solve it. For example, is there a manual way to incorporate the missed mutations into the clone clustering?

Thank you for your time and assistance.

Best regards, Jiyun

ChristofferFlensburg commented 10 months ago

SuperFreq roughly goes as follows:

1) sort variants into somatic, germline heterozygous, and other (noise, germline homozygous, ..)
2) use germline hets and read counts for CNA calling
3a) use the most reliable somatic variants and CNAs ("anchor mutations") to identify clones
3b) assign less reliable somatic variants and CNAs to identified clones if there is a good match

So for a somatic variant to make it into the river, it first needs to be identified as somatic in step 1, and then needs to either be of high enough quality (in terms of high read depth, high VAF, base/mapping quality, clean matching normals, etc) to be an anchor mutation in step 3a, or match an identified clone well enough to be added in step 3b.
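As a rough illustration of steps 3a and 3b, here is a toy sketch in Python. It is not superFreq's actual code (superFreq is an R package and its clone calling is considerably more involved). The `Variant` fields, the three thresholds, and the single-sample VAF distance used as the "good match" test are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    vaf: float      # variant allele frequency in the sample
    depth: int      # read depth at the site
    somatic: bool   # outcome of step 1 (somatic vs germline/noise)


# Hypothetical thresholds, chosen only for illustration.
MIN_ANCHOR_DEPTH = 50
MIN_ANCHOR_VAF = 0.10
MAX_VAF_DISTANCE = 0.05


def is_anchor(v):
    """A somatic variant reliable enough to seed a clone (step 3a)."""
    return v.somatic and v.depth >= MIN_ANCHOR_DEPTH and v.vaf >= MIN_ANCHOR_VAF


def mean_vaf(members):
    return sum(v.vaf for v in members) / len(members)


def build_river(variants):
    """Return (clones, dropped): clones as lists of variants, plus the
    somatic variants that matched no clone and so stay out of the river."""
    # Step 3a: anchors with similar VAF are grouped into clones.
    clones = []
    for v in sorted((v for v in variants if is_anchor(v)), key=lambda v: v.vaf):
        if clones and abs(mean_vaf(clones[-1]) - v.vaf) <= MAX_VAF_DISTANCE:
            clones[-1].append(v)
        else:
            clones.append([v])

    # Clone centres are fixed from the anchors only.
    centres = [mean_vaf(c) for c in clones]

    # Step 3b: less reliable somatic variants join a clone only on a good match.
    dropped = []
    for v in variants:
        if not v.somatic or is_anchor(v):
            continue
        best = min(range(len(clones)),
                   key=lambda i: abs(centres[i] - v.vaf),
                   default=None)
        if best is not None and abs(centres[best] - v.vaf) <= MAX_VAF_DISTANCE:
            clones[best].append(v)
        else:
            dropped.append(v)
    return clones, dropped


variants = [
    Variant("A", vaf=0.40, depth=200, somatic=True),   # anchor
    Variant("B", vaf=0.42, depth=150, somatic=True),   # anchor, same clone as A
    Variant("C", vaf=0.41, depth=12, somatic=True),    # low depth, matches the clone
    Variant("D", vaf=0.15, depth=10, somatic=True),    # low depth, no matching clone
    Variant("E", vaf=0.50, depth=300, somatic=False),  # not somatic, ignored
]
clones, dropped = build_river(variants)
```

In this toy input, variant D is called somatic but ends up in `dropped`: it is not good enough to be an anchor and no anchor-defined clone sits close to it, which is the situation described above.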

I know that it seems weird that a variant can be called as somatic, but then not included in the river. It's a sensitivity/accuracy compromise: we know that there will usually be at least a few false somatic calls, and if we were to force a clone out of every somatic variant, then we would also pick up a lot of noise clones, i.e. clones formed out of false variants. The 2-step approach in 3a and 3b is a compromise that drastically reduces the number of false clones called, while still retaining lower quality variants in the clonal tracking, but only if there are higher quality variants supporting a cell population. The potential issue is that we might lose real clones with important driver mutations if none of their mutations are classed as high quality anchor mutations, in particular in low mutation rate scenarios, such as most leukemias.

There's no support for whitelisting variants for the clonal tracking. You'd have to go in and edit or manually run the internal functions, which is possible but messy. I think a better approach would be to use the clonal tracking mostly for identifying clones, i.e. cell populations, and then use the somatic variant calls (in somaticVariants, or the output VCFs for example) to study driver mutations. That way you avoid issues in the step where mutations are assigned to clones.
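If it helps to see exactly which calls fall outside the clonal tracking, a small script along these lines could list the VCF variants that have no row in the river table. This is only a sketch: the river column names (`chr`, `pos`, `ref`, `alt`) are placeholders I made up, so check the header of your own river.tsv and pass the real names. Indels may also be represented differently in the two files, which this simple key comparison does not handle.

```python
import csv


def strip_chr(chrom):
    """Normalise 'chr1' and '1' to the same label."""
    return chrom[3:] if chrom.startswith("chr") else chrom


def vcf_keys(lines):
    """Collect (chrom, pos, ref, alt) keys from VCF text lines."""
    keys = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):  # one key per alternate allele
            keys.add((strip_chr(chrom), int(pos), ref, allele))
    return keys


def river_keys(lines, columns=("chr", "pos", "ref", "alt")):
    """Collect the same keys from a tab-separated river table.
    `columns` holds hypothetical column names; adjust to the real header."""
    chrom_col, pos_col, ref_col, alt_col = columns
    keys = set()
    for row in csv.DictReader(lines, delimiter="\t"):
        keys.add((strip_chr(row[chrom_col]), int(row[pos_col]),
                  row[ref_col], row[alt_col]))
    return keys


def missing_from_river(vcf_lines, river_lines, columns=("chr", "pos", "ref", "alt")):
    """Sorted list of VCF variants with no matching river row."""
    return sorted(vcf_keys(vcf_lines) - river_keys(river_lines, columns))


# Tiny in-memory example; with real files, pass open file handles instead.
vcf_example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "chr1\t100\t.\tA\tT\n",
    "chr2\t200\t.\tG\tC,A\n",
]
river_example = [
    "chr\tpos\tref\talt\n",
    "1\t100\tA\tT\n",
    "2\t200\tG\tC\n",
]
missing = missing_from_river(vcf_example, river_example)
```

The resulting list can then be looked up in the somatic variant calls to check quality and annotation for any driver candidates that did not make it into a clone.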

I hope that's of help.