Clinical-Genomics / BALSAMIC

Bioinformatic Analysis pipeLine for SomAtic Mutations In Cancer
https://balsamic.readthedocs.io/
MIT License

create artificial truth-set for validation and variant-calling optimisation #1049

Open mathiasbio opened 1 year ago

mathiasbio commented 1 year ago

Is your feature request related to a problem? Please describe.

At the moment, many validation samples are used to verify the quality of new BALSAMIC releases. However, the list of known variants appears to be quite limited (~100), which makes it difficult to get a true measure of the pipeline's sensitivity, and precision is currently not measured at all. Due to these limitations, it is also difficult to evaluate how changes to filtering affect variant-calling quality from one BALSAMIC version to the next.

Describe the solution you'd like

An artificial truth set could be created with a tool like Bamsurgeon, which takes a BAM file and an optional BED file of regions within which to insert random variants. It is possible to specify the variant type (SNVs or InDels), a VAF range, and of course the number of variants to insert.
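As a rough illustration of the inputs described above, here is a minimal Python sketch that picks random SNV positions inside a set of BED regions and assigns each a VAF drawn from a given range. The function name `sample_snv_sites`, the example regions, and the output layout (chrom, start, end, VAF) are assumptions made for illustration; the exact file format and coordinate convention a spike-in tool expects should be checked against that tool's documentation.

```python
import random


def sample_snv_sites(regions, n_variants, vaf_range=(0.05, 0.5), seed=0):
    """Pick random SNV sites inside BED-style regions.

    regions    : list of (chrom, start, end), 0-based half-open intervals
    n_variants : number of distinct sites to sample
    vaf_range  : (min, max) variant allele frequency to draw from
    Returns a sorted list of (chrom, start, end, vaf) rows.
    """
    rng = random.Random(seed)
    lengths = [end - start for _, start, end in regions]
    if n_variants > sum(lengths):
        raise ValueError("more variants requested than positions available")

    sites = set()
    while len(sites) < n_variants:
        # Weight regions by length so every base is equally likely.
        chrom, start, end = rng.choices(regions, weights=lengths, k=1)[0]
        sites.add((chrom, rng.randrange(start, end)))

    rows = []
    for chrom, pos in sorted(sites):
        vaf = round(rng.uniform(*vaf_range), 3)
        rows.append((chrom, pos, pos + 1, vaf))
    return rows


if __name__ == "__main__":
    # Hypothetical target regions, for illustration only.
    example_regions = [("chr1", 1000, 2000), ("chr2", 500, 800)]
    for row in sample_snv_sites(example_regions, 5, vaf_range=(0.05, 0.3)):
        print("\t".join(str(field) for field in row))
```

Fixing the seed keeps the sampled sites reproducible, which would make it easier to regenerate the same truth set for each data type.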

If such a truth set is desired, it might be best to generate one per data type: TGA, exome, and WGS.

An additional recommendation is to use a deeply sequenced normal sample as the base for inserting variants, because tumor samples, and even germline cell-line reference samples, contain naturally occurring low-frequency mutations that are not included in the truth set and therefore can't be evaluated.

For this evaluation, the call set should not be filtered by any criteria related to clinical interpretation.
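To make the evaluation concrete, here is a minimal Python sketch of how sensitivity and precision could be computed once a truth set exists. It assumes variants are represented as `(chrom, pos, ref, alt)` tuples and matched exactly; the function name `evaluate_calls` and the example variants are hypothetical, and real comparisons would likely need variant normalisation first. Precision is only meaningful here if the base sample contains no real variants outside the truth set, which is the reason for starting from a normal sample.

```python
def evaluate_calls(truth, calls):
    """Compare a call set against a truth set by exact variant match.

    truth, calls : iterables of (chrom, pos, ref, alt) tuples
    Returns counts plus sensitivity (TP / (TP + FN)) and
    precision (TP / (TP + FP)); a metric is NaN if undefined.
    """
    truth, calls = set(truth), set(calls)
    tp = len(truth & calls)   # inserted variants that were called
    fn = len(truth - calls)   # inserted variants that were missed
    fp = len(calls - truth)   # calls not present in the truth set
    sensitivity = tp / (tp + fn) if truth else float("nan")
    precision = tp / (tp + fp) if calls else float("nan")
    return {
        "tp": tp,
        "fn": fn,
        "fp": fp,
        "sensitivity": sensitivity,
        "precision": precision,
    }


if __name__ == "__main__":
    # Hypothetical variants, for illustration only.
    truth_set = [
        ("chr1", 1050, "A", "T"),
        ("chr1", 1500, "G", "C"),
        ("chr2", 600, "C", "A"),
        ("chr2", 750, "T", "G"),
    ]
    call_set = truth_set[:3] + [("chr2", 790, "A", "G")]
    print(evaluate_calls(truth_set, call_set))
```

Running the same comparison on the output of two BALSAMIC versions would show how a change in filtering shifts both metrics.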

vwirta commented 1 year ago

@mathiasbio have you seen Mutacc? Would be interesting to see if that could be adapted for somatic variants as well.

mathiasbio commented 1 year ago

I didn't know this tool existed. If we decide we want to create a synthetic truth set, I'll keep it in mind and evaluate which tool would best suit our needs. It's been a couple of years since I last used Bamsurgeon too, so maybe there are better tools available now!