Clinical-Genomics / BALSAMIC

Bioinformatic Analysis pipeLine for SomAtic Mutations In Cancer
https://balsamic.readthedocs.io/
MIT License

Make sure all final VCFs come with strand bias stats for each SNV/INDEL #415

Open hassanfa opened 4 years ago

hassanfa commented 4 years ago

Is your feature request related to a problem? Please describe. Variant callers infer the genotype from reads on both the positive and negative strands of paired-end data. The inferred genotype, and the evidence behind it (AD, DP, and similar fields), can differ between the two strands.

Describe the solution you'd like: Recalculate strand bias in the final VCF from the evidence in the BAM/CRAM file. Two methods for measuring strand bias can be used:

  1. Fisher's exact test
  2. Odds ratio
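
As a rough sketch of the two statistics listed above, assuming per-strand REF/ALT read counts have already been extracted from the BAM/CRAM for a single SNV. Python is used here only to show the arithmetic, not as a proposal for the implementation language. The function names and the 0.5 pseudocount (Haldane-Anscombe correction, to avoid division by zero) are assumptions for illustration, not part of BALSAMIC.

```python
from math import comb


def fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided Fisher's exact p-value for the 2x2 strand table
    [[ref_fwd, ref_rev], [alt_fwd, alt_rev]]."""
    ref_total = ref_fwd + ref_rev
    alt_total = alt_fwd + alt_rev
    fwd_total = ref_fwd + alt_fwd
    n = ref_total + alt_total
    if n == 0:
        return 1.0  # no reads, no evidence of bias

    denom = comb(n, fwd_total)

    def prob(a):
        # Hypergeometric probability of seeing `a` forward-strand REF reads
        # when all row and column totals are held fixed.
        return comb(ref_total, a) * comb(alt_total, fwd_total - a) / denom

    lo = max(0, fwd_total - alt_total)
    hi = min(ref_total, fwd_total)
    p_obs = prob(ref_fwd)
    # Sum every table that is at least as unlikely as the observed one.
    p = sum(q for q in (prob(a) for a in range(lo, hi + 1))
            if q <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def strand_odds_ratio(ref_fwd, ref_rev, alt_fwd, alt_rev, pseudo=0.5):
    """Odds ratio of the strand table, with a pseudocount added to each cell
    (an assumption made here so that zero counts do not divide by zero)."""
    return ((ref_fwd + pseudo) * (alt_rev + pseudo)) / (
        (ref_rev + pseudo) * (alt_fwd + pseudo)
    )


if __name__ == "__main__":
    # ALT reads seen only on the forward strand: a small p-value is expected.
    print(fisher_exact_two_sided(10, 10, 10, 0))
    print(strand_odds_ratio(10, 10, 10, 0))
```

A small p-value, or an odds ratio far from 1, indicates that ALT reads are distributed across strands differently from REF reads.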

Describe alternatives you've considered: Rely on the strand bias value reported by each variant caller.

Additional context: None for now.

Expected output for the feature: A field that clearly describes the SB value. Variant callers report different types of SB values (odds ratio, 1-SB, Fisher p-value, etc.), so the final VCF should carry one unified value that is ready for Scout.

Current BALSAMIC version 5.1.0

hassanfa commented 4 years ago

Python libraries are poorly optimized for reading and writing BAM/VCF files. The solution should NOT be implemented in Python (pysam, PyVCF, etc.). Investigate other solutions.

hassanfa commented 4 years ago

Things to do:

  1. Only for SNVs for now (INDELs are a hard problem)
  2. Don't use pandas (Python). It is extremely slow and can't use multithreading. Alternative: data.table in R
  3. Update to GATK4
  4. Simplify normalization of variants: this should be easy given that INDELs will be excluded

hassanfa commented 3 years ago

Solutions:

  1. Use https://github.com/hassanfa/VCFmerge (tested and works, although it is slow).
  2. Use vcfanno and a lite version of VCFmerge to generate a final somatic VCF.

pbiology commented 1 year ago

Refinement 2022-12-13: