genetronhealth / uvc

UVC, a very accurate small-variant caller (https://doi.org/10.1093/bib/bbab458)
BSD 3-Clause "New" or "Revised" License

Parameters or preprocessing steps to improve uvc performance #12

Open BrettLiddell opened 6 months ago

BrettLiddell commented 6 months ago

Hello,

I used Mutect2, VarDict, and VarScan to call variants on matched tumour-normal BAM files containing reads that were UMI-collapsed with fgbio's CallDuplexConsensusReads and stitched using Illumina's Pisces/Gemini. I ran UVC on reads that had UMIs extracted and placed in each read's QNAME before being realigned to hg19 (but with no UMI collapsing or read stitching).

The variants called by Mutect2 (with FilterMutectCalls applied) align well with the gold standard list of variants for this matched tumour-normal pair. The variants predicted by VarScan with basic filters applied (VAF > 0.05, somatic_status=Somatic) also align well with this gold standard list. VarDict marks many variants as passing, but also captures many of the gold standard variants.

UVC is capturing 3 fewer variants than Mutect2/VarDict and 5 fewer variants than VarScan. I am wondering if I should change any parameters when running UVC or perform additional steps in the preprocessing of my tumour-normal BAM files. Below is an outline of my current workflow:

Current preprocessing steps:

  1. picard FastqToSam
  2. fgbio ExtractUMIsFromBam (get reads in originalName#UMI format)
  3. picard SamToFastq
  4. bwa (for alignment)

Current uvc script:

uvc='/software/uvc/0.14.2.15f4adc/bin/uvcTN.sh'
export UVC_BIN_EXE_FULL_NAME=/software/uvc/0.14.2.15f4adc/bin/uvc-1

$uvc ${REF_GENOME} ${TUMOUR_BAM} ${NORMAL_BAM} \
    ${OUTPUT_PATH}/colo829_uvc.vcf "COLO829_S8,COLO829_BL_S7"

Apologies for the long description, thank you for your advice.

genetronhealth commented 6 months ago

Hi,

Thank you for trying out UVC.

You can try a less stringent QUAL threshold to retain more variants. For example, use (bcftools view -i "QUAL>=50" ${OUTPUT_PATH}/colo829_uvc.vcf) instead of (bcftools view -i "QUAL>=60" ${OUTPUT_PATH}/colo829_uvc.vcf); the latter is equivalent to (bcftools view -fPASS). Please note that lowering the threshold also increases the number of false positive calls.
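For readers who want to see what the threshold change does without bcftools at hand, here is a minimal Python sketch of the same idea: keep header lines and any record whose QUAL column meets a cutoff. The function name `filter_by_qual`, the example records, and the filter label `q_fail` are made up for illustration; they are not UVC output.

```python
def filter_by_qual(vcf_lines, min_qual):
    """Yield header lines unchanged plus records whose QUAL >= min_qual."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]  # QUAL is the 6th column of a VCF record
        if qual != "." and float(qual) >= min_qual:
            yield line


# Toy records (hypothetical values; "q_fail" is a placeholder filter label).
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tT\t72\tPASS\t.\n",
    "chr1\t200\t.\tG\tC\t55\tq_fail\t.\n",
    "chr1\t300\t.\tC\tA\t31\tq_fail\t.\n",
]

kept_60 = [l for l in filter_by_qual(example_vcf, 60) if not l.startswith("#")]
kept_50 = [l for l in filter_by_qual(example_vcf, 50) if not l.startswith("#")]
print(len(kept_60), len(kept_50))
```

Lowering the cutoff from 60 to 50 admits the second toy record, which mirrors the trade-off described above: more calls retained, with no guarantee that the extra ones are true positives.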

You can check the FORMAT / FT field to see the filter strings of the true positive variants that failed to be called. Descriptions of the filter strings are provided in the VCF header.
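One way to act on this is to tally the FT strings of gold standard variants that did not pass. The sketch below is only an illustration under stated assumptions: it assumes FT holds a semicolon-separated list of filter names (as the VCF specification describes), that the tumour is the first sample column, and that variants can be matched on (CHROM, POS, REF, ALT). The function name `ft_counts_for_missed` and the filter names `fltA` and `fltB` are hypothetical; the real names and descriptions are in the VCF header.

```python
from collections import Counter


def ft_counts_for_missed(vcf_lines, truth_keys, sample_index=0):
    """Count FORMAT/FT filter names over truth variants whose FILTER is not PASS."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        key = (f[0], int(f[1]), f[3], f[4])
        if key not in truth_keys or f[6] == "PASS":
            continue
        fmt_keys = f[8].split(":")
        if "FT" not in fmt_keys:
            continue
        sample = f[9 + sample_index].split(":")
        idx = fmt_keys.index("FT")
        if idx >= len(sample):  # trailing FORMAT fields may be omitted
            continue
        for name in sample[idx].split(";"):
            if name not in ("", ".", "PASS"):
                counts[name] += 1
    return counts


# Toy records with placeholder filter names (not real UVC filter strings).
records = [
    "chr1\t100\t.\tA\tT\t40\tfltA\t.\tGT:FT\t0/1:fltA;fltB\n",
    "chr1\t200\t.\tG\tC\t80\tPASS\t.\tGT:FT\t0/1:PASS\n",
    "chr2\t300\t.\tC\tA\t30\tfltA\t.\tGT:FT\t0/1:fltA\n",
    "chr3\t400\t.\tT\tG\t20\tfltB\t.\tGT:FT\t0/1:fltB\n",  # not in truth set
]
truth = {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"), ("chr2", 300, "C", "A")}

missed_counts = ft_counts_for_missed(records, truth)
print(missed_counts.most_common())
```

If one filter name dominates the tally, its description in the VCF header is the first place to look for why those true variants were rejected.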

Please be aware that the recall rate, by itself, is not a useful metric. The curve of precision as a function of recall is more useful because it reflects both false positive and false negative calls. My guess is that VarDict has very high sensitivity and very low specificity, VarScan has high sensitivity and low specificity, Mutect has low sensitivity and high specificity (using FilterMutectCalls), and UVC has very low sensitivity and very high specificity (using FILTER=PASS)?
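To make the precision-versus-recall point concrete, here is a small Python sketch that computes both at several QUAL cutoffs for one caller. It assumes calls have already been reduced to a mapping from a variant key to its QUAL and that the gold standard is a set of the same keys. The function name `precision_recall_by_qual` and all values are invented for illustration, not measurements from COLO829.

```python
def precision_recall_by_qual(calls, truth, thresholds):
    """Return (threshold, precision, recall) for each QUAL cutoff.

    calls: dict mapping variant key -> QUAL
    truth: set of variant keys in the gold standard
    Precision is None when no calls survive the cutoff.
    """
    rows = []
    for t in thresholds:
        called = {key for key, qual in calls.items() if qual >= t}
        tp = len(called & truth)
        fp = len(called - truth)
        fn = len(truth - called)
        precision = tp / (tp + fp) if (tp + fp) else None
        recall = tp / (tp + fn) if (tp + fn) else None
        rows.append((t, precision, recall))
    return rows


# Hypothetical calls and truth set.
toy_calls = {"a": 80, "b": 65, "c": 55, "d": 52, "e": 35}
toy_truth = {"a", "b", "c", "f"}

pr_rows = precision_recall_by_qual(toy_calls, toy_truth, [60, 50, 30])
for threshold, precision, recall in pr_rows:
    print(threshold, precision, recall)
```

Running the same function on each caller's output, with the cutoff applied to that caller's own score, gives curves that can be compared at equal recall instead of comparing raw counts of recovered variants.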

In order to provide a more detailed answer, I need to know: what are the duplication rates, UMI family sizes, and assay type (PCR-amplicon or hybrid capture)? Does the COLO829 normal also have UMIs? You can email me at cndfeifei@gmail.com if you do not want to disclose such information publicly.

Also, in my personal experience, UMIs are not useful for improving the performance of calling variants with VAF > 0.05.

Best