Sentieon / sentieon-scripts

Helper scripts for biological data processing from Sentieon
BSD 2-Clause "Simplified" License
merging vcf files from the same sample before VQSR and CNV #6

Open grendon opened 2 years ago

grendon commented 2 years ago

The samples were sequenced in two batches at different points in time. Each batch was analyzed separately with the Sentieon pipeline, and we now have two VCFs per sample. How can we merge the calls in the VCFs before running VQSR and CNV on each sample?

DonFreed commented 2 years ago

Merging information from the two VCFs is somewhat complex. There will be some discordant calls and it is not clear how the discordant calls should be handled.

Instead of merging the VCFs, you might pass the BAM files from both batches to the variant caller so that it can use the read information in both datasets to make the most accurate variant calls. If the BAM files from both batches have the same read group sample tag (RGSM), you can pass both BAM files directly to the variant caller:

sentieon driver -i <sample_batch1.bam> -i <sample_batch2.bam> -r <ref> ... \
  --algo Haplotyper ... sample_jointCalls.vcf.gz
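
Before running the command above, it may help to confirm that both BAM files really carry the same RGSM value. The following is a minimal sketch, assuming samtools is installed; `list_rgsm` is a hypothetical helper name, not part of samtools or Sentieon, and the BAM file names in the comments are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the unique SM values found on the @RG lines
# of a SAM header read from stdin.
list_rgsm() {
  awk 'BEGIN { FS = "\t" }
    /^@RG/ { for (i = 2; i <= NF; i++) if ($i ~ /^SM:/) print substr($i, 4) }' | sort -u
}

# Example usage (placeholder file names):
#   samtools view -H sample_batch1.bam | list_rgsm
#   samtools view -H sample_batch2.bam | list_rgsm
```

If both commands print the same single sample name, the two BAM files should be usable together as one sample.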

If the BAM files from the two batches have different RGSM tags, you might use samtools reheader to replace the RGSM information in one of the BAM files.
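
One way the reheader step could look is sketched below. This is an illustration under stated assumptions rather than an official Sentieon recipe: `set_rgsm` and `reheader_bam` are hypothetical helper names, the file names and sample name are placeholders, and the approach (dump the header, rewrite the SM field, apply the new header) is only one of several possible.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite the SM field on every @RG line of a SAM
# header read from stdin. $1 is the new sample name.
set_rgsm() {
  local new_sm="$1"
  awk -v sm="$new_sm" 'BEGIN { FS = OFS = "\t" }
    /^@RG/ { for (i = 2; i <= NF; i++) if ($i ~ /^SM:/) $i = "SM:" sm }
    { print }'
}

# Hypothetical wrapper: write a copy of a BAM file with the new RGSM.
# Arguments: input BAM, output BAM, new sample name.
reheader_bam() {
  local in_bam="$1" out_bam="$2" new_sm="$3"
  local hdr
  hdr="$(mktemp)"
  # Dump the header, edit the SM fields, then apply the edited header.
  samtools view -H "$in_bam" | set_rgsm "$new_sm" > "$hdr"
  samtools reheader "$hdr" "$in_bam" > "$out_bam"
  rm -f "$hdr"
}

# Example usage (placeholder names):
#   reheader_bam sample_batch2.bam sample_batch2.rgsm.bam sampleA
```

The reheadered BAM will most likely need a fresh index (for example with `samtools index`) before it is passed to the variant caller.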

The result would be similar to starting from the FASTQ files for both batches and then processing the data using the multi-FASTQ.sh script.