Gleeson-Lab / wxs_pipeline

Starting with BAMs and FASTQs, follow GATK 4.0 Best Practices up to generating a joint-genotyped VCF

Add single-sample variant calling #10

Closed by brcopeland 2 years ago

brcopeland commented 2 years ago

For some contexts and analyses it is beneficial to have VCFs for single samples as well, so it would be sensible to add that functionality here.

shishenyxx commented 2 years ago

MosaicHunter, DeepMosaic, the Mutect2-Strelka2 pipelines, and the variant annotation pipeline all have a "near indel" annotation or filter. A VCF from each sample is required; otherwise, all mosaic candidates near any indel found in any of the samples would be removed. Individual VCFs should be generated for all WGS, WES, and AmpliSeq samples.
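To illustrate why the per-sample VCF matters for a "near indel" filter, here is a minimal sketch. It is not the actual logic of MosaicHunter, DeepMosaic, or the other pipelines; the function names, the 5 bp window, and the toy coordinates are all assumptions made for illustration.

```python
# Hypothetical "near indel" filter, for illustration only.
# Candidates are (sample, chrom, pos); indels are (chrom, pos) per sample.

def near_indel(chrom, pos, indels, window=5):
    """True if (chrom, pos) lies within `window` bp of any indel site."""
    return any(c == chrom and abs(p - pos) <= window for c, p in indels)

def filter_candidates(candidates, indels_by_sample, window=5, per_sample=True):
    """Drop candidates near an indel.

    per_sample=True  -> compare only against indels from the same sample
                        (needs one VCF per sample).
    per_sample=False -> compare against indels pooled from every sample
                        (what happens if only a joint VCF is available).
    """
    pooled = [site for sites in indels_by_sample.values() for site in sites]
    kept = []
    for sample, chrom, pos in candidates:
        indels = indels_by_sample.get(sample, []) if per_sample else pooled
        if not near_indel(chrom, pos, indels, window):
            kept.append((sample, chrom, pos))
    return kept

# Toy data: the same site is a candidate in two samples, but only S1
# carries a nearby indel.
candidates = [("S1", "chr1", 1000), ("S2", "chr1", 1000)]
indels_by_sample = {"S1": [("chr1", 1003)], "S2": []}

kept_per_sample = filter_candidates(candidates, indels_by_sample, per_sample=True)
kept_pooled = filter_candidates(candidates, indels_by_sample, per_sample=False)
```

With per-sample indels, the S2 candidate survives; with pooled indels, both candidates are dropped even though S2 has no indel nearby.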

brcopeland commented 2 years ago

Good to know; I plan to work on this soon.

shishenyxx commented 2 years ago

I think keeping only the separate .g.vcf(.gz) for each file and one combined .g.vcf(.gz) file for all inputs might already be enough. What do you think?

brcopeland commented 2 years ago

I personally like the idea of retaining a joint VCF, single-sample VCFs, and single-sample gVCFs. The reason for the latter is that we may want to perform joint genotyping on other combinations of samples in the future. Does that make sense to you?
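As a rough sketch of how those three outputs relate, the snippet below builds the GATK 4 command lines that would produce them. It is not taken from this pipeline: the helper names, file-naming scheme, and paths are assumptions, and a real run would add intervals, resources, and so on. The commands are only constructed here, not executed.

```python
from pathlib import Path

def single_sample_commands(sample, bam, reference, outdir):
    """Commands for one sample: a gVCF, then a single-sample VCF from it."""
    out = Path(outdir)
    gvcf = out / f"{sample}.g.vcf.gz"
    vcf = out / f"{sample}.vcf.gz"
    # HaplotypeCaller in GVCF mode yields the reusable per-sample gVCF.
    call = ["gatk", "HaplotypeCaller", "-R", str(reference),
            "-I", str(bam), "-O", str(gvcf), "-ERC", "GVCF"]
    # Genotyping that gVCF on its own yields the single-sample VCF.
    genotype = ["gatk", "GenotypeGVCFs", "-R", str(reference),
                "-V", str(gvcf), "-O", str(vcf)]
    return [call, genotype]

def joint_commands(gvcfs, reference, outdir, name="cohort"):
    """Commands for the cohort: combine per-sample gVCFs, then joint-genotype."""
    out = Path(outdir)
    combined = out / f"{name}.g.vcf.gz"
    joint_vcf = out / f"{name}.vcf.gz"
    combine = ["gatk", "CombineGVCFs", "-R", str(reference)]
    for gvcf in gvcfs:
        combine += ["-V", str(gvcf)]
    combine += ["-O", str(combined)]
    genotype = ["gatk", "GenotypeGVCFs", "-R", str(reference),
                "-V", str(combined), "-O", str(joint_vcf)]
    return [combine, genotype]

# Hypothetical inputs, for illustration only.
per_sample = single_sample_commands("S1", "S1.bam", "ref.fasta", "out")
joint = joint_commands(["out/S1.g.vcf.gz", "out/S2.g.vcf.gz"], "ref.fasta", "out")
```

Because the per-sample gVCFs are kept, `joint_commands` can be rerun later on any other subset of samples without calling variants from the BAMs again.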

shishenyxx commented 2 years ago

Yeah, that makes a lot of sense ... it's just that single gVCFs are larger (say, for each AmpliSeq file the BAM itself is 1G, but the gVCF is 3G ...)

brcopeland commented 2 years ago

That doesn't match my experience. As an example, one of Changuk's BAMs is 1021 MB while the gVCF is 3.3 MB.

His 144 BAMs take up 201 GB while 142 gvcfs take up 574 MB.
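For a quick sanity check of those figures, the arithmetic can be sketched as below. It uses only the numbers quoted above and assumes 1 GB = 1000 MB for the conversion.

```python
def mean_size_mb(total_mb, n_files):
    """Average file size in MB."""
    return total_mb / n_files

# Totals quoted above; 1 GB is taken as 1000 MB (an assumption).
bam_mean = mean_size_mb(201 * 1000, 144)   # 144 BAMs, 201 GB in total
gvcf_mean = mean_size_mb(574, 142)         # 142 gVCFs, 574 MB in total

cohort_ratio = bam_mean / gvcf_mean        # average BAM size vs average gVCF size
single_ratio = 1021 / 3.3                  # the single BAM/gVCF pair quoted above

print(f"mean BAM:  {bam_mean:.0f} MB")
print(f"mean gVCF: {gvcf_mean:.1f} MB")
print(f"BAM/gVCF ratio (cohort average): {cohort_ratio:.0f}x")
print(f"BAM/gVCF ratio (single example): {single_ratio:.0f}x")
```

That works out to roughly 1.4 GB per BAM against about 4 MB per gVCF, so in this dataset the gVCFs are a few hundred times smaller than the BAMs.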

brcopeland commented 2 years ago

I implemented this in c9d4fa6791b009e23f9edf945b493335524f6b2b.