Rohit-Satyam opened this issue 2 years ago
I came across a recent collaborative preprint by NCBI and several other institutes on best protocols for generating consensus assemblies (see here). They recommend the following protocol:
I also observed that you assign variants based only on VAF and not on allele depth. For example, a position might be covered by just 10 reads, of which 9 support the alternate allele. In that case the VAF is 0.9, but the depth at that position is not enough to call the variant with confidence.
Another example from real data:
MN908947.3 4184 . G A 42.872 PASS DP=69;DPS=30,39;Pool=2;vafator_af=0.88406;vafator_ac=61;vafator_dp=69;vafator_eaf=0.5;vafator_pu=1;vafator_pw=1;vafator_k=4;vafator_bq=4.5,13;vafator_mq=60,60;vafator_pos=182,175;vafator_rsmq=-0.055;vafator_rsmq_pv=0.95646;vafator_rsbq=2.088;vafator_rsbq_pv=0.03677;vafator_rspos=0.491;vafator_rspos_pv=0.62317 GT:GQ 1:43
Here we have DP=69, and if we were to apply a depth cutoff of 100 reads (as suggested by the paper above) in addition to the VAF cutoff, this variant would not pass.
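For illustration, here is a minimal sketch of how such a joint depth/VAF filter could be expressed with bcftools, reusing the vafator annotations shown above; the thresholds and file names are placeholders, not a recommendation:

```bash
# Sketch only: soft-filter calls that fail either the depth or the VAF cutoff,
# using the vafator INFO annotations present in the pipeline's VCFs.
# Thresholds (DP >= 100, VAF >= 0.8) and file names are illustrative placeholders.
bcftools filter \
  --exclude 'INFO/vafator_dp < 100 || INFO/vafator_af < 0.8' \
  --soft-filter LOW_SUPPORT \
  -Oz -o sample.filtered.vcf.gz \
  sample.vcf.gz
```

Under such a filter the DP=69 call above would be flagged even though its VAF is high.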
I just added the FASTA sequences to the output in #37.
Will follow up on the rest; the filtering by depth requires a thorough analysis from our side. The whole dataset of mutations is publicly available in tabular format here: https://covigator.tron-mainz.de/download/variant_observation.csv.gz. It contains the VAF, DP and AC annotations for every variant call, so the impact of such filtering could be evaluated there.
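For what it is worth, a rough way to gauge the impact on that export could look like the sketch below; the column names (`vaf`, `dp`, `ac`) are assumptions and should be checked against the actual header of the CSV first:

```bash
# Sketch only: count how many variant observations in the public export would
# fail the proposed filter. Column names are assumptions; inspect the header of
# variant_observation.csv.gz before relying on this.
zcat variant_observation.csv.gz | awk -F',' '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }  # map header names to indices
  { total++; if ($col["vaf"] < 0.5 || $col["dp"] < 100 || $col["ac"] < 50) dropped++ }
  END { printf "%d of %d observations fail the filter (%.1f%%)\n", dropped, total, 100 * dropped / total }
'
```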
Hi @priesgo. Were you able to check the impact of such filtering? I realized that using the criteria `INFO/vafator_af < 0.5 || INFO/vafator_dp < 100 || INFO/vafator_ac < 50` affects clade assignment in some samples and filters out a lot of reverse substitutions (private mutations) according to Nextclade.
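For reference, a before/after comparison can be sketched with the Nextclade CLI roughly as below; the dataset name and file names are placeholders:

```bash
# Sketch only: assign clades to consensus sequences built with and without the
# filter and compare the resulting TSVs. Dataset and file names are placeholders.
nextclade dataset get --name sars-cov-2 --output-dir data/sars-cov-2
nextclade run --input-dataset data/sars-cov-2 \
  --output-tsv nextclade_unfiltered.tsv consensus_unfiltered.fasta
nextclade run --input-dataset data/sars-cov-2 \
  --output-tsv nextclade_filtered.tsv consensus_filtered.fasta
# Diffing the clade and private mutation columns of the two TSVs shows where
# the filter changes the assignment.
```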
Apologies for the ultra-late response, @Rohit-Satyam, we have not looked into this yet. We will eventually get to it.
FYI @johausmann
Stumbling upon this here. As a consumer of consensus sequences, I am very aware of some pipelines having a strong tendency to erroneously call the reference allele when we know from phylogenetic placement that this is unlikely to be the case.
What @Rohit-Satyam explains appears to be a possible mechanism to explain that.
@priesgo do you have a rough indication of how many sequences in GISAID were produced with this pipeline? Are there some labs that you know are using it? Unfortunately, submitters don't usually point out the pipeline they used.
Dear @priesgo
While going through the pipeline, I realized that the VCF2FASTA step does not mask regions of low coverage. In amplicon sequencing data there can be regions deprived of reads. I think it makes sense to mask regions covered by fewer than 20 reads with Ns, since we cannot say anything about the variant status there. Emitting the reference sequence in such regions might wrongly suggest that there is no variant in that region.
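For illustration, here is a minimal sketch of how such masking could be done with bedtools and bcftools outside the pipeline; the 20X threshold follows the comment above and the file names are placeholders:

```bash
# Sketch only: derive regions covered by fewer than 20 reads from the alignment
# and mask them with Ns when building the consensus. File names are placeholders;
# the VCF must be bgzip-compressed and indexed.
bedtools genomecov -ibam sample.bam -bga \
  | awk '$4 < 20' \
  | bedtools merge -i - > low_coverage.bed

bcftools consensus \
  --fasta-ref MN908947.3.fasta \
  --mask low_coverage.bed \
  sample.vcf.gz > sample.consensus.fasta
```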
Besides, could you enable the pipeline to copy these FASTA assemblies to the results directory as well? I tried using `--keep_intermediate`, but it didn't help. I can see that there is no `publishDir "${params.output}", mode: "copy"` directive for the `VCF2FASTA` process.

EDIT 1: Also, I think the code takes the normalized VCFs and, without filtering out the low-frequency and subclonal variants, includes all of them in the assembly. Is such an assembly suitable for GISAID submission? Because, as you say in your documentation:
> Other low quality mutations are removed from the output.
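To make the question about subclonal variants concrete, this is the kind of pre-filtering I have in mind before building the assembly (a sketch only; the 0.8 VAF threshold and file names are assumptions, not the pipeline's current behaviour):

```bash
# Sketch only: keep PASS calls and drop low-frequency/subclonal variants before
# generating the consensus. Threshold and file names are illustrative.
bcftools view --apply-filters PASS \
  --exclude 'INFO/vafator_af < 0.8' \
  -Oz -o clonal.vcf.gz sample.normalized.vcf.gz
bcftools index clonal.vcf.gz
bcftools consensus --fasta-ref MN908947.3.fasta clonal.vcf.gz > sample.consensus.fasta
```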