TRON-Bioinformatics / covigator-ngs-pipeline

A Nextflow pipeline for NGS variant calling on SARS-CoV-2. From FASTQ files to normalized and annotated VCF files from GATK, BCFtools, LoFreq and iVar.
MIT License
18 stars 7 forks source link

Discussion on Masking the regions with coverage less than 20X #35

Open Rohit-Satyam opened 2 years ago

Rohit-Satyam commented 2 years ago

Dear @priesgo

While going through the pipeline, I realized that the VCF2FASTA does not masks the regions of low coverage. In amplicon sequencing data, there might be regions deprived of reads. I think it make sense to mask such regions with Ns that we know are covered by less than 20 reads as we can't say about variant status in that region. Using the reference sequence in such regions might indicate that there is no variant in that region.

Besides, could you enable the pipeline to copy these fasta assemblies as well to the result directory. I tried using --keep_intermediate but it didn't help. I can see there is no publishDir "${params.output}", mode: "copy" for VCF2FASTA process.

EDIT 1: Also, I think the code take the normalized VCFs and without filtering the Low Frequency and Subclonal variants, it includes all of them in the assembly. Is this assembly good for GISAID submission? Because as you say in your documentation Other low quality mutations are removed from the output.

Rohit-Satyam commented 2 years ago

I came across a recent collaborative preprint by NCBI and several other institutes on best protocols for generation of Consensus assembly (see here). They recommend the following protocol

*DAVID*

Rohit-Satyam commented 2 years ago

I also observed that you assign the Variants just based on VAF and not Allele Depth. For Example a variant might be supported by just 10X reads out of which 9 support it's presence. In that case VAF will be 0.8 but the depth at that region is not enough to say with confidence.

Another example from real data

MN908947.3  4184    .   G   A   42.872  PASS    DP=69;DPS=30,39;Pool=2;vafator_af=0.88406;vafator_ac=61;vafator_dp=69;vafator_eaf=0.5;vafator_pu=1;vafator_pw=1;vafator_k=4;vafator_bq=4.5,13;vafator_mq=60,60;vafator_pos=182,175;vafator_rsmq=-0.055;vafator_rsmq_pv=0.95646;vafator_rsbq=2.088;vafator_rsbq_pv=0.03677;vafator_rspos=0.491;vafator_rspos_pv=0.62317  GT:GQ   1:43

Here we have DP=69 and if we were to set a Depth cutoff of 100 reads (As suggested by paper above) as well with VAF cutoff this variant won't pass.

priesgo commented 2 years ago

Just added in #37 the FASTA sequences to the output.

Will follow up on the rest, the filtering by depth requires a thorough analysis from our side. The whole dataset of mutations in tabular format is publicly available here https://covigator.tron-mainz.de/download/variant_observation.csv.gz, that contains the VAF, DP and AC annotations for every variant call so the impact of such filtering could be evaluated there.

Rohit-Satyam commented 1 year ago

Hi @priesgo. Were you able to check the impact of such filtering. I realized that using criteria INFO/vafator_af < 0.5 || INFO/vafator_dp < 100 || INFO/vafator_ac < 50 affects clade assignment in some samples and filters a lot of reverse substitutions (Private Mutations) according to NextClade.

priesgo commented 1 year ago

Apologies for ultra-late response @Rohit-Satyam , we have not looked into this yet. We will eventually get here.

FYI @johausmann

corneliusroemer commented 1 year ago

Stumbling upon this here. As a consensus sequence consumer - I'm very aware of some pipelines having a strong tendency to erroneously call reference when we know that it's unlikely to be the case due to phylogenetic placement.

What @Rohit-Satyam explains appears to be a possible mechanism to explain that.

@priesgo do you have a rough indication of how many sequences in GISAID were produced with this pipeline? Are there some labs that you know are using it? Unfortunately, submitters don't usually point out the pipeline they used.