Open tavareshugo opened 2 years ago
For example, the column FILTER
includes PASS
, even if a variant doesn't make it to the final consensus.
For Illumina, ivar consensus
removes variants below a certain threshold of AF, but this is not indicated in that variants table.
We can see which variants are in the final consensus from the VCF files in
variants/ivar/consensus/bcftools/*.filtered.vcf.gz
variants/medaka/*.pass.unique.vcf.gz
(?? not sure - need to double-check)Possibly use bcftools merge | bcftools query
to do this.
The bcftools query
command used in the pipeline is here (notice it's slightly different depending on the caller).
The strategy to bcftools merge | bcftools query
doesn't really work, because:
./.
. It can be set to output 0/0
, but that's not accurate either, because the samples may have had missing data in those positions. An alternative (which is a bit more involved) is to do the following:
# Create a shell variable with the sample names from our clean FASTA file
SAMPLES=$(grep ">" report/consensus.fa | sed 's/>//')
# Create a CSV file containing the column names of our new table
echo "sample,chrom,ref,alt" > report/variants.csv
# Use a for loop to run bcftools query on each sample
# adding the result of each iteration to the CSV file we created above
for SAMPLE in $SAMPLES
do
bcftools query -f "${SAMPLE},%CHROM,%POS,%REF,%ALT\n" results/viralrecon/medaka/${SAMPLE}.pass.unique.vcf.gz >> report/variants.csv
done
Which results in a CSV file in long format. This is probably enough for reporting, etc.
This is my current understanding about variants results:
--platform nanopore
--> only uses filtered variants for all analysis (MultiQC, SnpEff, long table). --platform illumina
--> the "# SNPs" and "# INDELs" reported in the MultiQC table are for filtered variants. However, the long table and SnpEff analysis include all variants (including those that do not make it through the 0.75 threshold).I have also empirically checked this on a set of 48 samples each on Illumina and Nanopore pipelines.
See #19 for details of where this is found in the pipeline
Need to clarify what some of the columns in
variants_long_table.csv
file indicate. There's some inconsistencies sometimes, for example:REF_DP
+ALT_DP
doesn't add up; for indels, oftenREF_DP
=ALT_DP
withAF
= 1.It's also quite hard to know which of those mutations are actually part of the final consensus sequence.
Would be good to clarify these things.