Variants long table CSV - clarifications

tavareshugo commented 2 years ago

Need to clarify what some of the columns in variants_long_table.csv file indicate. There's some inconsistencies sometimes, for example: REF_DP + ALT_DP doesn't add up; for indels, often REF_DP = ALT_DP with AF = 1.

It's also quite hard to know which of those mutations are actually part of the final consensus sequence.

Would be good to clarify these things.

tavareshugo commented 2 years ago

For example, the column FILTER includes PASS, even if a variant doesn't make it to the final consensus. For Illumina, ivar consensus removes variants below a certain threshold of AF, but this is not indicated in that variants table.

We can see which variants are in the final consensus from the VCF files in

Illumina: variants/ivar/consensus/bcftools/*.filtered.vcf.gz
Nanopore: variants/medaka/*.pass.unique.vcf.gz (?? not sure - need to double-check)

tavareshugo commented 2 years ago

Possibly use bcftools merge | bcftools query to do this.

The bcftools query command used in the pipeline is here (notice it's slightly different depending on the caller).

tavareshugo commented 2 years ago

The strategy to bcftools merge | bcftools query doesn't really work, because:

After merging, samples without a variant in a given position are given ./.. It can be set to output 0/0, but that's not accurate either, because the samples may have had missing data in those positions.
The output of the query is not in "long" format, which is also not ideal for downstream analysis.

An alternative (which is a bit more involved) is to do the following:

# Create a shell variable with the sample names from our clean FASTA file
SAMPLES=$(grep ">" report/consensus.fa | sed 's/>//')

# Create a CSV file containing the column names of our new table
echo "sample,chrom,ref,alt" > report/variants.csv

# Use a for loop to run bcftools query on each sample
# adding the result of each iteration to the CSV file we created above
for SAMPLE in $SAMPLES
do
  bcftools query -f "${SAMPLE},%CHROM,%POS,%REF,%ALT\n" results/viralrecon/medaka/${SAMPLE}.pass.unique.vcf.gz >> report/variants.csv
done

Which results in a CSV file in long format. This is probably enough for reporting, etc.

tavareshugo commented 2 years ago

This is my current understanding about variants results:

--platform nanopore --> only uses filtered variants for all analysis (MultiQC, SnpEff, long table).
--platform illumina --> the "# SNPs" and "# INDELs" reported in the MultiQC table are for filtered variants. However, the long table and SnpEff analysis include all variants (including those that do not make it through the 0.75 threshold).
So, if one wants a list of the actual variants that make it through to the consensus, then filter this table for AF > 0.75 and DP > 10.

I have also empirically checked this on a set of 48 samples each on Illumina and Nanopore pipelines.

tavareshugo commented 2 years ago

See #19 for details of where this is found in the pipeline

cambiotraining / sars-cov-2-genomics

Variants long table CSV - clarifications #23