cambiotraining / sars-cov-2-genomics

Course materials for an introduction to SARS-CoV-2 sequencing data analysis
http://cambiotraining.github.io/sars-cov-2-genomics/

Variants long table CSV - clarifications #23

Open tavareshugo opened 2 years ago

tavareshugo commented 2 years ago

Need to clarify what some of the columns in the variants_long_table.csv file indicate. There are some inconsistencies, for example: REF_DP + ALT_DP doesn't always add up; for indels, often REF_DP = ALT_DP with AF = 1.
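To see how often the depth columns disagree, a rough check along these lines might help. This is only a sketch: check_depths is a hypothetical helper name, it assumes the table is a plain comma-separated file with no quoted commas, and it assumes there is a total depth column called DP next to REF_DP and ALT_DP (that column name is my assumption, it is not confirmed above).

```shell
# Hypothetical helper: print the header plus every row where
# REF_DP + ALT_DP differs from DP.
# Assumptions: plain comma-separated input (no quoted commas) and a
# header containing columns named DP, REF_DP and ALT_DP.
check_depths() {
  awk -F',' '
    NR == 1 {
      # Map column names to their positions so column order does not matter
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("DP" in col) || !("REF_DP" in col) || !("ALT_DP" in col)) {
        print "missing DP, REF_DP or ALT_DP column" > "/dev/stderr"
        exit 1
      }
      print
      next
    }
    # Keep only the rows where the two depths do not sum to the total
    ($(col["REF_DP"]) + $(col["ALT_DP"])) != ($(col["DP"]) + 0)
  ' "$1"
}
```

It could then be run as `check_depths variants_long_table.csv` (adjusting the path to wherever the table lives) to list the rows worth a closer look.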

It's also quite hard to know which of those mutations are actually part of the final consensus sequence.

Would be good to clarify these things.

tavareshugo commented 2 years ago

For example, the column FILTER includes PASS, even if a variant doesn't make it to the final consensus. For Illumina, ivar consensus removes variants below a certain threshold of AF, but this is not indicated in that variants table.
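As a rough approximation of that filtering, the table could be subset by AF. This is a sketch only: filter_by_af is a hypothetical helper, the threshold is passed in as an argument because the value used by the pipeline is not stated here, and it assumes a plain comma-separated file (no quoted commas) with a header column named AF. It will not necessarily reproduce exactly what ivar consensus does.

```shell
# Hypothetical helper: keep the header plus rows whose AF is at or above
# a user-supplied minimum.
# Usage: filter_by_af <csv> <min_af>
# Assumptions: plain comma-separated input and a header column named AF.
filter_by_af() {
  awk -F',' -v min_af="$2" '
    NR == 1 {
      # Find the position of the AF column from the header
      for (i = 1; i <= NF; i++) if ($i == "AF") af = i
      if (!af) { print "no AF column found" > "/dev/stderr"; exit 1 }
      print
      next
    }
    # Force numeric comparison on both sides
    ($af + 0) >= (min_af + 0)
  ' "$1"
}
```

For example, `filter_by_af variants_long_table.csv 0.5` would keep variants with AF of at least 0.5; the 0.5 is just an illustrative value, not the pipeline's setting.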

We can see which variants are in the final consensus from the VCF files in

tavareshugo commented 2 years ago

Possibly use bcftools merge | bcftools query to do this.

The bcftools query command used in the pipeline is here (notice it's slightly different depending on the caller).

tavareshugo commented 2 years ago

The bcftools merge | bcftools query strategy doesn't really work, because:

An alternative (which is a bit more involved) is to do the following:

# Create a shell variable with the sample names from our clean FASTA file
SAMPLES=$(grep ">" report/consensus.fa | sed 's/>//')

# Create a CSV file containing the column names of our new table
echo "sample,chrom,pos,ref,alt" > report/variants.csv

# Use a for loop to run bcftools query on each sample
# adding the result of each iteration to the CSV file we created above
for SAMPLE in $SAMPLES
do
  bcftools query -f "${SAMPLE},%CHROM,%POS,%REF,%ALT\n" "results/viralrecon/medaka/${SAMPLE}.pass.unique.vcf.gz" >> report/variants.csv
done

This results in a CSV file in long format, which is probably enough for reporting, etc.
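As a quick sanity check on the resulting table, the number of variants per sample can be tallied. A minimal sketch, assuming the sample name is the first column and the first line is a header; count_variants is a hypothetical helper name:

```shell
# Hypothetical helper: count rows per sample in a long-format CSV.
# Assumptions: first line is a header, sample name is in column 1.
count_variants() {
  awk -F',' '
    NR > 1 { n[$1]++ }                          # skip header, tally by sample
    END { for (s in n) print s "," n[s] }
  ' "$1" | sort
}
```

Running `count_variants report/variants.csv` prints one `sample,count` line per sample. Note that samples with no variants in their VCF will not appear at all.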

tavareshugo commented 2 years ago

This is my current understanding about variants results:

I have also empirically checked this on a set of 48 samples for each of the Illumina and Nanopore pipelines.

tavareshugo commented 2 years ago

See #19 for details of where this is found in the pipeline