andersen-lab / ivar

iVar is a computational package that contains functions broadly useful for viral amplicon-based sequencing.
https://andersen-lab.github.io/ivar/html/
GNU General Public License v3.0
115 stars 39 forks source link

Export variants to VCF format #17

Open gkarthik opened 4 years ago

gkarthik commented 4 years ago

ivar variants outputs results only in custom .tsv format so unusable with other tools.

Allow option to export variants in VCF v4.3 format.

drpatelh commented 4 years ago

@svarona wrote a simple Python script to do this for the nf-core/viralrecon pipeline if its useful @gkarthik.

paulstretenowich commented 4 years ago

Hi, for deletions ivar is outputting, for a reference having ACGT and a read having a deletion of CGT:

REF ALT A -CGT

And general VCF format is:

REF ALT ACGT A

So in your python script you should do something like:

ref = line[2]
alt = line[3]
if alt[0] == "-":
    ref += alt[1:]
    alt = line[2]

Paul

drpatelh commented 4 years ago

Thanks @paulstretenowich Ill fix that 👍

drpatelh commented 4 years ago

Having looked at this properly now @paulstretenowich it appears that there was a bug in the way insertions were being reported too as fixed here

paulstretenowich commented 4 years ago

You are right @drpatelh I did just a naive fix on a specific case I was facing but I suspected other cases. Thanks 👍

fanninpm commented 3 years ago

Summary

I ran a few of iVar's franken-VCF files (from the Cecret workflow) through the vcfR package, but not all of the expected graphs showed up when I generated a chromR object and plotted it. On the iVar end, it would be nice to have the MQ and QUAL fields. (The DP field seems to already be present, however it isn't used for some reason.)


Reproducing

I have included the following instructions for retracing my steps (some of this is aimed more towards the people coming from the vcfR repo):

Steps to reproduce: System Requirements: - Downloading the initial FASTQ files: - `sra-tools` (available on [Bioconda](https://anaconda.org/bioconda/sra-tools)) - Generating the VCF file: - the `master` branch of https://github.com/UPHL-BioNGS/Cecret (you actually have to download this, as the repo itself has no `main.nf` file) - `nextflow` (available on [Bioconda](https://anaconda.org/bioconda/nextflow)) - `docker` or `singularity` (the latter is available on [conda-forge](https://anaconda.org/conda-forge/singularity)) - gigabytes of space for the Docker/Singularity images (you can prune this down by turning things off in a nextflow `{filename}.config` file) - Using vcfR to generate the plots: - a recent enough version of R (I was using version 3.5.2) - `vcfR` (available on CRAN, but you can also get it from [conda-forge](https://anaconda.org/conda-forge/r-vcfr) if you prefer downloading R and vcfR via conda) - (OPTIONAL) RStudio (so you can easily interact with R and save the plots) ### Step 1 (downloading the initial FASTQ files): - Create a directory in which you will hold your paired-end FASTQ gz archives. Navigate to that directory. - Run the following commands (the `-p` flag is optional, and displays a progress bar): ```none prefetch -p SRR11953924 fasterq-dump -p --split-files SRR11953924/SRR11953924.sra ``` Here, I chose SRR11953924 as an example, as it seemed to be the most interesting. ### Step 2 (generating the VCF file): - (OPTIONAL) Write a config file that disables what you don't need (this prevents docker/singularity from downloading excessive containers). Here's an example (note that `process.ivar_variants` is the process that runs `ivar variants`, so it should remain set to `true`): ```none params.fastqc = false params.ivar_variants = true params.samtools_stats = false params.samtools_coverage = false params.samtools_flagstat = false params.samtools_ampliconstats = false params.bedtools_multicov = false params.nextclade = false params.snpdists = false params.iqtree = false ``` - Run the Cecret workflow, generating the VCF file, with a command similar to: ```none nextflow -C {your_config_file}.config run Cecret.nf -c configs/singularity.config {directory_from_step_1} ``` ### Step 3 (using vcfR to generate the plots): - NOTE: Because the SARS-CoV-2 genome is only ~30 kbp in size, I found a window size of 500 (the default is 1000) worked best for display purposes. - Run the following R commands (folks coming from the iVar repository can feel free to ignore the spurious warning about the names not matching up): ```R library(vcfR) dna <- ape::read_dna("{cecret_directory}/configs/MN908947.3.fasta", format = "fasta") gff <- ape::read_gff("{cecret_directory}/configs/MN908947.3.gff") vcf <- read.vcfR("{path_to_your_cecret_output}/ivar_variants/SRR11953924.ivar_variants.vcf", verbose = TRUE) chrom <- create.chromR(name = "SRR11953924", vcf = vcf, seq = dna, ann = gff) chrom <- masker(chrom) chrom <- proc.chromR(chrom, win.size = 500, verbose = TRUE) plot(chrom) chromoqc(chrom) ```

Here's the chromR plot (window size 500, notice that 3 out of the 4 plots are missing, compared to the first plot found on this page): image

Here's the chromoqc plot (window size 500, the low number of variants per site is normal for this sample; however, the top three plots are missing, compared to the last plot found on this page): image

Moslemecoh commented 5 months ago

Filter variants across replicates with iVar The following is the command for filter variants: ivar filtervariants -t 0.5 -p test.filtered P1_S1.tsv P2_S2.tsv P3_S3.tsv

But always encountering the error message below. Header format of P1_S1.tsv did not match! Please use files generated using "ivar variants" command

I have double checked the header format of those file which do not have any change. Even though, I have changed tsv to txt file. What could be the problem and solution to resolve this issue.

cmaceves commented 5 months ago

Hi @Moslemecoh! Thanks for bringing this up - I think the better place for this conversation is #178. If you could redirect there, and possibly share the files from the command I can take a look.