NetaZuckerman / covid19

SARS-Cov2 identification pipeline
MIT License
7 stars 2 forks source link

covid19

CoronaPipeline conda environment

  1. Produce reports for each fastq file in fastq/raw (or any other location of your choice) \ conda activate CoronaPipeline\ python3 QC.py --reports -i fastq/raw/

if trimming is not necessary skip to step 6.

  1. Create a template csv file for convenient trimming. See details at Template. \ python3 QC.py --template -i fastq/raw/ \ Take a look at template.csv produced as output and fix some trimming parameters according to fastqc/multiqc reports. Scroll down to read more about the csv file.

  2. Trim fastq files according to the csv file created in (2). \ python3 QC.py --trim template.csv -i /input/path/to/fastq/files -o /trimmed/path

  3. Produce second reports to make sure you're happy with the trimming results. \ python3 QC.py --reports -i /trimmed/path -o /output/fastqc/2/path \ Be sure to provide the path to the trimmed fastq files this time. Repeat steps 2-5 if more trimming is needed.

  4. Run pipeline to produce consensus sequences for your samples, and get additional info such as number of reads, depth, coverage, etc. in results/report.txt. Run the pipeline on the trimmed fastq files (the pipeline uses the paired files, ignoring singletons). \ -i: input fastq files path \

    • Be sure to deactivate environment before running the pipeline.sh! Conda is automatically activated in the pipeline as well as pangolins environment.

    conda deactivate\ bash pipeline.sh -i fastq/trimmed/ -r /refs/refseq.fa

Additional info: QC.py

Usage:

python QC.py [-h] [-t [CSV] | --template | -r] [-i WD] [-o OUTPUT_PATH]

-r | --reports : QC Reports

Produces fastqc and multiqc reports for all fastq files. \ -i : path to fastq files locations. default: fastq/raw/ \ -o : path to drop the template.csv. default: QC/ \ python3 QC.py -r -i path/to/fastq/files -o path/for/output/reports \ To create QC reports with default input and output arguments: python3 QC.py -r

--template : Template of trimmomatic command for each sample

-i : path to fastq files location. default: fastq/raw/ \ -o : path to drop the template.csv. default: working directory \ python3 QC.py --template -i some/path/to/fastq/location/ -o some/path/for/output \ To create template with default input and output arguments: \ python3 QC.py --template

-t|--trim <path/to/csv> <optional-path/to/base/dir> : Trim fastq files with trimmomatic, according to csv file.

-i : path to fastq files location. default: fastq/raw/ \ -o : path to drop the trimmed samples. default: fastq/trimmed \ python3 QC.py -t path/to/file.csv -i path/to/raw/fastq/files -o path/to/trimmed/fatsq/output \ To trim fastq files with default input and output arguments: \ python3 QC.py -t path/to/file.csv or python3 QC.py --trim path/to/file.csv\ To trim with the template csv as input, simply run python3 QC.py -t template.csv.

The CSV file:

Template csv header:

ends, input_forward, input_reverse, phred, threads, ILLUMINACLIP:, SLIDINGWINDOW:, LEADING:, TRAILING:, CROP:, HEADCROP:, MINLEN:

Each line will represent a fastq file.

Required fields: ends, input_forward. Make sure those fields are filled in the csv file \ trimmomatic requires output files as well, but those are given automatically in the script. Don't include them in the csv.

Optional fields: Do not remove unwanted fields, just leave them empty and it will be ignored.

NOTE: When adding new arguments to csv, add the argument as it will appear in the Trimmomatic command, including symbols like -, :, etc.

For more info about Trimmomatic's arguments, see Trimmomatic documentations: http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf, http://www.usadellab.org/cms/?page=trimmomatic

Additional Info: pipeline.sh

Usage:

bash pipeline.sh [-h | -r [REFSEQ_PATH] [-i WD] [-o OUTPUT_PATH]

Run pipeline on samples to get consensus sequences and additional data:

-i /input/path : provide input path (required)

/input/path - the path to fastq files.

-r|--refseq <refseq/path/> : provide refseq path (required)

Don't worry about indexing the fasta file, it happens automatically.

--spike (optional)

run the pipeline for hi-spike sequencing. \ bash pipeline.sh -r path/to/spike_ref.fa -i path/to/fastq/location -o path/for/output/reports --spike

-n|--newNextclade (optional)

use nextclade version 1.3.0. \ bash pipeline.sh -r path/to/ref.fa -i path/to/fastq/location -o path/for/output/reports --newNextclade

-p|--processes (optional)

number of processes (samples) to run in parallel. default: 1. \ bash pipeline.sh -r path/to/ref.fa -i path/to/fastq/location -o path/for/output/reports -p 5

-t|--threads (optional)

number of threads for each sample. default: 32. \ bash pipeline.sh -r path/to/ref.fa -i path/to/fastq/location -o path/for/output/reports --threads 32

If you are still confused, here are some examples

To get the usage menu: \ bash pipeline.sh -h OR bash pipeline.sh --help

Results:

The consensus sequences are found in CNS/ and CNS_5/, and further info in results/

Pipeline's Basic Steps

conda activate CoronaPipeline

  1. Index the reference sequence. \ bwa index /refseq/path

  2. For each fastq sample - map to reference and keep as bam file. \ For each fastq file run: \ bwa mem /refseq/path sample_R1 sample_R2 | samtools view -b - > BAM/sample_name.bam \ R1 and R2: forward and reverse paired ends.

  3. Keep only mapped reads of each sample. \ For each bam file run: \ samtools view -b -F 260 file_name.bam > BAM/file_name.mapped.bam

  4. Sort and index each sample in BAM. \ For each mapped bam file run: samtools sort BAM/file_name.mapped.bam BAM/file_name.mapped.sorted.bam

  5. Create consensus files for each sample. \ For each mapped and sorted bam file run: samtools mpileup -A BAM/file_name.mapped.sorted.bam | ivar consensus -m 5 -p CNS5/file_name

  6. Align all consensus sequences and references sequence: \ Gather all fasta sequences (consensus + reference): \ cat CNS5/*.fasta /refseq/path > alignment/all_not_aligned.fasta \ Align with augur (mafft based) and save output in alignment/ directory: \ augur align \ --sequences alignment/all_not_aligned.fasta \ --reference-sequence /refseq/path \ --output alignment/all_aligned.fasta

  7. Classify mutations with pangolin: \ conda deactivate \ conda activate pangolin \ pangolin alignment/all_not_aligned.fasta --outfile results/pangolinClades.csv \ conda deactivate

  8. Check for mutations and variants: \ conda activate CoronaPipeline \ python MutTable.py alignment/all_aligned.fasta results/nuc_muttable.xlsx \ python translated_table.py alignment/all_aligned.fasta results/AA_muttable.xlsx regions.csv \ python variants.py alignment/all_aligned.fasta results/variants.csv results/pangolinClades.csv

  9. Report.txt: \ some coverage statistics and number of mapped reads: samtools coverage -H BAM/sample_name.mapped.sorted.bam \ total number of reads in sample: samtools view -c BAM/sample_name.bam \ depth of each position in sample: samtools depth BAM/sample_name.mapped.sorted.bam \ Check out pipeline.sh code for specifics.

Minipipeline

Script for analysis of fasta files (instead of fastq)

As in the regular pipeline, make sure to set the CoronaPipeline conda environment. No need to actiavte it before the run. \

Usage:

bash /path/to/minipipeline.sh -i fasta_path.fa -r covid_refseq.fasta

--dontAlign : run the minipipeline with aligned fasta input. (optional)

The first sequence in the alignment must be the reference sequence \ bash minipipeline.sh -i alignment.fasta --dontAlign

--invr : generate INVR report containing information about the mutations. (optional) \

find the output in results/invr.csv .

bash minipipeline.sh -i alignment.fasta -r ref.fasta --invr