IRI-UW-Bioinformatics / flu-ngs

NGS pipeline for influenza virus libraries
MIT License

flu-ngs

Influenza virus next generation sequence analysis pipeline.

Highlights:

Requirements

snakemake

The 'pipeline' is really two snakemake workflows: preprocess.smk, which trims reads and runs quality control, and irma.smk, which runs IRMA and generates summary output.

Bioinformatics

These workflows call various other bioinformatics programs:

Versions are listed in workflow/envs/*.yaml.

Python

There are several python scripts in workflow/scripts which have a couple of dependencies.

Make a python virtual environment and install the requirements with:

pip install -r requirements.txt

Read more about what python virtual environments are, why they are useful, and how to set them up here.

You can use the same virtual environment for each analysis. If you already have one set up, activate it with:

source ~/.virtualenvs/flu-ngs-env/bin/activate

Running the workflow

Each time you have samples to run, I would suggest cloning this repository:

git clone git@github.com:IRI-UW-Bioinformatics/flu-ngs.git <name>

where <name> is the name of the directory that you want, then cd <name>.

Data

The next step is to put the read files in a structure expected by the workflow.

MiSeq

Put reads in a directory called raw with the following structure:

raw/
├── trimlog.fas
├── YK_2832
│   ├── YK_2832_1.fastq
│   └── YK_2832_2.fastq
├── YK_2833
│   ├── YK_2833_1.fastq
│   └── YK_2833_2.fastq
├── YK_2834
│   ├── YK_2834_1.fastq
│   └── YK_2834_2.fastq
...

It is fine if the fastq files are gzipped (i.e. have a .gz suffix).

Forward and reverse reads should be in {sample}_1.fastq and {sample}_2.fastq respectively.

trimlog.fas should contain the adapters.
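
Before running the workflow it can help to confirm that raw/ matches the layout above. The following is a minimal sketch, not part of the repository: find_miseq_pairs is a hypothetical helper name, and it only checks the file names described in this section (plain or gzipped fastq, plus trimlog.fas).

```python
from pathlib import Path


def find_miseq_pairs(raw_dir, samples):
    """Return ({sample: (forward, reverse)}, [problems]) for a MiSeq raw/ directory."""
    raw = Path(raw_dir)
    pairs, problems = {}, []

    # The adapter file is expected directly inside raw/.
    if not (raw / "trimlog.fas").is_file():
        problems.append("missing adapter file: trimlog.fas")

    for sample in samples:
        found = []
        for read in ("1", "2"):
            # Accept either plain or gzipped fastq files.
            candidates = [
                raw / sample / f"{sample}_{read}.fastq",
                raw / sample / f"{sample}_{read}.fastq.gz",
            ]
            existing = [path for path in candidates if path.is_file()]
            if existing:
                found.append(existing[0])
            else:
                problems.append(f"{sample}: missing {sample}_{read}.fastq[.gz]")
        if len(found) == 2:
            pairs[sample] = (found[0], found[1])

    return pairs, problems
```

Calling find_miseq_pairs("raw", ["YK_2832", "YK_2833"]) returns the read pairs that were found and a list of human-readable problems; an empty problem list means the layout looks as expected.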

MinION

Put reads in a directory called raw. They must be gzipped (end with fastq.gz).

raw/
├── barcode05
│   ├── FAW31148_pass_barcode05_485f6488_e81f1340_820.fastq.gz
│   ├── FAW31148_pass_barcode05_485f6488_e81f1340_821.fastq.gz
...
├── barcode06
│   ├── FAW31148_pass_barcode06_11f103b9_993a7465_30.fastq.gz
│   ├── FAW31148_pass_barcode06_11f103b9_993a7465_31.fastq.gz
...

The names of subdirectories (barcode05 and barcode06) define the names of samples. All .fastq.gz files in each subdirectory are assigned to that sample name.
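
The naming rule above (subdirectory name is the sample name, and every .fastq.gz file inside belongs to it) can be sketched in a few lines of Python. This is an illustration only; minion_samples is a hypothetical helper and is not how the workflow itself discovers files.

```python
from pathlib import Path


def minion_samples(raw_dir):
    """Map each subdirectory name in raw_dir to its sorted list of .fastq.gz files."""
    samples = {}
    for subdir in sorted(Path(raw_dir).iterdir()):
        if not subdir.is_dir():
            continue
        reads = sorted(subdir.glob("*.fastq.gz"))
        if reads:  # ignore directories that contain no gzipped reads
            samples[subdir.name] = reads
    return samples
```

The keys of the returned dictionary are the names you would list under "samples" in config.json.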

Configuration

Run parameters are passed to the workflow by a file called config.json that should have these keys:

MiSeq example:

{
  "platform": "miseq",
  "samples": [
    "YK_2837",
    "YK_2970"
  ],
  "pair": [
    "combined"
  ],
  "order": [
    "primary",
    "secondary"
  ],
  "errors": "warn"
}

MinION example:

{
  "platform": "minion",
  "samples": [
    "barcode05",
    "barcode06"
  ],
  "pair": [
    "longread"
  ],
  "order": [
    "primary"
  ],
  "errors": "warn"
}
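
Hand-written JSON is easy to get subtly wrong (a trailing comma is enough to make the file unparseable), so you may prefer to generate config.json. The sketch below is an assumption-laden illustration: make_config and write_config are hypothetical helpers, and the values they use are copied only from the two examples above. Other values may be valid but are not documented in this section.

```python
import json


def make_config(platform, samples):
    """Build a config dict that mirrors the MiSeq and MinION examples."""
    # Values below are taken verbatim from the two README examples.
    pair = {"miseq": ["combined"], "minion": ["longread"]}
    order = {"miseq": ["primary", "secondary"], "minion": ["primary"]}
    if platform not in pair:
        raise ValueError(f"unknown platform: {platform!r}")
    return {
        "platform": platform,
        "samples": sorted(samples),
        "pair": pair[platform],
        "order": order[platform],
        "errors": "warn",
    }


def write_config(config, path="config.json"):
    """Write the config as JSON; json.dump always produces valid JSON syntax."""
    with open(path, "w") as handle:
        json.dump(config, handle, indent=2)
        handle.write("\n")
```

For example, write_config(make_config("minion", ["barcode05", "barcode06"])) writes a file equivalent to the MinION example.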

Preprocessing

Reads are first preprocessed. For MiSeq data, adapters are trimmed and quality control reports are generated. For MinION data, reads are filtered by minimum and maximum length. See workflow/rules/preprocess-{minion,miseq}.smk for details.

snakemake -s workflow/preprocess.smk -c all

-c all tells snakemake to use all available cores; scale this back if need be. HTML QC reports for MiSeq data are saved in results/qc-raw and results/qc-trimmed.
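
To make the MinION length filter concrete, here is a minimal Python sketch of the idea. It is not the workflow's implementation (the rule files in workflow/rules define the actual tool and thresholds); filter_fastq_by_length, min_len and max_len are hypothetical names, and the code assumes standard four-line FASTQ records.

```python
import gzip


def filter_fastq_by_length(in_path, out_path, min_len, max_len):
    """Copy reads with min_len <= length <= max_len; return (kept, total)."""
    kept = total = 0
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        while True:
            # One FASTQ record is four lines: header, sequence, '+', qualities.
            record = [src.readline() for _ in range(4)]
            if not record[0]:
                break  # end of file
            total += 1
            if min_len <= len(record[1].rstrip("\n")) <= max_len:
                dst.writelines(record)
                kept += 1
    return kept, total
```

The returned counts give a quick sense of how many reads a given pair of thresholds would discard.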

Variant calling

Run IRMA and make summary reports of the output:

snakemake -s workflow/irma.smk -c all

Variant summary reports

Three summary files are generated:

Sequences

IRMA consensus sequences and amino acid translations are put in results/<order>/seq.

Splice variants

Splice variants of MP, PA and PB1 are all based on the assumption that IRMA finds canonical length consensus sequences for these segments (see here for more details).

If IRMA finds a consensus sequence for one of these segments that is not the expected length, then the behaviour is determined by the config file:

(For NS, we know to expect more variability in segment length, and the locations of exons are flexibly determined.)

Bonus: sorted BAM files

Most software for viewing read alignments requires sorted BAM files and/or BAM index files. I've written a small workflow that generates these for every BAM file IRMA produces. It requires samtools. The .sorted.bam and .sorted.bam.bai files are saved in the same directory as the original .bam files. Run:

snakemake -s workflow/sort-bam.smk -c all
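
For a single file, the equivalent manual steps are a samtools sort followed by a samtools index. The sketch below only builds the two command lines so the naming convention is visible; sort_and_index_commands is a hypothetical helper, and whether sort-bam.smk uses exactly these arguments is an assumption.

```python
from pathlib import Path


def sort_and_index_commands(bam_path):
    """Return the samtools commands that produce x.sorted.bam and x.sorted.bam.bai."""
    bam = Path(bam_path)
    # x.bam -> x.sorted.bam, kept alongside the original file.
    sorted_bam = bam.with_suffix(".sorted.bam")
    return [
        ["samtools", "sort", "-o", str(sorted_bam), str(bam)],
        # By default, samtools index writes <input>.bai next to its input.
        ["samtools", "index", str(sorted_bam)],
    ]
```

Each returned list can be passed to subprocess.run(cmd, check=True) if samtools is on your PATH.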