dada2 container for CyVerse VICE

rbartelme commented 4 years ago

Use Case: Container for processing amplicon data from raw paired illumina fastq format to merged/clustered amplicon data

Based on Mike Lee's workflow.

[x] Build off quay.io dada2 R image, picking a static version
[x] add cutadapt installs for sequence primer trimming
[x] add installs for R library decontam
[x] add VICE injection script calls & iRODS setup
[x] local build test
[x] push to CyVerse VICE Dockerhub
[x] make VICE app
[x] test on VICE
[x] E-mail Lee Stanish about sequencing blanks in the database

rbartelme commented 4 years ago

added R libraries vegan, zetadiv, deblur to docker file as well.

rbartelme commented 4 years ago

dada2 biocontainers base image does not support apt-get, so I'll switch to using the rocker Rstudio tidyverse base image. And fall deep into the C dependencies rabbit hole.

rbartelme commented 4 years ago

Having issues with R package versioning vs. Rlang vs. container OS.

rbartelme commented 4 years ago

deblur and decontam are not available for R 3.6.1

rbartelme commented 4 years ago

Switching base container to rocker/verse:4.0.0-ubuntu18.04

rbartelme commented 4 years ago

Need to troubleshoot cutadapt installs, apt install -y cutadapt gives version 1.19 not version 2.1. Inside the Rstudio environment terminal version 1.9 does not respond to the --help flag properly, and throws an error.

I'll try installing pip3 and then using pip to install cutadapt...this should hopefully resolve the versioning and CLI.

I should also move the R installs above this in the Dockerfile to speed up the cached installs while experimenting with the ideal cutadapt install.

rbartelme commented 4 years ago

Help works with cutadapt, but is still deprecated version (without multithreading/compression).


cutadapt version 1.15
Copyright (C) 2010-2017 Marcel Martin <marcel.martin@scilifelab.se>

cutadapt removes adapter sequences from high-throughput sequencing reads.

Usage:
    cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq

For paired-end reads:
    cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq

Replace "ADAPTER" with the actual sequence of your 3' adapter. IUPAC wildcard
characters are supported. The reverse complement is *not* automatically
searched. All reads from input.fastq will be written to output.fastq with the
adapter sequence removed. Adapter matching is error-tolerant. Multiple adapter
sequences can be given (use further -a options), but only the best-matching
adapter will be removed.

Input may also be in FASTA format. Compressed input and output is supported and
auto-detected from the file name (.gz, .xz, .bz2). Use the file name '-' for
standard input/output. Without the -o option, output is sent to standard output.

Citation:

Marcel Martin. Cutadapt removes adapter sequences from high-throughput
sequencing reads. EMBnet.Journal, 17(1):10-12, May 2011.
http://dx.doi.org/10.14806/ej.17.1.200

Use "cutadapt --help" to see all command-line options.
See http://cutadapt.readthedocs.io/ for full documentation.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --debug               Print debugging information.
  -f FORMAT, --format=FORMAT
                        Input file format; can be either 'fasta', 'fastq' or
                        'sra-fastq'. Ignored when reading csfasta/qual files.
                        Default: auto-detect from file name extension.
  -j CORES, --cores=CORES
                        Number of CPU cores to use. Default: 1

  Finding adapters:
    Parameters -a, -g, -b specify adapters to be removed from each read
    (or from the first read in a pair if data is paired). If specified
    multiple times, only the best matching adapter is trimmed (but see the
    --times option). When the special notation 'file:FILE' is used,
    adapter sequences are read from the given FASTA file.

    -a ADAPTER, --adapter=ADAPTER
                        Sequence of an adapter ligated to the 3' end (paired
                        data: of the first read). The adapter and subsequent
                        bases are trimmed. If a '$' character is appended
                        ('anchoring'), the adapter is only found if it is a
                        suffix of the read.
    -g ADAPTER, --front=ADAPTER
                        Sequence of an adapter ligated to the 5' end (paired
                        data: of the first read). The adapter and any
                        preceding bases are trimmed. Partial matches at the 5'
                        end are allowed. If a '^' character is prepended
                        ('anchoring'), the adapter is only found if it is a
                        prefix of the read.
    -b ADAPTER, --anywhere=ADAPTER
                        Sequence of an adapter that may be ligated to the 5'
                        or 3' end (paired data: of the first read). Both types
                        of matches as described under -a und -g are allowed.
                        If the first base of the read is part of the match,
                        the behavior is as with -g, otherwise as with -a. This
                        option is mostly for rescuing failed library
                        preparations - do not use if you know which end your
                        adapter was ligated to!
    -e ERROR_RATE, --error-rate=ERROR_RATE
                        Maximum allowed error rate (no. of errors divided by
                        the length of the matching region). Default: 0.1
    --no-indels         Allow only mismatches in alignments. Default: allow
                        both mismatches and indels
    -n COUNT, --times=COUNT
                        Remove up to COUNT adapters from each read. Default: 1
    -O MINLENGTH, --overlap=MINLENGTH
                        Require MINLENGTH overlap between read and adapter for
                        an adapter to be found. Default: 3
    --match-read-wildcards
                        Interpret IUPAC wildcards in reads. Default: False
    -N, --no-match-adapter-wildcards
                        Do not interpret IUPAC wildcards in adapters.
    --no-trim           Match and redirect reads to output/untrimmed-output as
                        usual, but do not remove adapters.
    --mask-adapter      Mask adapters with 'N' characters instead of trimming
                        them.

  Additional read modifications:
    -u LENGTH, --cut=LENGTH
                        Remove bases from each read (first read only if
                        paired). If LENGTH is positive, remove bases from the
                        beginning. If LENGTH is negative, remove bases from
                        the end. Can be used twice if LENGTHs have different
                        signs. This is applied *before* adapter trimming.
    --nextseq-trim=3'CUTOFF
                        NextSeq-specific quality trimming (each read). Trims
                        also dark cycles appearing as high-quality G bases.
    -q [5'CUTOFF,]3'CUTOFF, --quality-cutoff=[5'CUTOFF,]3'CUTOFF
                        Trim low-quality bases from 5' and/or 3' ends of each
                        read before adapter removal. Applied to both reads if
                        data is paired. If one value is given, only the 3' end
                        is trimmed. If two comma-separated cutoffs are given,
                        the 5' end is trimmed with the first cutoff, the 3'
                        end with the second.
    --quality-base=QUALITY_BASE
                        Assume that quality values in FASTQ are encoded as
                        ascii(quality + QUALITY_BASE). This needs to be set to
                        64 for some old Illumina FASTQ files. Default: 33
    -l LENGTH, --length=LENGTH
                        Shorten reads to LENGTH. This and the following
                        modifications are applied after adapter trimming.
    --trim-n            Trim N's on ends of reads.
    --length-tag=TAG    Search for TAG followed by a decimal number in the
                        description field of the read. Replace the decimal
                        number with the correct length of the trimmed read.
                        For example, use --length-tag 'length=' to correct
                        fields like 'length=123'.
    --strip-suffix=STRIP_SUFFIX
                        Remove this suffix from read names if present. Can be
                        given multiple times.
    -x PREFIX, --prefix=PREFIX
                        Add this prefix to read names. Use {name} to insert
                        the name of the matching adapter.
    -y SUFFIX, --suffix=SUFFIX
                        Add this suffix to read names; can also include {name}

  Filtering of processed reads:
    Filters are applied after above read modifications. Paired-end reads
    are always discarded pairwise (see also --pair-filter).

    -m LENGTH, --minimum-length=LENGTH
                        Discard reads shorter than LENGTH. Default: 0
    -M LENGTH, --maximum-length=LENGTH
                        Discard reads longer than LENGTH. Default: no limit
    --max-n=COUNT       Discard reads with more than COUNT 'N' bases. If COUNT
                        is a number between 0 and 1, it is interpreted as a
                        fraction of the read length.
    --discard-trimmed, --discard
                        Discard reads that contain an adapter. Also use -O to
                        avoid discarding too many randomly matching reads!
    --discard-untrimmed, --trimmed-only
                        Discard reads that do not contain an adapter.

  Output:
    --quiet             Print only error messages.
    -o FILE, --output=FILE
                        Write trimmed reads to FILE. FASTQ or FASTA format is
                        chosen depending on input. The summary report is sent
                        to standard output. Use '{name}' in FILE to
                        demultiplex reads into multiple files. Default: write
                        to standard output
    --info-file=FILE    Write information about each read and its adapter
                        matches into FILE. See the documentation for the file
                        format.
    -r FILE, --rest-file=FILE
                        When the adapter matches in the middle of a read,
                        write the rest (after the adapter) to FILE.
    --wildcard-file=FILE
                        When the adapter has N wildcard bases, write adapter
                        bases matching wildcard positions to FILE. (Inaccurate
                        with indels.)
    --too-short-output=FILE
                        Write reads that are too short (according to length
                        specified by -m) to FILE. Default: discard reads
    --too-long-output=FILE
                        Write reads that are too long (according to length
                        specified by -M) to FILE. Default: discard reads
    --untrimmed-output=FILE
                        Write reads that do not contain any adapter to FILE.
                        Default: output to same file as trimmed reads

  Colorspace options:
    -c, --colorspace    Enable colorspace mode
    -d, --double-encode
                        Double-encode colors (map 0,1,2,3,4 to A,C,G,T,N).
    -t, --trim-primer   Trim primer base and the first color
    --strip-f3          Strip the _F3 suffix of read names
    --maq, --bwa        MAQ- and BWA-compatible colorspace output. This
                        enables -c, -d, -t, --strip-f3 and -y '/1'.
    -z, --zero-cap      Change negative quality values to zero. Enabled by
                        default in colorspace mode since many tools have
                        problems with negative qualities
    --no-zero-cap       Disable zero capping

  Paired-end options:
    The -A/-G/-B/-U options work like their -a/-b/-g/-u counterparts, but
    are applied to the second read in each pair.

    -A ADAPTER          3' adapter to be removed from second read in a pair.
    -G ADAPTER          5' adapter to be removed from second read in a pair.
    -B ADAPTER          5'/3 adapter to be removed from second read in a pair.
    -U LENGTH           Remove LENGTH bases from second read in a pair (see
                        --cut).
    -p FILE, --paired-output=FILE
                        Write second read in a pair to FILE.
    --pair-filter=(any|both)
                        Which of the reads in a paired-end read have to match
                        the filtering criterion in order for the pair to be
                        filtered. Default: any
    --interleaved       Read and write interleaved paired-end reads.
    --untrimmed-paired-output=FILE
                        Write second read in a pair to this FILE when no
                        adapter was found in the first read. Use this option
                        together with --untrimmed-output when trimming paired-
                        end reads. Default: output to same file as trimmed
                        reads
    --too-short-paired-output=FILE
                        Write second read in a pair to this file if pair is
                        too short. Use together with --too-short-output.
    --too-long-paired-output=FILE
                        Write second read in a pair to this file if pair is
                        too long. Use together with --too-long-output.```

genophenoenvo / docker

dada2 container for CyVerse VICE #37

Use Case: Container for processing amplicon data from raw paired illumina fastq format to merged/clustered amplicon data