DormanLab / AmpliCI

AmpliCI, a model-based algorithm for denoising Illumina amplicon data.
BSD 3-Clause "New" or "Revised" License
20 stars 7 forks source link

AmpliCI

AmpliCI, Amplicon Clustering Inference, denoises Illumina amplicon data by approximate model-based clustering.

AmpliCI v2.0 now incorporates our new Unique Molecular Identifier (UMI)-aware software DAUMI to denoise UMI-tagged Illumina amplicon sequences. This README focuses on installation of AmpliCI and how to use it to estimate haplotypes and abundance for amplicons without UMIs. See the DAUMI instructions for information on how to estimate haplotypes and abundance for amplicons with UMIs. The development version of DAUMI is at this page.

Table of Contents

  1. Prerequisites
  2. Installation
  3. Preparing input
    1. Demultiplexing
    2. Quality control and read processing
    3. Input files
  4. Usage
  5. Output
  6. Downstream analysis
  7. Troubleshooting
  8. Detailed options
  9. C library
  10. Acknowledgements
  11. Citation
  12. Contact

Prerequisites

Installation

AmpliCI has been tested under Linux and MacOS.

  1. Clone the repository.
    git clone https://github.com/DormanLab/AmpliCI.git
  2. Configure the project.
    cd AmpliCI/src
    cmake .
  3. Compile AmpliCI. The executable is called run_AmpliCI. It will appear in the src directory you are currently in.
    make

Preparing input

The input of AmpliCI is a FASTQ file, but there is some necessary preprocessing.

Demultiplexing

Like all other denoising methods, the starting point of the analysis is FASTQ sequence data after demultiplexing. If you start with separate barcode and read FASTQ files, you can use the qiime script split_libraries_fastq.py for demultiplexing. Use the option --store_demultiplexed_fastq to keep demultiplexed fastq files.

Quality control and read preprocessing

AmpliCI requires all input reads have the same length, with no ambiguous nucleotides (only A, C, G, T base calls allowed). One way to truncate or filter reads with ambiguous nucleotides is via the R package ShortRead or simply use seqkit with following command,

seqkit grep -srv -p 'N' reads1.fastq > reads1_noN.fastq

Input files

AmpliCI takes a single demultiplexed FASTQ file (one per sample) generated from the Illumina sequencing platform, with reads trimmed to the same length and containing no ambiguous nucleotides (see above steps). If you have paired end data, AmpliCI can analyze the forward reads, the reverse reads, or the merged reads, but not both forward and reverse reads simultaneously.

You can find example input FASTQ files in the test directory.

One read in the input FASTQ file should fit in exactly four lines, as in the format below.

@SRR2990088.351
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCCTTTTAAGTCAGCGGTGAAAGTCTGTGGCTCAACCATAGAATTGCCGTTGAAACTGGGAGGCTTGAGTATGTTTGAGGCAGGCGGAATGCGTGGTGTAGCGGTGAAATGCGTAGATATCAAGCAGAACACCGATTGCGAAGGCAGCTTGCTAAGCCATGACTGACGCTGATGCACGAAAGCGTGGGGATGAAACA
+
CCCCCGGGG8CFCFGGEGGGGGGGGGB@FFEEFFGFCFFFGGGGGGGEFGGG9@@F@FF9EFFG<EEGD@EFFGGGG,ECBCEFGCAFEFEEF<E?FEFFG<F@FFFGGG9FG@FGGG8DEGGGD,A=4,AEDF+F3BCCEEE7DFCGEEDEFEGFEGEGE<@@F>*:?BB7@;,>,5,*;CC:,4C957*:AB5<=DF6:>/*5*121/(/*500.<<;52(444+164-83::>:B021;91-(.<6).

First line: @sequence name

Second line: DNA sequence (A, T, C, G)

Third line: +any content on a single line

Fourth line: quality score sequence (ASCII [!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ])

If your read or quality scores are split over multiple lines, AmpliCI will not work. One possible script for fixing your FASTQ-formatted files is given by Damian Kao on BioStars. You can also use seqkit with following command,

seqkit seq reads_1.fq -w 0

Usage

AmpliCI runs in two major steps:

  1. Use AmpliCI to estimate the error profile directly from the data (the executable is called run_AmpliCI):
    ./run_AmpliCI --fastq <input_fastq_file> --outfile <output_error_profile_file> --error

    An example (from the src directory):

    ./run_AmpliCI --fastq ../test/sim3.8.1.fastq --outfile ../test/error.out  --error
  2. Use Amplici to estimate the haplotypes and their abundance using the estimated error profile:

    ./run_AmpliCI --fastq <input_fastq_file> --outfile <output_base_filename> --abundance 2 --profile <input_error_profile_file>

    An example (from the src directory):

    ./run_AmpliCI --fastq ../test/sim3.8.1.fastq --outfile ../test/test --abundance 2 --profile ../test/error.out

    If you provide no input error profile with the --profile option, AmpliCI will assume the error rates are the error rates dictated by Phred quality scores. Assuming Phred quality scores is not a good idea. Using Phred quality scores tends to generate high numbers of false positives and runs very slowly.

    • You can also use AmpliCI to reassign reads given input haplotypes. You can provide your haplotype set with '--haplotypes' option:
      ./run_AmpliCI --fastq <input_fastq_file> --outfile <output_assignment_filename> --profile <input_error_profile_file> --haplotypes <input_haplotypes_fasta_file>

      An example (from the src directory):

      ./run_AmpliCI --fastq ../test/sim3.8.1.fastq --outfile ../test/test.id --profile ../test/error.out --haplotypes ../test/test.fa
    • Detailed help can be obtained with:
      ./run_AmpliCI --help
    • If you apply AmpliCI on longer reads with length > 300 (like merged reads), you may want to decrease the default Lower bound for screening reads during cluster assignment with --log_likelihood [DEFAULT: -100.000000]. For example, you can set the lower bound at -200.
      ./run_AmpliCI --fastq ../test/sim3.8.1.fastq --outfile ../test/test.id --profile ../test/error.out --haplotypes ../test/test.fa --log_likelihood -200

Output Files

Estimating error profile.

When run to estimate the error profile, AmpliCI will output an error profile <output_error_profile_file> in text format. This is simply a list of comma-separated probabilities (times 1000) of the probability that haplotype nucleotide n is misread as read nucleotide m with quality score q. They are ordered as (n,m,q), with the last index varying the fastest. Both haplotype nucleotide n and read nucleotide m are in the order (A,C,T,G) , and q has the range from 0 to 40 (41 in total). For example, the first 41 entries are estimated transition probabilities for A->A when observed quality score q is in [0:40]; Then the 42nd - 82nd entries are estimated transition probabilities for A->C; the 165th - 205th entries are estimated transition probabilities for C->A.... In our recommended workflow the error profile should only be used with the dataset from which it was estimated. If you apply AmpliCI estimates to other datasets or use other estimates with AmpliCI, you should consider the following:

Estimating haplotypes.

When run to estimate haplotypes and their abundances with argument --outfile <output_base_filename> or --outfile <fasta_output_file> <information_output_file>, there will be two output files:

1.output_base_filename.fa or fasta_output_file

FASTA-formatted file (will be used in the downstream analysis) containing denoised sequences (or haplotypes). For each sequence, we also provide size (scaled true abundance), DiagP (diagnostic probability), ee (mean expected number of errors in reads), useful for chimera detection and post hoc filtering. For example for the first haplotype, the FASTA header might look like:

>H0;size=516.000;DiagP=0.00e+00;ee=0.405;

2.output_base_filename.out or information_output_file

A text file with the following information provided as key: value pairs, one per line. The keys are:

When run with option --haplotypes to reassign reads to the user-provided haplotype set (a FASTA-formatted file), AmpliCI will output a read assignment file <output_assignment_filename> in text format. The keys are

Downstream Analysis

The output FASTA file contains denoised raw haplotype sequences, which may include chimeric sequences. The first step of any downstream analysis should be to remove chimeric sequences.

Chimera Detection

The AmpliCI-outputted FASTA file is in acceptable format to input into the uchime3_denovo method implemented in usearch.

Haplotype sorting by abundance

./usearch -sortbysize <input_fasta_file> -fastaout <output_sorted_fasta_file>

Chimera detection

./usearch -uchime3_denovo <input_sorted_fasta_file> -uchimeout <uchime_outfile> -chimeras <chimera_fasta_outfile> -nonchimeras <nonchimera_fasta_outfile>

You may also use other chimera detection algorithms to remove chimeras.

Generate Amplicon Sequence Variant (ASV or sOTU) Table

We have provided an R script to help to generate the ASV (sOTU) table, where scaled true abundances (see size) per sample per ASVs/sOTUs are reported.

Taxa Assignment

You may want to identify the non-chimeric haplotypes detected in your sample. There are multiple methods.

  1. DECIPHER contains IDTAXA, a novel approach for taxonomic classification.

  2. RDP classifier, a Naive Bayesian Classifier.

Futher Analysis

mothur, qiime2, LefSE, phyloseq, ....

Troubleshooting

The algorithm may stop if your:

  • quality scores are not in the typical range for Illumina datasets [33,73]

  • reads contain ambiguous nucleotides

  • reads are not in the right FASTQ input format, for example reads and quality scores cannot contain newline characters

  • there are too few reads or reads are so noisy that there are no sequences observed more than the lower bound number of times (option --abundance, default 2.0)

Detailed options

Main options:

  • --fastq The fastq input file. [REQUIRED]

  • --outfile Output file(s) for haplotype discovery, estimated error profile (when used with --error), or cluster assignments (when used with --haplotypes). [REQUIRED]

  • --profile The input error profile. If none, convert quality score to Phred error probability. [DEFAULT: none]

  • --error Estimate the error profile. [Used in error estimation only]

  • --haplotypes FASTA file with haplotypes. [Used in reads assignment only]

Options for sensitivity:

  • --abundance Lower bound for scaled true abundance during haplotype reconstruction (should be >= 2.0). [DEFAULT: 2.0]

  • --contaminants Baseline count abundance of contaminating or noise sequences. [DEFAULT: 1]

  • --indel Indel sequencing error rate. Cannot also use options --insertion or --deletion. [DEFAULT: 0.00006]

  • --diagnostic Threshold of diagnostic probability in the diagnostic/contamination test. [DEFAULT: 0.001 / number_candidates]

Other important options:

  • --align Align all reads to haplotypes (slow). [DEFAULT: none]

  • --log_likelihood Lower bound for screening reads during cluster assignment. This is the minimum log assignment likelihood, $\ln \pi_k + \ln \Pr(r_i|h_k)$. [DEFAULT: -100.000000]

  • --nJC69 Disable JC69 model. By default, AmpliCI assume all sequences are generated from an ancestral sequence, which slightly increases the sensitivity for detecting closed haplotypes. [Use the option when biological sequences are unrelated]

C library

AmpliCI provides both a shared and a static C library for users to call function amplici_wfile() to cluster amplicon sequences from another program. The library libamplici.* (libamplici.so for Linux or libamplici.dylib for MacOS, and libamplici.a) will appear in the src directory when you compile AmpliCI.

Input

  • fastq_file: The fastq input file. [REQUIRED]

  • error_profile_name: The input error profile. If NULL, convert quality score to Phred error probability.

  • low_bound: Allowed lowest abundance. See the description of option --abundance. [REQUIRED]

Output

  • seeds: Estimated haplotypes.

  • seeds_length: Lengths of Estimated haplotypes.

  • cluster_id: See the description of assignments above for outfile output_base_filename.out. Note amplici_wfile() does not filter reads with maximal conditional log likelihood under the given threshold. Instead, it assigns all reads to its closest haplotypes with the maximum likelihood.

  • cluster sizes: Number of reads assigned to each haplotype.

  • K: Number of estimated haplotypes.

  • sample_size: Number of reads in the fastq input file

  • max_read_length: Maximum read length l. The kth (in [0,1,2,...K-1]) haplotype starts at seeds[k*l].

  • abun: See the description of scaled true abun above for outfile output_base_filename.out.

  • ll: See the description of reads ll above for outfile output_base_filename.out.

An example to call function amplici_wfile() is provided in example_wfile.c. You can compile the source file with the C library libamplici.a (in the src directory):

gcc -o myprog example_wfile.c -lamplici -lRmath -lm -I ./src/ -L ./src/

Use -I to provide path to header file of the library libamplici.h and -L to provide path to the library libamplici.a. You may need to add additional path to Rmath library and header files if needed. Note example_wfile.c needs two more header files in the src directory, which are not required by the library libamplici.a.

If you use the shared library (libamplici.so or libamplici.dylib), you need to add the PATH to the shared library before running your executable file.

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/your/full/path/to/library

Acknowledgments

Citation

Contact

If you have any problems with AmpliCI, please contact:

Xiyu Peng (pansypeng124@gmail.com)

Karin Dorman (kdorman@iastate.edu)