MikeAxtell / ShortStack

ShortStack: Comprehensive annotation and quantification of small RNA genes
MIT License
88 stars 29 forks source link

ShortStack

Alignment of small RNA-seq data and annotation of small RNA-producing genes

install with bioconda Conda

Author

Michael J. Axtell, Penn State University, mja18@psu.edu

Table of Contents

Citations

If you use ShortStack in support of your work, please cite one or more of the following:

Installation

You can either use the conda package manager to install from the bioconda channel, or manually set up an environment. Use of conda/bioconda is highly recommended!

Install using conda (recommended)

First, install conda, and then set it up to use the bioconda channel following the instructions at https://bioconda.github.io

Then, follow instructions below based on your system to install the dependencies to a new environment and activate the environment.

Linux or Intel-Mac

conda create --name ShortStack4 shortstack 
conda activate ShortStack4

Silicon-Mac

Some dependencies have not been compiled for the newer Silicon-based Macs on bioconda, so we need to force conda to install the osx-64 (Intel) versions instead. Silicon Macs can run Intel code using built-in Rosetta translation.

conda create --name ShortStack4
conda activate ShortStack4
conda config --env --set subdir osx-64
conda install shortstack

Manual installation

Create an environment that contains the following packages / tools compiled and installed:

Then, download the ShortStack script from this github repo. Make it executable chmod +x ShortStack and then copy it into your environment's PATH.

Usage

ShortStack [-h] [--version] --genomefile GENOMEFILE [--known_miRNAs KNOWN_MIRNAS]
    (--readfile [READFILE ...] | --bamfile [BAMFILE ...]) [--outdir OUTDIR] [--adapter ADAPTER | --autotrim]
    [--autotrim_key AUTOTRIM_KEY] [--threads THREADS] [--mmap {u,f,r}] [--align_only] [--show_secondaries]
    [--dicermin DICERMIN] [--dicermax DICERMAX] [--locifile LOCIFILE | --locus LOCUS] [--nohp] [--dn_mirna]
    [--strand_cutoff STRAND_CUTOFF] [--mincov MINCOV] [--pad PAD] [--no_bigwigs]

Required

Recommended

Other options

Resources

Memory

Peak memory (RSS) primarily scales with genome size, and especially with the size of the largest chromosomes in the genome assembly. Generally between 4-10GB memory per thread is more than enough, but very large genomes should be monitored for memory overruns and requested memory adjusted accordingly. Because of the multi-threading, it is important to scale available memory with the number of threads being used.

Disk

During the alignment phase ShortStack will potentially write many large, but temporary, files to disk. These can easily exceed 100GB disk space especially when there are a lot of multi-mapping reads. Ensure that the location of --outdir has ample free disk space (plan on 200GB minimum to be safe).

CPU and running time

All compute-intensive parts of ShortStack are now multi-threaded. Providing more threads via --threads generally will decrease run-times in near-linear fashion. Be sure to scale memory with thread use though .. 4-10GB RAM per thread, depending on genome size, seems usually sufficient.

Read alignment is often the most time-consuming portion of the analysis. MIRNA identification (triggered by --known_miRNAs and/or --dn_mirna) is also time-consuming. Larger genomes generally run slower compared to smaller genomes. Highly fragmented genome assemblies (e.g. very high numbers of chromosomes/scaffolds) can be particularly slow because of the index lookup costs associated with thousands of entries. Consider obtaining and using better genome assemblies, or removing very short scaffolds from highly fragmented genome assemblies.

Testing and Examples

Gather Test Data

Reference Genome

For testing we will use the Arabidopsis thaliana TAIR10 genome assembly, including plastid and mitochondrial genomes. The Axtell Lab website is serving a copy of this that is easy to obtain:

curl https://plantsmallrnagenes.science.psu.edu/ath-b10/Arabidopsis_thalianaTAIR10.fa -ko Arabidopsis_thalianaTAIR10.fa

Raw sRNA-seq Reads

We can obtain some example A. thaliana sRNA-seq data from SRA using the SRA toolkit. If you don't already have the sra toolkit installed you can obtain it at https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit

After installing the sra toolkit you should configure it .. see https://github.com/ncbi/sra-tools/wiki/03.-Quick-Toolkit-Configuration

tip I like to configure sra tools to prefetch and download to the working directory. That way I can remember to get rid of the pre-fetched directories when I have dumped the FASTQ.

Then you can use the standard prefetch and fasterq-dump method to retrieve the FASTQ files.

prefetch SRR3222443 SRR3222444 SRR3222445
fasterq-dump SRR3222443 SRR3222444 SRR3222445

You will now have 3 .fastq files of raw (untrimmed) sRNA-seq reads. These data are derived from Col-0 Arabidopsis thaliana immature inflorescence tissues (see Wang et al. 2017 https://doi.org/10.1111/tpj.13463)

Alternative using SRA run browser

Sometimes sra-tools is a pain to use. Alternatively, you can use the web interface at https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&display=metadata to grab the example data, searching by SRR accession number, and downloading FASTQ.

Known miRNAs

To get a list of known miRNAs, we will use all miRBase annotated mature miRNAs from miRBase. First, download the mature.fa file from miRBase at https://mirbase.org/download/. Then filter it to get only the ath ones (e.g. the ones from A. thaliana).

grep -A 1 '>ath' mature.fa | grep -v '\-\-' > ath_known_miRNAs.fasta

Example Run

This example is a full run. It takes 3 raw (untrimmed) readfiles, identifies the adapters, trims the reads, indexes the genome, aligns the reads, discovers small RNA loci, and annotates high-confidence MIRNA loci. The example uses 6 threads; this can be adjusted up or down depending on your system's configuration; more threads decrease execution time but the response is non-linear (diminishing returns with very high thread numbers). The examples uses the test data described above.

ShortStack --genomefile Arabidopsis_thalianaTAIR10.fa --readfile SRR3222443.fastq SRR3222444.fastq SRR3222445.fastq --autotrim --threads 6 --outdir ExampleShortStackRun --known_miRNAs ath_known_miRNAs.fasta

On my laptop this completes in about 18 minutes. All results are in the directory specified by --outdir, "ExampleShortStackRun". The outputs are described in the section below called "Outputs".

Outputs

Results.txt

A tab-delimited text file giving key information for all small RNA clusters. Columns:

  1. Locus: Coordinates of the locus in Chrom:Start-Stop format, one-based, inclusive.
  2. Name: Name for the Locus. De novo loci are named like Cluster_1, etc. Loci provided by the user with option --locifile preserve the user-given names.
  3. Chrom: Name of the chromosome
  4. Start: One-based start position (left-most) of the locus, inclusive.
  5. End: One-based end position (right-most) of the locus, inclusive.
  6. Length: Length of the locus (base-pairs)
  7. Reads: Number of aligned sRNA-seq reads that overlap this locus.
  8. DistinctSequences: Number of distinct sRNA sequences that overlap this locus. A single DistinctSequence can have one or more reads.
  9. FracTop: Fraction of Reads aligned to the top genomic strand.
  10. Strand: Inferred strandednes of the locus, inferred from FracTop and the --strand_cutoff setting.
  11. MajorRNA: Sequence of the single most abundant RNA sequence at the locus.
  12. MajorRNAReads: Number of reads for the MajorRNA aligned to this locus.
  13. Short: Number of reads with a size less than --dicermin aligned to the locus. By default these are reads < 21 nucleotides in length.
  14. Long: Number of reads with a size greater than --dicermax aligned to the locus. By default these are reads > 24 nucleotides in length.
  15. 21: Number of 21 nucleotide reads aligned to the locus.
  16. 22: Number of 22 nucleotide reads aligned to the locus.
  17. 23: Number of 23 nucleotide reads aligned to the locus.
  18. 24: Number of 24 nucleotide reads aligned to the locus.
  19. DicerCall: If >= 80% of all aligned reads are within the boundaries of --dicermin and --dicermax, than the DicerCall gives the size of most abundant small RNA size. If < 80% of the aligned reads are in the --dicermin and --dicermax boundaries, DicerCall is set to 'N'. Loci with a DicerCall of 'N' are unlikely to be small RNAs related to the Dicer-Like/Argonaute system of gene regulation.
  20. MIRNA: Did the locus pass all criteria to be called a MIRNA locus? If so, 'Y'. If not, 'N'.
  21. Known_miRNAs: Semicolon delimited list of user-provided known RNAs that aligned to the locus. If none, 'NA'.

Counts.txt

A tab-delimited text file giving the raw alignment counts for each locus in each separate sample. Only produced if there was more than one sRNA-seq file used to create alignments. This file is useful for downstream analyses, especially differential expression analysis.

alignment_details.tsv

A tab-delimited text file that gives details about small RNA-seq alignments as a function of mapping type, read length, and sample. This can be useful for plotting purposes and subsequent quality control of small RNA-seq data. It is only produced if alignments are performed.

Results.gff3

Small RNA loci in the gff3 format. Suitable for use on genome browsers. For loci that are annotated as MIRNAs there will be an additional entry for the mature microRNA position. The 'score' column in the gff3 format stores the number of sRNA-seq aligned reads at that locus.

known_miRNAs.gff3

When the user provides known RNA sequences via --known_miRNAs, they are aligned to the reference genome. Every (perfect) alignment to the reference is stored and reported in the known_miRNAs.gff3 file. The score column shows the number of alignments that start and end at the exact coordinates and strand.

Important : known_miRNAs are aligned and shown in the known_miRNAs.gff3 file regardless of whether any empirical small RNA-seq data are found. Thus, expect to find entries with a score of 0; these are cases where no instances of the given knownRNA were aligned to that location in the genome.

strucVis/

The directory strucVis/ contains visualizations of each locus that was annotated as a MIRNA locus. These are made by the script strucVis. For each locus there is a postscript file and a plain-text file. Both show the coverage of aligned small RNA-seq data as a function of position, aligned with the predicted RNA secondary structure of the inferred MIRNA hairpin precursor. These files are meant for manual inspection of MIRNA loci.

mir.fasta

This is a FASTA formatted file containing hairpin, mature miRNA, and miRNA sequences derived from ShortStack's identification of MIRNA loci. These are genomic sequences, and the genomic coordinates are noted in the FASTA header. ShortStack's determination of mature miRNA vs. miRNA strands is based on abundance of alignments at that particular locus. These designations may not always be accurate for an entire MIRNA family .. sometimes one paralog can attract most of the true mature miRNA alignments, leaving the other paralogs with mostly true miRNA* alignments. Take care when performing annotations.

.fastq(.gz) file(s)

If raw reads were trimmed by ShortStack, the trimmed fastq files will be found. The names will have a lower-cased 't' appended to the front to signify "trimmed". If the input files were .gz compressed, then the trimmed files will be too.

.bam file(s)

If the ShortStack run was aligning reads, one or more .bam files will be found. The bam format stores large scale alignment data. Corresponding bam index files (.bam.csi) will also be found.

.bw (bigwig) files

If the ShortStack run was aligning reads, multiple .bw files in the bigwig format will be produced. These bigwig files are produced by the program ShortTracks and are useful for visualizing data on genome browsers (in particular JBrowse2). The scales of the bigwig files are normalized to reads per million; thus, multiple tracks can be compared.

Types of bigwig files produced by ShortStack:

The README for ShortTracks has details about how to load these data onto JBrowse2 using "multi-wiggle" tracks for nice visualization of sRNA-seq data.

Visualizing Results

Loci annotated as MIRNA can be visualized from the srucVis/ files. These show the predicted RNA secondary structures with the small RNA-seq read depth coverage.

Genome Browsers

The output of ShortStack is designed to work with genome browsers. Specifically, the files Results.gff3, known_miRNAs.gff3, the .bam files, and the .bw files can be directly visualized on either major genome browser (IGV, JBrowse).

JBrowse2 has the ability to create "multi-wiggle" tracks. These tracks show multiple quantitative data tracks at once, bound to a common quantitative axis. The .bw bigwig files created by ShortStack & ShortTracks are normalized to reads-per-million, allowing direct comparisons in a multi-wiggle track. This allows visualization of size, coverage, and strandedness of the data. See the README for ShortTracks for details. I recommend using the Desktop version of JBrowse2.

Overview of Methods

Read trimming

Read trimming using --autotrim assumes that the input FASTQ data comes from forward strand sequence reads where the first letter of the read corresponds to the first nucleotide of the sense-strand small RNA. Read trimming seeks to first identify the 3' adapter sequence used in the FASTQ file, and then to use the tool cutadapt to remove the adapters. Using the recommended --autotrim setting, the sequence of the 3' adapter is automatically detected. This is accomplished by looking for reads that begin with an --autotrim_key, which by default is the miR166 sequence 'TCGGACCAGGCTTCATTCCCC'. The adapter is inferred by looking at the sequences that follow the --autotrim_key. The cutadapt settings used for read trimming are cutadapt -a [adapter] -o [tInput.fastq] -j [--threads] --discard-untrimmed -m 12 --report minimal [Input.fastq]. These settings discard untrimmed reads (where no adapter was found), and only retained reads that, after trimming, are at least 12 nucleotides long.

Indexing

Two types of genome indices are required. The first is an .fai index, created by samtools faidx. The second is a bowtie index of the genome file, which comprises several files ending with .ebwtl. If either index is missing, ShortStack will attempt to create it. The bowtie index is created using the bowtie-build executable from bowtie using default settings and the number of threads given by the --threads argument. Any index files created by ShortStack are written to the same location as the --genomefile. Therefore, the --genomefile should be located in a writable directory if the required indices are not already built at the start of the run.

Alignments

Alignment of trimmed fastq data uses bowtie. There are usually two phases to alignment. In the first phase reads are mapped to the genome using bowtie settings bowtie -p [--threads] -v 1 -k 50 -S --best --strata -x [--genomefile] [trimmedFASTQ]. This allows zero or one mismatch hits, keeping only the zeroes if both zero and one-hit cases exist. All hits up to a maximum of 50 are stored for multi-mapping reads. In the second phase a single location for each of the multimapping reads is decided upon. This decision is made by one of the three possible methods set by --mmap : u, f, or r. Setting u uses a local weighting scheme set by only the uniquely mapping alignments in an area. Setting f uses a local weighting scheme that includes both the unique and multi-mapped possibilities in an area. Setting r just randomly takes one of the possible location as the reported one. These methods are fully described in Johnson et al., 2016. Each individual FASTQ file is processed and converted to a sorted BAM file. If more than one FASTQ file was input, a single sorted merged file, merged_alignments.bam is created.

Cluster discovery

Genomic intervals where the depth of small RNA coverage, in reads-per-million, is greater than or equal to --mincov (1 by default) are identified. Each of these intervals are then extended in both directions by the length given by setting --pad. Regions that overlap after extension are merged. Note that if one or more MIRNA loci are later found to overlap the intial cluster, the initial cluster is removed from the output (and only the refined, trimmed MIRNA region(s) are reported).

MIRNA annotation

MIRNA annotation has two entry points for initial searches: Locations of aligned user-provided sequences from --known_miRNAs and, if option --dn_mirna is True, any 21 or 22 nt read whose abundance exceeds the depth of --mincov. From these initial starting points, ShortStack first examines the local region to find miR/miR-star-like patterns of read accumulation (essentially, "two-peaks" of read coverage on the same genomic strand that might correspond to the miR/miR-star pair). If such a pattern is found, the RNA secondary structure in the local area is predicted. The sRNA-seq alignments in conjunction with the predicted RNA secondary structure are analyzed with respect to the criteria in Axtell and Meyers, 2018. If the criteria are met, the locus is annotated as a MIRNA.

How to go FAST

ShortStack version 4 Major Changes

ShortStack version 4 is a major update. The major changes are:

Major Changes

Option changes:

Issues

Please post issues, comments, bug reports, questions, etc. to the project github page at https://github.com/MikeAxtell/ShortStack.

FAQ