PatrickRWright / CopraRNA

Target prediction for prokaryotic trans-acting small RNAs
MIT License

CopraRNA

Phylogenetic target prediction for prokaryotic trans-acting small RNAs

CopraRNA is a tool for sRNA target prediction. It computes whole genome target predictions by combining distinct whole genome IntaRNA predictions. As input, CopraRNA requires at least 3 homologous sRNA sequences from 3 distinct organisms in FASTA format. Furthermore, each organism's genome has to be part of the NCBI Reference Sequence (RefSeq) database (i.e. it should have exactly this NZ_* or this NC_XXXXXX format, where * stands for any character and X stands for a digit between 0 and 9). Depending on sequence length (target and sRNA), number of input organisms, and genome sizes, CopraRNA can take 24 h or longer to compute. In most cases it is significantly faster. It is suggested to run CopraRNA on a machine with at least 8 GB of memory.
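The input requirements above can be checked before starting a long run. The following is a minimal shell sketch and not part of CopraRNA: the function names `count_fasta_records`, `has_enough_homologs` and `is_refseq_id` are hypothetical, and the patterns only transcribe the description given here (NZ_ followed by arbitrary characters, or NC_ followed by six digits). It does not verify that the sequences come from distinct organisms.

```shell
# Count the sequences in a FASTA file (one '>' header line per sequence).
count_fasta_records() {
    grep -c '^>' "$1"
}

# Succeed if the FASTA file holds at least 3 sequences.
has_enough_homologs() {
    [ "$(count_fasta_records "$1")" -ge 3 ]
}

# Succeed if an ID looks like NZ_<anything> or NC_<six digits>.
# Assumption: no version suffix or other decoration is attached to the ID.
is_refseq_id() {
    printf '%s\n' "$1" | grep -Eq '^(NZ_.+|NC_[0-9]{6})$'
}
```

For example, `has_enough_homologs sRNAs.fa && echo "input looks usable"` prints the message only if the file contains three or more FASTA records.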

CopraRNA produces a lot of file I/O. It is suggested to run CopraRNA in a dedicated empty directory to avoid unexpected behavior.
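One way to follow this advice is to create a fresh directory per run and copy only the input FASTA file into it. The sketch below is an illustration under stated assumptions: `prepare_run_dir` and the `coprarna_run.XXXXXX` directory name are hypothetical and not provided by CopraRNA.

```shell
# Create a fresh run directory that contains only the input FASTA file.
# Usage: prepare_run_dir <input.fa> [parent_dir]
# Prints the path of the new directory on success.
prepare_run_dir() {
    input_fasta="$1"
    parent_dir="${2:-.}"
    # mktemp -d guarantees a new directory that did not exist before
    run_dir=$(mktemp -d "$parent_dir/coprarna_run.XXXXXX") || return 1
    cp "$input_fasta" "$run_dir/" || return 1
    printf '%s\n' "$run_dir"
}
```

A run could then be started with `cd "$(prepare_run_dir sRNAs.fa)"` followed by the CopraRNA call, so that all intermediate files end up in that directory.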

For testing or ad hoc use of CopraRNA, you can use its web interface at the Freiburg RNA tools CopraRNA webserver.

Citation

If you use CopraRNA, please cite our articles



Documentation

Overview

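The following topics are covered by this documentation:

  • Installation
    • Dependencies
    • CopraRNA via conda (bioconda channel)
    • Usage via biocontainer (docker)
    • Cloning Source code from github (or downloading ZIP-file)
  • Usage and parameters
  • Update CopraRNA available organisms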


Installation

In order to use CopraRNA you can either install it directly via conda or clone this github repository and install the dependencies individually. It is also possible to run CopraRNA via a provided Docker container.



Dependencies

The following setup was successfully used to build and run CopraRNA via conda:

name: CopraRNA-2.1.3

channels:
  - conda-forge
  - bioconda
  - defaults
  - r
  - conda

dependencies:
    - blast-legacy 
    - bzip2
    - clustalo
    - coreutils 
    - domclust
    - embassy-phylip
    - emboss 
    - gawk
    - grep
    - intarna >2.2
    - mafft 
    - perl <6
    - perl-bioperl 
    - perl-bio-eutilities
    - perl-getopt-long
    - perl-list-moreutils 
    - perl-parallel-forkmanager
    - phantomjs
    - python
    - r-base <4
    - r-pheatmap
    - r-robustrankaggreg
    - r-seqinr
    - sed
    - suds-jurko

The following package versions were tested and functional during development of CopraRNA2.

  • bzip2 1.0.6 (for the core genome archive) // conda install bzip2

  • gawk 4.1.3 // conda install gawk

  • sed 4.2.2.165-6e76-dirty // conda install sed

  • grep 2.14 // conda install grep

  • GNU coreutils 8.25 // conda install coreutils

  • IntaRNA 2.1.0 // conda install intarna

  • EMBOSS package 6.5.7 - distmat (creates distance matrix from MSA) // conda install emboss

  • embassy-phylip 3.69.650 - fneighbor (creates tree from dist matrix) // conda install embassy-phylip

  • ncbiblast-2.2.22 // conda install blast-legacy

  • DomClust 1.2.8a // conda install domclust

  • MAFFT 7.310 // conda install mafft

  • clustalo 1.2.3 // conda install clustalo

  • phantomjs 2.1.1-0 // conda install phantomjs

  • Perl (5.22.0) Module(s): // conda install perl

    • List::MoreUtils 0.413 // conda install perl-list-moreutils
    • Parallel::ForkManager 1.17 // conda install perl-parallel-forkmanager
    • Getopt::Long 2.45 // conda install perl-getopt-long
    • Bio::SeqIO (bioperl 1.6.924) // conda install perl-bioperl
    • Bio::DB::EUtilities 1.75 // conda install perl-bio-eutilities
    • Cwd 3.56 // included in the conda perl installation
  • R statistics 3.2.2 // conda install r-base==3.2.2

    • seqinr 3.1_3 // conda install r-seqinr
    • robustrankaggreg 1.1 // conda install r-robustrankaggreg
    • pheatmap 1.0.8 // conda install r-pheatmap
  • python // conda install python

    • sys // available from conda python
    • logging // available from conda python
    • traceback // available from conda python
    • suds.metrics (suds-jurko 0.6) // conda install suds-jurko
    • suds (suds-jurko 0.6) // conda install suds-jurko
    • suds.client (suds-jurko 0.6) // conda install suds-jurko
    • datetime // available from conda python



CopraRNA via conda (bioconda channel)

The easiest way to install CopraRNA locally is via conda using the bioconda channel (Linux only). This way, you will install CopraRNA along with all dependencies. See the bioconda installation instructions for detailed information. We recommend installing into a dedicated environment to avoid conflicts with other installed tools. The following two commands install CopraRNA into the environment coprarnaenv and activate it:

conda create -n coprarnaenv -c bioconda -c conda-forge coprarna
source activate coprarnaenv



Usage via biocontainer (docker)

CopraRNA can be retrieved and used as a docker container that ships with all dependencies. Once you have docker installed, simply type the following, replacing the version tag with the one you want to use:

       docker run -i -t quay.io/biocontainers/coprarna:2.1.0--0 /bin/bash



Cloning Source code from github (or downloading ZIP-file)

git clone https://github.com/PatrickRWright/CopraRNA

If you installed all dependencies you should be able to directly use the source.



Usage and parameters

Example call:

CopraRNA2.pl -srnaseq sRNAs.fa -ntup 200 -ntdown 100 -region 5utr -enrich 200 -topcount 200 -cores 4

The following options are available:

  • --help : help
  • --srnaseq : FASTA file with small RNA sequences (def:input_sRNA.fa)
  • --region : region to scan in whole genome target prediction (def:5utr)
    • '5utr' for start codon
    • '3utr' for stop codon
    • 'cds' for entire transcript
  • --ntup : number of nucleotides upstream of '--region' to parse for targeting (def:200)
  • --ntdown : number of nucleotides downstream of '--region' to parse for targeting (def:100)
  • --cores : number of cores to use for parallel computation (def:1)
  • --rcsize : minimum fraction of putative target homologs that need to be available for a target cluster to be considered in the CopraRNA1 part (see --cop1) of the prediction (def:0.5)
  • --winsize : IntaRNA target (--tAccW) window size parameter (def:150)
  • --maxbpdist : IntaRNA target (--tAccL) maximum base pair distance parameter (def:100)
  • --cop1 : switch for CopraRNA1 prediction (def:off)
  • --cons : controls consensus prediction (def:0)
    • '0' for off
    • '1' for organism of interest based consensus
    • '2' for overall consensus based prediction
  • --verbose : switch to print verbose output to terminal during computation (def:off)
  • --websrv : switch to provide webserver output files (def:off)
  • --noclean : switch to prevent removal of temporary files (def:off)
  • --enrich : if set, DAVID-WS functional enrichment is calculated for the given number of top predictions (def:off)
  • --nooi : if set then the CopraRNA2 prediction mode is set not to focus on the organism of interest (def:off)
  • --ooifilt : post processing filter for organism of interest p-value 0=off (def:0)
  • --root : specifies root function to apply to the weights (def:1)
  • --topcount : specifies the number of top predictions to return and use for the extended regions plots (def:200)
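Because a run can take many hours, it may help to validate a few option values before launching. The following shell sketch is a hypothetical helper (`build_coprarna_call` is not part of CopraRNA) that only prints a command line and executes nothing. It uses the double-dash option spellings and the documented defaults from the list above, and checks `--region` against the three documented values.

```shell
# Print a CopraRNA2.pl command line after basic sanity checks.
# Usage: build_coprarna_call <sRNAs.fa> [region] [cores]
# Nothing is executed; the caller decides whether to run the printed command.
build_coprarna_call() {
    srna_fasta="$1"
    region="${2:-5utr}"   # documented default
    cores="${3:-1}"       # documented default
    case "$region" in
        5utr|3utr|cds) ;;
        *) echo "unknown region: $region" >&2; return 1 ;;
    esac
    # Reject empty, zero, and non-digit values for the core count.
    case "$cores" in
        ''|0|*[!0-9]*) echo "cores must be a whole number >= 1" >&2; return 1 ;;
    esac
    printf 'CopraRNA2.pl --srnaseq %s --region %s --ntup 200 --ntdown 100 --cores %s\n' \
        "$srna_fasta" "$region" "$cores"
}
```

Calling `build_coprarna_call sRNAs.fa cds 4` prints a call with `--region cds --cores 4`, while an unsupported region such as `intron` is rejected with a non-zero exit status.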



Update CopraRNA available organisms

In the update_kegg2refseq directory, create a new run directory

mkdir run

and change into this directory

cd run 

Here you can execute build_kegg2refseq.pl

../build_kegg2refseq.pl 

which will download prokaryotes.txt from NCBI and process it into the files CopraRNA_available_organisms.txt and kegg2refseqnew.csv. These two files must then be copied into coprarna_aux, where they replace their older versions.
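The final copy step can be guarded so that an incomplete update does not overwrite working files. The sketch below is an assumption-laden illustration: `install_updated_tables` is a hypothetical helper, and the paths to the run directory and to coprarna_aux are passed in by the caller because their location depends on the local checkout.

```shell
# Copy the two freshly built tables from the run directory into coprarna_aux.
# Usage: install_updated_tables <run_dir> <coprarna_aux_dir>
# Refuses to copy anything unless both files exist and are non-empty.
install_updated_tables() {
    run_dir="$1"
    aux_dir="$2"
    for f in CopraRNA_available_organisms.txt kegg2refseqnew.csv; do
        if [ ! -s "$run_dir/$f" ]; then
            echo "missing or empty: $run_dir/$f" >&2
            return 1
        fi
    done
    cp "$run_dir/CopraRNA_available_organisms.txt" \
       "$run_dir/kegg2refseqnew.csv" "$aux_dir/"
}
```

From inside the run directory this could be called as `install_updated_tables . ../../coprarna_aux`, assuming coprarna_aux sits two levels up in your checkout.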