NAL-i5K / NAL_RNA_seq_annotation_pipeline


NAL RNA-Seq Annotation Pipeline (Under Development)

An RNA-Seq annotation pipeline based on SRA Toolkit, FastQC, Trimmomatic, HISAT2, BBMap, picard, GATK3, and samtools. It is distributed as a Python package.

The pipeline has two parts.

Rnannot part - performs sequence alignment and converts the output to bigwig format. This part should produce one bam file, one indexed bam file, one bigwig file, one bed file and one Source.txt file. If temporary files are kept (option -t), it also produces one unsorted bam file, one bam and one sam file generated from SINGLE layout SRA files, and one bam and one sam file generated from PAIRED layout SRA files.

Add_trackList part - adds the bam file, the bigwig file and the junction reads to the trackList.json file on the apollo-stage server. This part should be run on the apollo-stage server. It transfers the output files of rnannot to the apollo-stage and node1 servers and updates the trackList.json file.
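
The Rnannot outputs listed above can be checked before running the Add_trackList part. The following is a minimal sketch, not part of the package: the function name missing_outputs and the file suffixes (.bam, .bai, .bigwig or .bw, .bed) are assumptions, since the actual output file names are not listed here.

```python
from pathlib import Path

# Assumed suffixes for each output kind; the real file names may differ.
EXPECTED_SUFFIXES = {
    "bam": (".bam",),
    "indexed bam": (".bai",),
    "bigwig": (".bigwig", ".bw"),
    "bed": (".bed",),
}


def missing_outputs(outdir):
    """Return the expected output kinds that are not present in outdir."""
    names = [p.name for p in Path(outdir).iterdir() if p.is_file()]
    missing = [
        kind
        for kind, suffixes in EXPECTED_SUFFIXES.items()
        if not any(name.endswith(suffixes) for name in names)
    ]
    # Source.txt is the only output whose exact name is given in this README.
    if "Source.txt" not in names:
        missing.append("Source.txt")
    return missing
```

An empty list means every expected kind of file was found; with -t the folder will hold additional bam and sam files, which this check ignores.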

Prerequisite

For rnannot

For add_trackList, the following JBrowse processing scripts are needed

Installation by yourself

After setting up all of the prerequisites, run setup.py to install the package.

Uninstallation

Usage


download_sra_metadata.py [-t TAXID] [-o [OUTPUT]]

Use the pipeline to download the SRA metadata; the output file is used as the input file of RNAseq_annotate.py.

optional arguments:
  -t TAXID, --taxid TAXID    find all RNA SRA files for a given taxid
  -o [OUTPUT], --output [OUTPUT]    directory and name of the output folder; if not specified, use the current folder

download_sra_metadata_by_accessions.py [-a ACCESSION] [-o [OUTPUT]]

Download the SRA metadata of the specified accessions; the output file is used as the input file of RNAseq_annotate.py.
When processing multiple accessions, separate each accession and the output file directory with spaces.

optional arguments:
  -a ACCESSION, --accession ACCESSION    find the RNA SRA files for the given accessions
  -o [OUTPUT], --output [OUTPUT]    output directory and output file name
                                    if not specified, use current folder and file name "ACCESSION.tsv"
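
The hand-off from the metadata download to the annotation step could be scripted roughly as follows. This is a sketch under assumptions: it handles a single accession, it assumes the two scripts are on PATH after installation, and the helper names build_commands and run_pipeline are hypothetical. Only the command-line flags are taken from the usage text in this README.

```python
import subprocess


def build_commands(accession, genome, assembly, max_sra=10):
    """Build the argument lists for the download step and the annotation step."""
    # Name the metadata file after the accession, matching the documented default.
    metadata = f"{accession}.tsv"
    download = [
        "download_sra_metadata_by_accessions.py",
        "-a", accession,
        "-o", metadata,
    ]
    annotate = [
        "RNAseq_annotate.py",
        "-i", metadata,      # metadata file produced by the download step
        "-g", genome,        # fasta file to align against
        "-a", assembly,      # assembly name used for naming output
        "-m", str(max_sra),  # maximum number of SRA files to download
    ]
    return download, annotate


def run_pipeline(accession, genome, assembly):
    """Run both steps in order, stopping at the first failure."""
    for cmd in build_commands(accession, genome, assembly):
        subprocess.run(cmd, check=True)
```

Passing the commands as lists avoids shell quoting problems with file paths, and check=True keeps the annotation step from running if the download fails.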

RNAseq_annotate.py [-h] [-i INPUT] [-g GENOME] [-n [NAME]]
                               [-o [OUTDIR]] [-a ASSEMBLY] [-t]
                               [-m MAXIMUMSRA]

Easy to use pipeline built for large-scale RNA-seq mapping with a genome
assembly

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        A tsv file with a list of SRA runs' information.
  -g GENOME, --genome GENOME
                        A fasta file to align with.
  -n [NAME], --name [NAME]
                        name of the output folder, if not specified, use the
                        time of start
  -o [OUTDIR], --outdir [OUTDIR]
                        directory of output folder - it must already exist. If not specified, use
                        current folder
  -a ASSEMBLY, --assembly ASSEMBLY
                        The assembly name, used for naming the output files
  -t, --tempFile        
                        if specified, intermediate output bam files will be kept
  -m MAXIMUMSRA, --MaximumSRA MAXIMUMSRA
                        The maximum number of SRA files downloaded from NCBI. The default is 10
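
For readers who want to wrap or test the command line, the options above can be mirrored with argparse. This is a reconstruction from the help text, not the script's actual source: the function name build_parser is hypothetical, and the integer type for -m is an assumption.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented RNAseq_annotate.py options."""
    parser = argparse.ArgumentParser(
        prog="RNAseq_annotate.py",
        description="Easy to use pipeline built for large-scale RNA-seq "
                    "mapping with a genome assembly",
    )
    parser.add_argument("-i", "--input",
                        help="A tsv file with a list of SRA runs' information.")
    parser.add_argument("-g", "--genome", help="A fasta file to align with.")
    # nargs="?" matches the [NAME] and [OUTDIR] brackets in the usage line.
    parser.add_argument("-n", "--name", nargs="?",
                        help="name of the output folder")
    parser.add_argument("-o", "--outdir", nargs="?",
                        help="directory of output folder, which must already exist")
    parser.add_argument("-a", "--assembly",
                        help="assembly name used for naming output")
    parser.add_argument("-t", "--tempFile", action="store_true",
                        help="keep intermediate output bam files")
    # The default of 10 comes from the help text; type=int is an assumption.
    parser.add_argument("-m", "--MaximumSRA", type=int, default=10,
                        help="maximum number of SRA files downloaded from NCBI")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```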

add_trackList.py [-h] [-a INPUT_ACCOUNT] [-p INPUT_PATH]
                        [-bam INPUT_BAM] [-bigwig INPUT_BIGWIG]
                        [-bai INPUT_BAI] [-bed INPUT_BED] [-track INPUT_TRACK]
                        [-s SOURCE]

optional arguments:
  -h, --help            show this help message and exit
  -a INPUT_ACCOUNT, --input_account INPUT_ACCOUNT
                        scinet account e.g user@login.scinet.science
  -p INPUT_PATH, --input_path INPUT_PATH
                        path of RNA_annotation output files on Scinet
  -bam INPUT_BAM, --input_bam INPUT_BAM
                        bam file name
  -bigwig INPUT_BIGWIG, --input_bigwig INPUT_BIGWIG
                        bigwig file name
  -bai INPUT_BAI, --input_bai INPUT_BAI
                        indexed bam file name
  -bed INPUT_BED, --input_bed INPUT_BED
                        bed file name
  -track INPUT_TRACK, --input_track INPUT_TRACK
                        trackList.json file path
  -s SOURCE, --Source SOURCE
                        Source.txt file name
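
To illustrate what updating trackList.json involves, here is a minimal sketch that appends a BAM track and a BigWig track to a JBrowse 1 style configuration. It is not the logic of add_trackList.py: the function name add_tracks, the track labels and the URL arguments are hypothetical, and junction reads and file transfer are left out. The storeClass and type values are standard JBrowse 1 class names.

```python
import json


def add_tracks(track_list_path, label, bam_url, bigwig_url):
    """Append BAM and BigWig track entries to a trackList.json file."""
    with open(track_list_path) as fh:
        config = json.load(fh)

    tracks = config.setdefault("tracks", [])
    existing = {track.get("label") for track in tracks}

    new_tracks = [
        {
            "label": f"{label}_bam",  # hypothetical label scheme
            "storeClass": "JBrowse/Store/SeqFeature/BAM",
            "type": "JBrowse/View/Track/Alignments2",
            "urlTemplate": bam_url,
        },
        {
            "label": f"{label}_bigwig",  # hypothetical label scheme
            "storeClass": "JBrowse/Store/SeqFeature/BigWig",
            "type": "JBrowse/View/Track/Wiggle/XYPlot",
            "urlTemplate": bigwig_url,
        },
    ]
    # Skip labels that are already present so repeated runs do not duplicate tracks.
    for track in new_tracks:
        if track["label"] not in existing:
            tracks.append(track)

    with open(track_list_path, "w") as fh:
        json.dump(config, fh, indent=2)
    return config
```

Checking labels before appending keeps the update idempotent, which matters when the same data set is processed more than once.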

move_data.py [-h] [-Node1a NODE1_ACCOUNT] [-s SOURCE]

optional arguments:
  -h, --help            show this help message and exit
  -Node1a NODE1_ACCOUNT, --node1_account NODE1_ACCOUNT
                        apollo-node1 account, e.g. user@apollo-node1.nal.usda.gov
  -s SOURCE, --Source SOURCE
                        Source.txt file path

Example

Rnannot

Add_trackList

Run on Docker container

We provide a Docker image that includes all of the prerequisites and has everything installed. It also contains the whole repo, so you don't need to clone the repo if you use this container. The working directory of the image is set to /opt/output, and all of the repo files are in /opt/RNA_repo.

To get this docker image, you can:

  1. Build the image from the Dockerfile: clone this repo and run sudo docker build -t [your_image_tag_name] .

    or

  2. Pull the image from Docker Hub: run sudo docker pull k2025242322/i5k_rna_seq_annotation_pipeline:latest

Docker commands:

Run on Ceres (by conda virtual environment)

1. Setup conda env

2. Git clone RNA_annotation_pipeline into working directory

3. Activate conda env

4. Install pysam

5. Run setup.py

6. Use bash script to submit job

Run on Ceres (by singularity container)

You can find more information about singularity here: https://scinet.usda.gov/guide/singularity

1. Pull docker image from docker hub

2. Edit bash script

Notes

Tests

Test environment

Test parser