clsteam / Micro-project


Raser

A pipeline that automatically analyzes RNA-Seq data

Introduction

RNA-Seq is a transcriptome research method that offers high efficiency, high sensitivity, and genome-wide analysis for any species, with no need to design probes in advance. A variety of analysis tools have been developed for RNA-Seq data, covering data preprocessing, sequence alignment, transcriptome assembly, gene expression estimation, and non-coding RNA detection. However, these tools mostly exist independently, and there is no relatively complete system that integrates them to carry out most of the analysis.

Raser was created to fill this gap. It takes care of installation-free setup for most of the software, parameter configuration, accurate management of multiple samples, complete log management for each sample, and some visualization tasks.

Installing Raser

Raser requires the following software and data resources to be installed.

Note: if you can use our Docker images, all the software comes pre-installed and you can get started right away.

1. Cloning from GitHub

    $ git clone --recursive git@github.com:clsteam/RASER.git
    $ cd RASER
    $ chmod 744 raser-manager

The --recursive option is needed to fetch the required submodules.

If you like, you can add Raser to your PATH so that it is easy to run later. Adding the following line to ~/.bashrc makes the change permanent:

    export PATH=$PATH:/PATH_TO_RASER/
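
To confirm that the PATH change took effect, a small check along these lines may help. This is only a sketch: it assumes the launcher is the raser-manager script from the clone step, and the helper name find_raser is made up for illustration.

```python
import shutil


def find_raser(executable="raser-manager"):
    """Return the full path of the launcher if it is on PATH, else None."""
    # shutil.which searches the directories listed in the PATH variable.
    return shutil.which(executable)


if __name__ == "__main__":
    location = find_raser()
    print(location or "raser-manager was not found on PATH")
```

Run it from a new shell so that the updated ~/.bashrc has been loaded.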

2. Tools Required

    $ pip3 install -r requirements.txt

Running Raser

Before running, please make sure that your run parameters are correct. See the Configuration section for instructions on setting them.
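
One way to check the parameters before a run is a short pre-flight script. The sketch below is not part of Raser; it only assumes that the options marked as required in the sample config.ini (see Configuration) must be non-empty. The names REQUIRED and missing_required are hypothetical.

```python
import configparser

# Options that the sample config.ini marks as required: (section, option).
REQUIRED = [
    ("Root", "path"),
    ("Resource", "pools"),
    ("SampleMessage", "species"),
    ("Genome", "genomefile"),
]


def missing_required(text):
    """Return the required options that are absent or empty in the INI text."""
    # Interpolation is switched off because only presence is checked here.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    missing = []
    for section, option in REQUIRED:
        # fallback also covers the case where the whole section is missing.
        value = parser.get(section, option, fallback="")
        if not value.strip():
            missing.append(f"[{section}] {option}")
    return missing
```

Calling missing_required with the contents of your config.ini returns an empty list when all of these options are filled in.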


Result

Configuration

* Raser keeps all software run parameters in configuration files, split into two parts: raser/setting.py and config.ini.

1. config.ini (main configuration file) is designed to control the process, add samples, and modify tool parameters
[Root]
;Required, Raser's output directory
path = /home/output_raser

[Cluster]
;Optional, the parameter items of the task submitted by the PBS server (task name, node, total number of threads, total time limit)
name = pop23
nodes = comput9
ppn = 24
walltime = 200:00:00

[Resource]
; Required, the number of processes to run
pools = 6

[Workflow]
; Required, select the analysis modules to run
differentialexpression = True
allele = True
altersplice = False
fusion = False
lncrna = False

[SampleDir]
; Required, sample name and directories
SRP028829=
    /home/populus/SRP028829
    /home/populus/SRP028830
SRP033639=
    /home/populus/SRP033639

[SampleMessage]
; Species, sample sequencing information (phred, library_type)
;require, such as humo
species = populus

;optional, phred33 or phred64
phred =
;optional, fr-unstranded, fr-firststrand or fr-secondstrand
library_type =

[Treatment]
;Optional, sample phenotype
header_name = Run,Treatment
file = /home/populus/treat.csv

[Genome]
home_dir = /home/populus
;Required, genome file
genomefile = ${home_dir}/GCF_000495115.1_PopEup_1.0_genomic.fa
;Optional, genome reference annotation file
annotations = ${home_dir}/GCF_000495115.1_PopEup_1.0_genomic.gff
;Optional, index file (if the index has been established, Raser skips this step by default, which can greatly reduce the running time)
bowtie1_index = ${home_dir}/hg_bowtie1
bowtie2_index = ${home_dir}/GCF_000495115.1_PopEup_1.0_genomic
hisat2_index = ${home_dir}/GCF_000495115.1_PopEup_1.0_genomic_hisat2
star_index =
annotations_gtf =
hisat2_splicesites_txt =
bed = ${home_dir}/GCF_000495115.1_PopEup_1.0_genomic.bed
hdrs = ${home_dir}/GCF_000495115.1_PopEup_1.0_genomic.fa.hdrs

[Lncrna]
;Optional, lncRNA reference annotation and selection criteria
known_lncrna_gtf =
min_length = 200
min_cov = 0
min_fpkm = 0

[Fusion]
;Optional, STAR-Fusion configuration item
starfusion_genome_resource_lib = /home/tools/STAR-Fusion-extra-files/populus/ctat_genome_lib_build_dir

[Allele]
;optional,
; dbsnp, used to annotate snp while calling snp
dbsnp =
; list of sites to blacklist from phasing. The file we are providing contains all HLA genes.
hla_bed =
; list of sites to blacklist when generating allelic counts. These are sites that we have previously identified as having mapping bias, so excluding them will improve results.
haplo_count_bed =
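
Two features of this file are easy to miss: the ${home_dir} references in [Genome] and the multi-line values in [SampleDir]. The sketch below shows how Python's standard configparser handles both. It is an illustration under the assumption that the file follows configparser's extended interpolation syntax, not a description of how Raser itself reads the file. The names SAMPLE, load_config and sample_dirs are hypothetical.

```python
import configparser

# A shortened copy of the sample config.ini above.
SAMPLE = """
[Workflow]
differentialexpression = True
fusion = False

[SampleDir]
SRP028829=
    /home/populus/SRP028829
    /home/populus/SRP028830
SRP033639=
    /home/populus/SRP033639

[Genome]
home_dir = /home/populus
genomefile = ${home_dir}/GCF_000495115.1_PopEup_1.0_genomic.fa
star_index =
"""


def load_config(text):
    """Parse INI text, resolving ${option} references within a section."""
    parser = configparser.ConfigParser(
        interpolation=configparser.ExtendedInterpolation()
    )
    # Keep option names case-sensitive so sample names are not lowercased.
    parser.optionxform = str
    parser.read_string(text)
    return parser


def sample_dirs(parser):
    """Map each sample name to its list of directories."""
    return {
        name: [line.strip() for line in value.splitlines() if line.strip()]
        for name, value in parser["SampleDir"].items()
    }
```

With this sketch, sample_dirs(load_config(SAMPLE)) maps SRP028829 to two directories and SRP033639 to one, and the genomefile value has ${home_dir} replaced by /home/populus. Options left empty, such as star_index, come back as empty strings.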

SampleDir:

2. raser/setting.py selects the analysis tools
# Tool selected for each step; the trailing comments list the available choices
# All strings must be lowercase
TOOLS_SELECTED = {
    "qualitycontrol": "fastqc",
    "trim": "trimmomatic",
    "alignment": "tophat2",  # tophat2, hisat2, star
    "rmdup": "samtools", # samtools, picard
    "genecount": "featurecounts",  # htseq, featurecounts, star
    "strandspecific": "",   # rseqc
    "transcript": "stringtie",  # cufflinks, stringtie
    "variation": "gatk",  # samtools, gatk
    "differentialexpression": "deseq2",  # ballgown, deseq2, edger
    "altersplice": "asprofile",  # asprofile
    "fusion": "tophatfusion",  # tophatfusion, starfusion
    "lncrna": "cc",  # cc
    "allele": "phaser",  # phaser
}
# Minimum read length to keep
MINLEN = 50
# Default read-group platform (e.g. ILLUMINA, SOLID, LS454, HELICOS and PACBIO)
RGPL = "ILLUMINA"
# Whether to prefer the GTF annotation throughout the process; default False (GTF is more widely compatible, especially when STAR builds its index)
PRIMARY_GTF_ANNOTATIONS = False
# Whether to keep the high-quality end for single-end analysis when one end of a paired-end sample is of very poor quality
WHETHER_PE_TO_SE = True
# Whether to supply the reference annotation during alignment; default True
WHETHER_ALIGNMENT_WITH_ANNOTATIONS = True
# Whether to only mark PCR duplicates instead of removing them (only valid for picard); default True
WHETHER_MARK_DUPLICATES_ONLY = True
# Whether to detect strand specificity automatically and use it; default False (enabling it takes much longer because the alignment is run again)
STRAND_SPECIFIC_USE_AUTOMATICALLY = False
# Whether to force transcript assembly even if there is no control sample; default False
ENFORCE_ASSEMBLY = False
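
Since a misspelled tool name in TOOLS_SELECTED is an easy mistake to make, a check like the following can catch it before a run. This is a sketch only: the ALLOWED_TOOLS table is reconstructed from the trailing comments in the listing above and may not match what Raser actually accepts, and check_tools is a hypothetical helper.

```python
# Choices per step, copied from the trailing comments in setting.py.
# Steps with a single documented choice are left out for brevity.
ALLOWED_TOOLS = {
    "alignment": {"tophat2", "hisat2", "star"},
    "rmdup": {"samtools", "picard"},
    "genecount": {"htseq", "featurecounts", "star"},
    "transcript": {"cufflinks", "stringtie"},
    "variation": {"samtools", "gatk"},
    "differentialexpression": {"ballgown", "deseq2", "edger"},
    "fusion": {"tophatfusion", "starfusion"},
}


def check_tools(selected):
    """Return a list of human-readable problems found in a tool selection."""
    problems = []
    for step, tool in selected.items():
        if tool != tool.lower():
            problems.append(f"{step}: '{tool}' must be lowercase")
            continue
        allowed = ALLOWED_TOOLS.get(step)
        # An empty string means no tool is selected for that step.
        if allowed is not None and tool and tool not in allowed:
            choices = ", ".join(sorted(allowed))
            problems.append(f"{step}: '{tool}' is not one of {choices}")
    return problems
```

For example, check_tools({"alignment": "hisat2"}) returns an empty list, while an unknown aligner or an uppercase name each produce one message.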