A pipeline that automatically analyzes RNA-Seq data
RNA-Seq is a transcriptome profiling method that offers high efficiency, high sensitivity, and genome-wide analysis of any species without the need to design probes in advance. Many analysis tools have been developed for RNA-Seq data, covering data preprocessing, sequence alignment, transcriptome assembly, gene expression estimation, and non-coding RNA detection. However, these tools are largely independent of one another, and there is no reasonably complete system that integrates them to carry out most of the analysis.
Raser was created to fill this gap. It takes care of most of the work for you: installation-free software setup, parameter configuration, accurate management of multiple samples, complete per-sample logging, and some visualization tasks.
Raser requires the following software and data resources to be installed.
Note, if you can use our Docker images, then you'll have all the software pre-installed and can hit the ground running.
$ git clone --recursive git@github.com:clsteam/RASER.git
$ cd RASER
$ chmod 744 raser-manager
The --recursive parameter is needed to integrate the required submodules.
If you like, you can add Raser to your PATH so it is handy in later sessions (adding the line to ~/.bashrc makes the change permanent):
export PATH=$PATH:/PATH_TO_RASER/
$ pip3 install -r requirements.txt
Before running, make sure your run parameters are correct. See the configuration section for instructions on each parameter.
$ raser-manager ve -i ./config.ini
To submit the task to a PBS server, add the -s/--server parameter:
$ raser-manager ve -i ./config.ini -s
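As a rough illustration of what server submission involves, the sketch below turns the values of the `[Cluster]` section of config.ini (name, nodes, ppn, walltime) into standard PBS/Torque directives. The helper `render_pbs_script` and the template are hypothetical and are not taken from Raser's source; how Raser actually builds its job script may differ.

```python
from string import Template

# Hypothetical template, not part of Raser. It only shows how the
# [Cluster] values could map onto standard PBS/Torque directives.
PBS_TEMPLATE = Template(
    "#!/bin/bash\n"
    "#PBS -N $name\n"
    "#PBS -l nodes=$nodes:ppn=$ppn\n"
    "#PBS -l walltime=$walltime\n"
    "cd $$PBS_O_WORKDIR\n"  # "$$" renders a literal "$"
    "$command\n"
)


def render_pbs_script(cluster, command):
    """Return the text of a PBS job script that runs `command`."""
    return PBS_TEMPLATE.substitute(command=command, **cluster)


# Values copied from the [Cluster] example in config.ini.
cluster = {
    "name": "pop23",
    "nodes": "comput9",
    "ppn": "24",
    "walltime": "200:00:00",
}
script = render_pbs_script(cluster, "raser-manager ve -i ./config.ini")
print(script)
```

A script rendered this way would normally be handed to `qsub`; with -s/--server, Raser performs the submission for you.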
All parameters are organized into configuration files so that they are easier to manage. Pass --help to see the other available command-line parameters:
$ raser-manager ve -i ./config.ini --help
usage: raser-manager [-h] [-i INI] [-s] [-t]
[-l {spam,debug,verbose,info,notice,warning,success,error,critical}]
[-c] [-m]
{ve,pl}
positional arguments:
{ve,pl} ve: vertebrate, pl: plant
optional arguments:
-h, --help show this help message and exit
-i INI, --ini INI configuration file, default {RASER_HOME}/config.ini
-s, --server submit tasks to server compute nodes (PBS)
-t, --test for testing, only run a sample in the main process
-l {spam,debug,verbose,info,notice,warning,success,error,critical}, --level {spam,debug,verbose,info,notice,warning,success,error,critical}
logger level
-c, --comm output the complete command submitted by the task
-m, --sim simplified process
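For readers who want to script around the command line, the usage text above can be approximated with `argparse`. This is a reconstruction from the help output only, not Raser's actual parser; the positional name `kind` is an assumption, and no defaults are set because the help text does not state them (apart from the config.ini location).

```python
import argparse

# Log levels exactly as listed in the help output.
LEVELS = ["spam", "debug", "verbose", "info", "notice",
          "warning", "success", "error", "critical"]


def build_parser():
    """Build a parser that mirrors the documented raser-manager options."""
    parser = argparse.ArgumentParser(prog="raser-manager")
    # "kind" is a hypothetical name for the {ve,pl} positional argument.
    parser.add_argument("kind", choices=["ve", "pl"],
                        help="ve: vertebrate, pl: plant")
    parser.add_argument("-i", "--ini", help="configuration file")
    parser.add_argument("-s", "--server", action="store_true",
                        help="submit tasks to server compute nodes (PBS)")
    parser.add_argument("-t", "--test", action="store_true",
                        help="for testing, only run a sample in the main process")
    parser.add_argument("-l", "--level", choices=LEVELS, help="logger level")
    parser.add_argument("-c", "--comm", action="store_true",
                        help="output the complete command submitted by the task")
    parser.add_argument("-m", "--sim", action="store_true",
                        help="simplified process")
    return parser


args = build_parser().parse_args(["ve", "-i", "./config.ini", "-s"])
print(args.kind, args.ini, args.server)
```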
Example output layout:
.output_raser
├── allele
├── diff
├── assembly_gtf_list.txt
├── fusion
│ └── tophatfusion
│ └── bam
├── lnc_potential.gtf
├── lncrna.gtf
├── log
│ ├── pipe (the log file of each process or sample, named after the sample)
│ │ ├── SRR196226.e
│ │ ├── SRR196226.o
│ │ ├── SRR196227.e
│ │ ├── SRR196227.o
│ │ ├── SRR196228.e
│ │ ├── SRR196228.o
│ │ ├── SRR196229.e
│ │ ├── SRR196229.o
│ │ ├── SRR196230.e
│ │ ├── SRR196230.o
│ │ ├── SRR196231.e
│ │ └── SRR196231.o
│ ├── RASERCMD (record all commands run by Raser)
│ ├── raser-PRJNA142905.e2538136 (main process log file)
│ ├── raser-PRJNA142905.o2538136
│ ├── single.e (additional log file in other languages such as R)
│ └── single.o
├── merged.gtf
├── ncRNA_out
│ ├── cnci
│ │ ├── ambiguous_genes.gtf
│ │ ├── compare_2_infor.txt
│ │ ├── filter_out_noncoding.gtf
│ │ ├── novel_coding.gtf
│ │ └── novel_lincRNA.gtf
│ ├── CNCI.index
│ ├── CPC.txt
│ ├── lnc.fasta
│ ├── lncfinder.R
│ └── lnc_predict.statistics
├── origin.annotated.gtf
├── origin.loci
├── origin.merged.gtf.refmap
├── origin.merged.gtf.tmap
├── origin.stats
└── origin.tracking
.ERR315326
├── alter_splice_out
│ ├── prefix_1.combined.gtf
│ ├── prefix_1.loci
│ ├── prefix_1.redundant.gtf
│ ├── prefix_1.stats
│ ├── prefix_1.tracking
│ ├── prefix_2.as.nr
│ ├── prefix_2.as.summary
│ ├── ERR315326.xfpkm
│ └── total.as
├── ERR315326_1_clean.fq.gz
├── ERR315326_2_clean.fq.gz
├── ERR315326.bam
├── ERR315326.bam.bai
├── ERR315326.counts
├── ERR315326.counts.summary
├── ERR315326.lnc.counts
├── ERR315326.lnc.counts.summary
├── fastqc_out
│ ├── adapter.fa
│ ├── ERR315326_1_clean_fastqc.html
│ ├── ERR315326_1_fastqc.html
│ ├── ERR315326_2_clean_fastqc.html
│ └── ERR315326_2_fastqc.html
├── tophat_out
│ └── align_summary.txt
├── transcript_out
│ └── transcripts.gtf
├── tree.MD
└── variation_out
├── ERR315326.vcf.gz
├── ERR315326.vcf.gz.tbi
├── org.vcf.gz
└── org.vcf.gz.tbi
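Because each sample writes its own `.o` and `.e` files under `log/pipe`, a short script can point out which samples produced anything on standard error. The sketch below is not part of Raser; it assumes the layout shown above and builds a throwaway directory so that it runs on its own. A non-empty `.e` file does not necessarily mean the sample failed, since many tools print progress messages to standard error.

```python
import tempfile
from pathlib import Path


def samples_with_stderr(output_dir):
    """Return the names of samples whose log/pipe/<sample>.e file is non-empty."""
    pipe_dir = Path(output_dir) / "log" / "pipe"
    return sorted(p.stem for p in pipe_dir.glob("*.e") if p.stat().st_size > 0)


# Demo on a temporary directory that mimics output_raser/log/pipe.
with tempfile.TemporaryDirectory() as tmp:
    pipe = Path(tmp) / "log" / "pipe"
    pipe.mkdir(parents=True)
    (pipe / "SRR196226.e").write_text("")               # nothing on stderr
    (pipe / "SRR196226.o").write_text("done\n")
    (pipe / "SRR196227.e").write_text("some message\n")  # something on stderr
    (pipe / "SRR196227.o").write_text("done\n")
    flagged = samples_with_stderr(tmp)

print(flagged)
```

On a real run, pass the directory configured as `path` in the `[Root]` section instead of the temporary directory.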
Raser uses two configuration files: one is raser/setting.py, the other is config.ini.
config.ini (main configuration file) controls the workflow, adds samples, and sets tool parameters:
[Root]
;Required, Raser's output directory
path = /home/output_raser
[Cluster]
;Optional, parameters of the task submitted to the PBS server (task name, node, total number of threads, total time limit)
name = pop23
nodes = comput9
ppn = 24
walltime = 200:00:00
[Resource]
; Required, the number of processes to run
pools = 6
[Workflow]
; Required, select the analysis modules to run
differentialexpression = True
allele = True
altersplice = False
fusion = False
lncrna = False
[SampleDir]
; Required, sample name and directory
SRP028829=
/home/populus/SRP028829
/home/populus/SRP028830
SRP033639=
/home/populus/SRP033639
[SampleMessage]
; Species, sample sequencing information (phred, library_type)
;require, such as humo
species = populus
;optional, phred33 or phred64
phred =
;optional, fr-unstranded, fr-firststrand or fr-secondstrand
library_type =
[Treatment]
;Optional, sample phenotype
header_name = Run,Treatment
file = /home/populus/treat.csv
[Genome]
home_dir = /home/populus
;Required, genome file
genomefile = ${home_dir}/GCF_000495115.1_PopEup_1.0_genomic.fa
;Optional, genome reference annotation file
annotations = ${home_dir}/GCF_000495115.1_PopEup_1.0_genomic.gff
;Optional, index file (if the index has been established, Raser skips this step by default, which can greatly reduce the running time)
bowtie1_index = ${home_dir}/hg_bowtie1
bowtie2_index = ${home_dir}/GCF_000495115.1_PopEup_1.0_genomic
hisat2_index = ${home_dir}/GCF_000495115.1_PopEup_1.0_genomic_hisat2
star_index =
annotations_gtf =
hisat2_splicesites_txt =
bed = ${home_dir}/GCF_000495115.1_PopEup_1.0_genomic.bed
hdrs = ${home_dir}/GCF_000495115.1_PopEup_1.0_genomic.fa.hdrs
[Lncrna]
;Optional, lncRNA reference annotation and selection criteria
known_lncrna_gtf =
min_length = 200
min_cov = 0
min_fpkm = 0
[Fusion]
;Optional, STAR-Fusion configuration item
starfusion_genome_resource_lib = /home/tools/STAR-Fusion-extra-files/populus/ctat_genome_lib_build_dir
[Allele]
;optional,
; dbsnp, used to annotate snp while calling snp
dbsnp =
; list of sites to blacklist from phasing. The file we are providing contains all HLA genes.
hla_bed =
; list of sites to blacklist when generating allelic counts. These are sites that we have previously identified as having mapping bias, so excluding them will improve results.
haplo_count_bed =
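The `${home_dir}` references in `[Genome]` match the syntax of `ExtendedInterpolation` in Python's standard `configparser` module, so a file in this style can be inspected before a run with a few lines of Python. The sketch below uses a shortened copy of the example; whether Raser itself reads the file this way is an assumption, and so is the indentation of the continuation lines under `[SampleDir]`, which `configparser` needs in order to treat them as one multi-line value.

```python
import configparser

# Shortened copy of the example config.ini. Continuation lines are
# indented here (an assumption) so configparser joins them into one value.
SAMPLE_INI = """
[Resource]
pools = 6

[Workflow]
differentialexpression = True
fusion = False

[SampleDir]
SRP028829 =
    /home/populus/SRP028829
    /home/populus/SRP028830
SRP033639 =
    /home/populus/SRP033639

[Genome]
home_dir = /home/populus
genomefile = ${home_dir}/GCF_000495115.1_PopEup_1.0_genomic.fa
"""

parser = configparser.ConfigParser(
    interpolation=configparser.ExtendedInterpolation()
)
parser.optionxform = str  # keep keys such as SRP028829 case-sensitive
parser.read_string(SAMPLE_INI)

pools = parser.getint("Resource", "pools")
enabled = [name for name in parser["Workflow"]
           if parser.getboolean("Workflow", name)]
# One directory per line; split() would break paths that contain spaces.
sample_dirs = {name: value.split()
               for name, value in parser["SampleDir"].items()}
genome = parser.get("Genome", "genomefile")

print(pools, enabled, sample_dirs, genome)
```

To check a real file, replace `read_string(SAMPLE_INI)` with `parser.read("config.ini")`.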
SampleDir:
raser/setting.py selects the analysis tool used at each step:
# The tool is used as a guideline
# All strings must be lowercase
TOOLS_SELECTED = {
"qualitycontrol": "fastqc",
"trim": "trimmomatic",
"alignment": "tophat2", # tophat2, hisat2, star
"rmdup": "samtools", # samtools, picard
"genecount": "featurecounts", # htseq, featurecounts, star
"strandspecific": "", # rseqc
"transcript": "stringtie", # cufflinks, stringtie
"variation": "gatk", # samtools, gatk
"differentialexpression": "deseq2", # ballgown, deseq2, edger
"altersplice": "asprofile", # asprofile
"fusion": "tophatfusion", # tophatfusion, starfusion
"lncrna": "cc", # cc
"allele": "phaser", # phaser
}
# Minimum read length to keep
MINLEN = 50
# Default read-group platform (e.g. ILLUMINA, SOLID, LS454, HELICOS and PACBIO)
RGPL = "ILLUMINA"
# Whether to prefer GTF-format annotations in the pipeline, the default is False (GTF is more widely compatible, especially when STAR builds indexes)
PRIMARY_GTF_ANNOTATIONS = False
# If one end of paired-end data is of very poor quality, keep the high-quality end and analyze it as single-end data
WHETHER_PE_TO_SE = True
# Whether to supply the reference annotation during alignment, the default is True
WHETHER_ALIGNMENT_WITH_ANNOTATIONS = True
# Only mark PCR duplicates instead of removing them (only valid for picard), the default is True
WHETHER_MARK_DUPLICATES_ONLY = True
# Automatically detect strand specificity and use it, the default is False (running the alignment again takes a lot of time)
STRAND_SPECIFIC_USE_AUTOMATICALLY = False
# Force transcript assembly even if there is no control sample, the default is False
ENFORCE_ASSEMBLY = False
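Since every value in `TOOLS_SELECTED` must be lowercase and each step accepts only a few tools, a quick check before launching a long run can catch typos early. The sketch below is not part of Raser: `ALLOWED_TOOLS` and `check_tools` are hypothetical names, and the allowed values are copied from the comments in the example above rather than from Raser's source, so they may be incomplete.

```python
# Allowed values copied from the comments next to TOOLS_SELECTED.
# Steps that are not listed here are not checked.
ALLOWED_TOOLS = {
    "alignment": {"tophat2", "hisat2", "star"},
    "rmdup": {"samtools", "picard"},
    "genecount": {"htseq", "featurecounts", "star"},
    "transcript": {"cufflinks", "stringtie"},
    "variation": {"samtools", "gatk"},
    "differentialexpression": {"ballgown", "deseq2", "edger"},
    "fusion": {"tophatfusion", "starfusion"},
}


def check_tools(selected):
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    for step, tool in selected.items():
        if tool != tool.lower():
            problems.append(f"{step}: '{tool}' must be lowercase")
            continue
        allowed = ALLOWED_TOOLS.get(step)
        if allowed is not None and tool not in allowed:
            choices = ", ".join(sorted(allowed))
            problems.append(f"{step}: '{tool}' is not one of: {choices}")
    return problems


selection = {
    "alignment": "tophat2",
    "rmdup": "samtools",
    "genecount": "Featurecounts",  # wrong case on purpose
}
problems = check_tools(selection)
print(problems)
```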