RNAseq-pipeline

This pipeline requires several R packages. You can install them by running:

qsub pipeline_R_setup.sh

Make a directory to deposit subfolders and processed files: $DIR.

Often, $DIR/FASTQ contains the fastq files for this experiment but the fastq files don't necessarily have to be in this location.

In $DIR, make text file called samples.manifest which has to be a manifest file just like the ones used in Rail-RNA and Myrna. The exception is that the paths have to be local as no URLs are supported. It has the following structure:

(for a set of unpaired input reads) <FASTQ FILE>(tab)<optional MD5>(tab)<sample label>
(for a set of paired input reads) <FASTQ FILE 1>(tab)<optional MD5 1>(tab)<FASTQ FILE 2>(tab)<optional MD5 2>(tab)<sample label>

The following extensions are supported: fastq.gz, fq.gz, fastq, fq.

For example, a samples.manifest file can look like this (these are paired-end samples):

/dcl01/lieber/ajaffe/Nina/GSK_PhaseII/data/Sample_R10126_C1BP4ACXX/R10126_C1BP4ACXX_GAGATTCC_L005_R1_001.fastq.gz 0   /dcl01/lieber/ajaffe/Nina/GSK_PhaseII/data/Sample_R10126_C1BP4ACXX/R10126_C1BP4ACXX_GAGATTCC_L005_R2_001.fastq.gz   0   sample1
/dcl01/lieber/ajaffe/Nina/GSK_PhaseII/data/Sample_R10126_C1BP4ACXX/R10126_C1BP4ACXX_GAGATTCC_L006_R1_001.fastq.gz 0   /dcl01/lieber/ajaffe/Nina/GSK_PhaseII/data/Sample_R10126_C1BP4ACXX/R10126_C1BP4ACXX_GAGATTCC_L006_R2_001.fastq.gz   0   sample2

If you reads are split in multiple files and you want to merge them, simply repeat a sample id. The merged files will be saved at in a directory called merged_fastq.

/dcl01/lieber/ajaffe/Nina/GSK_PhaseII/data/Sample_R10126_C1BP4ACXX/R10126_C1BP4ACXX_GAGATTCC_L005_R2_001.fastq.gz 0   /dcl01/lieber/ajaffe/Nina/GSK_PhaseII/data/Sample_R10126_C1BP4ACXX/R10126_C1BP4ACXX_GAGATTCC_L005_R2_001.fastq.gz   0   sample1
/dcl01/lieber/ajaffe/Nina/GSK_PhaseII/data/Sample_R10126_C1BP4ACXX/R10126_C1BP4ACXX_GAGATTCC_L006_R2_001.fastq.gz 0   /dcl01/lieber/ajaffe/Nina/GSK_PhaseII/data/Sample_R10126_C1BP4ACXX/R10126_C1BP4ACXX_GAGATTCC_L006_R2_001.fastq.gz   0   sample1

Run pipeline with following call in $DIR:
```
## You need a compute node to run this script!
qrsh
## Change the path if you cloned this repo elsewhere!
bash /dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/sh/rnaseq-run-all.sh --help
```
Note that you have to use bash and not sh, otherwise you'll get an error like.
1. experiment: main identifier, experiment name.
2. prefix: additional identifier (could be date or reference genome used).
3. reference: supported genomes are hg19, hg38, mm10, rn6
4. stranded: FALSE by default. The other valid options are forward and reverse.
5. ercc: FALSE by default and will be turned to TRUE if specified. Specify if ERCC mix 1 was added.
6. cores: defaults to 8. Specifies how many cores to use per job for the jobs that are parallelized.
7. large: TRUE if you want to use double the default memory settings. Useful for large projects (many samples and/or many reads). FALSE by default and doesn't need to be specified.
8. fullcov: TRUE if you want to create the fullCoverage object. Set to FALSE by default. Note that the fullCoverage object is not needed (by default) since we create the normalized mean BigWig already and can use it to define the ERs with derfinder::findRegions(), then use the resulting GRanges object and the paths to the BigWig files in derfinder::getRegionCoverage(fullCov = NULL, files = bigWigs, regions = outputFrom_findRegions) or alternatively write the regions to a BED file with rtracklayer, create the counts with bwtool and then read them into R manually (similar to what we did in recount-website).
9. bashfolder: defaults to /dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/sh. It's the directory where the shell files are located at. You only need to specify it if you cloned this repository somewhere else.
10. annofolder: defaults to /dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/Annotation. We currently don't have scripts for you to automatically reproduce that directory, but you can copy it to another location if you want to have more control on it and change some files.
Hidden run files will be created with all calls needed to run the pipeline. Each step is submitted to SGE cluster and queued to run sequentially. Steps that can be parallelized are submitted as array jobs. If you with to manually run a given step, check the default options by using the --help option. For example:
```
## Change the path if you cloned this repo elsewhere!
bash /dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/sh/step3-hisat2.sh --help
```
Completion emails: by default you will only get an email if a job failed. If you want to get completion emails add the empty file ${DIR}/.send_emails. You can create it with:
```
touch ${DIR}/.send_emails
```
Cluster queue: by default the shared queue is used, but if you want to specify one create the hidden file ${DIR}/.queue with the name of the queue inside of it (no spaces, no new lines). For example:
```
bluejay
```
Do not save shared into ${DIR}/.queue. Otherwise your jobs won't run.

For reproducibility purposes, the information about the version of the pipeline and contents of ANNO_FOLDER will be stored in logs/pipeline_information.txt.

First time user

Note that you will need the jaffelab package installed among other packages. Most of them can be installed automatically by scripts in the pipeline, but jaffelab requires you to create the GITHUB_PAT environment variable following the instructions by devtools::install_github().

Read and alignment metrics output

FastQC Flags (pass/warn/fail)

Flag	Description
FQCbasicStats	Composite statistic: overall pass/warn/fail
perBaseQual	Quality values across all bases at each position in the FastQ file
perTileQual	Quality scores from each tile across all bases to see if there was loss in quality associated with only one part of the flowcell
perSeqQual	Report if a subset of sequences have universally low quality values
perBaseContent	G, A, T, and C should have roughly equal proportions at each base position
GCcontent	GC content across the length of each sequence
Ncontent	Percentage of base calls at each position for which an N was called
SeqLengthDist	Distribution of fragment sizes (should all be uniform length)
SeqDuplication	Degree of duplication for every sequence in the library
OverrepSeqs	Identical sequences that appear at a higher rate than expected
AdapterContent	Proportion of your library which has adapter sequences at each position
KmerContent	Measures the number of each kmer at each position in your library and then looks for significant deviations from an even coverage at all positions

FastQC metrics (There will be two sets of metrics for each sample if library is paired-end.)

Metric	Description
SeqLength	Length of the shortest and longest sequence in the set
percentGC	Overall %GC of all bases in all sequences
phredXX-XX	Median Phred score at read positions XX (taken at different positions depending on sequence length)
phredGT30	Proportion of reads above Phred score of 30
phredGT35	Proportion of reads above Phred score of 35
AdapterXX-XX	Adapter content for Illumina standard adapter at positions XX (taken at different positions depending on sequence length)

HISAT2 metrics

Metric	Description
numReads	Total number of reads in fastq file
numMapped	Number of reads that were aligned by HISAT2 to the reference genome
numUnmapped	Number of reads that did not align to the reference genome
overallMapRate	Proportion of total reads mapped (numMapped / numReads)
concordMapRate	If library is paired-end, percentage of read pairs that mapped concordantly

LieberInstitute / RNAseq-pipeline

readme

RNAseq-pipeline

First time user

Read and alignment metrics output