Fully Automated and Standardized iCLIP (FAST-iCLIP) is a fully automated tool to process iCLIP data. Please cite the following paper:
Zarnegar B, Flynn RA, Shen Y, Do BT, Chang HY, Khavari PA. Ultraefficient irCLIP pipeline for characterization of protein-RNA interactions. Nature Methods (2016).
This package contains two main sets of tools: an executable called `fasticlip` to run iCLIP on human and mouse data, and several (possibly deprecated) iPython notebooks to process iCLIP data from viral genomes. The following README will focus mainly on `fasticlip`. The pdf in the repository contains further instructions for using the iPython notebooks.
fasticlip [-h] -i INPUT [INPUT ...] [--trimmed] [--GRCh38 | --GRCm38] -n NAME -o OUTPUT [-f N] [-a ADAPTER] [-tr REPEAT_THRESHOLD_RULE] [-tn NONREPEAT_THRESHOLD_RULE] [-tv EXOVIRUS_THRESHOLD_RULE] [-bm BOWTIE_MAPQ] [-q Q] [-p P] [-l L] [-c C] [--verbose]
Example: fasticlip -i rawdata/example_MMhur_R1.fastq rawdata/example_MMhur_R2.fastq --GRCm38 -n MMhur -o results
Example: fasticlip -i rawdata/example_Hmhur_R1.fastq rawdata/example_Hmhur_R2.fastq --GRCh38 -n Hmhur -o results
Note that the current pipeline is compatible only with the GRCh38 (human) and GRCm38 (mouse) assemblies, because it relies on a tailored set of annotations. We plan to release details on generating annotation files for other genomes in the future.
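When launching several runs from a script, it can help to assemble the command from the documented flags instead of concatenating strings by hand. The sketch below is only an illustration and is not part of the package: `build_fasticlip_command` is a hypothetical helper, and it uses only flags that appear in the usage line above (`-i`, `--trimmed`, `--GRCh38`/`--GRCm38`, `-n`, `-o`, `-c`).

```python
import shlex


def build_fasticlip_command(inputs, genome, name, output, trimmed=False, cores=None):
    """Assemble a fasticlip argument list (hypothetical helper, not part of FAST-iCLIP)."""
    if genome not in ("GRCh38", "GRCm38"):
        raise ValueError("genome must be 'GRCh38' (human) or 'GRCm38' (mouse)")
    if not inputs:
        raise ValueError("at least one input FASTQ file is required")
    cmd = ["fasticlip", "-i", *inputs]
    if trimmed:
        cmd.append("--trimmed")  # inputs were already trimmed
    cmd += ["--" + genome, "-n", name, "-o", output]
    if cores is not None:
        cmd += ["-c", str(cores)]  # number of cores used by bowtie2
    return cmd


cmd = build_fasticlip_command(
    ["rawdata/example_MMhur_R1.fastq", "rawdata/example_MMhur_R2.fastq"],
    genome="GRCm38",
    name="MMhur",
    output="results",
)
print(shlex.join(cmd))
# To actually launch the run: subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting problems when file names contain spaces.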
flag | description |
---|---|
-h, --help | show this help message and exit |
-i INPUT(s) | One or more input FASTQ (or fastq.gz) files, separated by spaces |
--GRCh38 | required if your CLIP is from human |
--GRCm38 | required if your CLIP is from mouse |
-n NAME | Name of output directory |
-o OUTPUT | Name of directory where output directory will be made |
flag | description |
---|---|
--trimmed | flag if files are already trimmed |
-f N | Number of bases to trim from 5' end of each read. Default is 14. If using irCLIP RT primers, this value should be 18. |
-a ADAPTER | 3' adapter to trim from the end of each read. Default is AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG. |
-tr REPEAT_THRESHOLD_RULE | m,n: at least m samples must each have at least n RT stops mapped to repeat RNAs. Default is 1,4 (1 sample); 2,3 (2 samples); x,2 (x>2 samples) |
-tv EXOVIRAL_THRESHOLD_RULE | m,n: at least m samples must each have at least n RT stops mapped to viral genome. Default is 1,4 (1 sample); 2,3 (2 samples); x,2 (x>2 samples) |
-tn NONREPEAT_THRESHOLD_RULE | m,n: at least m samples must each have at least n RT stops mapped to nonrepeat RNAs. Default is 1,4 (1 sample); 2,3 (2 samples); x,2 (x>2 samples) |
-bm BOWTIE_MAPQ | Minimum MAPQ (Bowtie alignment to repeat/tRNA/retroviral indexes) score allowed. Default is 42. |
-q Q | Minimum average quality score allowed during read filtering. Default is 25. |
-p P | Percentage of bases that must have quality > q during filtering. Default is 80. |
-l L | Minimum length of read. Default is 15. |
-c C | Number of cores used by bowtie2. Default is 8. |
--verbose | Prints out lots of things :) |
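The `m,n` threshold rules in the table (`-tr`, `-tn`, `-tv`) can be read as a simple predicate over per-sample RT stop counts. The following sketch shows one way to express that rule and its documented defaults. The function names are hypothetical, the unit the counts refer to (for example a site or a gene) is not stated here and is left to the caller, and `x,2` is interpreted as "all x samples need at least 2 RT stops", which is an assumption.

```python
def default_threshold(n_samples):
    """Return the documented default (m, n) rule for a given number of samples."""
    if n_samples < 1:
        raise ValueError("need at least one sample")
    if n_samples == 1:
        return (1, 4)
    if n_samples == 2:
        return (2, 3)
    return (n_samples, 2)  # assumption: "x,2" means all x samples need >= 2 RT stops


def parse_rule(text):
    """Parse an 'm,n' string such as the value given to -tr, -tn, or -tv."""
    m, n = (int(part) for part in text.split(","))
    return (m, n)


def passes(counts_per_sample, rule):
    """True if at least m samples each have at least n RT stops."""
    m, n = rule
    return sum(1 for count in counts_per_sample if count >= n) >= m


print(passes([3, 5], default_threshold(2)))  # True: both replicates have >= 3 stops
print(passes([2, 5], parse_rule("2,3")))     # False: the first replicate falls short
```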
1. Clone this repository: run `git clone git@github.com:ChangLab/FAST-iCLIP.git` if you use ssh authentication, or `git clone https://github.com/ChangLab/FAST-iCLIP.git` otherwise.
2. Run `cd FAST-iCLIP` to enter the folder.
3. Run `./configure`. This will check for dependencies (below) and download necessary files (bowtie indices, gene lists and genomes, and example iCLIP data). Note that configure downloads a very large archive from Amazon that contains all annotation files needed to run the pipeline. Please wait until all annotations are downloaded and extracted. No additional annotation file is needed. The annotations are compatible only with the tools specified below.
4. Run `sudo python setup.py install`. If you do not have sudo privileges, run `python setup.py install --user` or `python setup.py install --prefix=<desired directory>`.
5. The `FAST-iCLIP` folder contains the subfolders `docs`, `rawdata`, and `results`.
6. Add the following lines to your ~/.bashrc and ~/.bash_profile:

       export FASTICLIP_PATH=~/.local/bin/
       export PATH=$FASTICLIP_PATH:$PATH

   Save the file, then run `source ~/.bash_profile`.
7. To test the installation, run `fasticlip -i rawdata/example_MMhur_R1.fastq rawdata/example_MMhur_R2.fastq --GRCm38 -n MMhur -o results`. It should run in ~1 hour. Look inside `results/MMhur` for output files.

The version numbers listed have been tested successfully. There can be difficulties if you choose to run updated versions of some of these dependencies.
At least one FASTQ or compressed FASTQ (fastq.gz) file. Use the `--trimmed` flag if trimming has already been done.
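Because inputs may be plain or gzip-compressed, a quick sanity check before a long run is to open each file and count its records. This is a minimal sketch with hypothetical helper names, assuming the common four-lines-per-record FASTQ layout. It is not how the pipeline itself reads its inputs.

```python
import gzip


def open_fastq(path):
    """Open a FASTQ file as text, whether or not it is gzip-compressed."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_reads(path):
    """Count records, assuming the usual four lines per FASTQ record."""
    with open_fastq(path) as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


# Example (the file is available once ./configure has downloaded the example data):
# print(count_reads("rawdata/example_MMhur_R1.fastq"))
```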
Three subdirectories inside the named directory within `results`.
`figures` has 6 figures in pdf and png format.
Figure 1 visualizes some of the relevant summary data.
A. Read count summary per pipeline step. The source data is: PlotData_ReadsPerPipeFile
B. Bar graph of gene count per RNA type. The source data is: PlotData_ReadAndGeneCountsPerGenetype
C. Pie chart of RT stops mapping to known features of mRNAs including 5'UTR, Introns, CDS, and 3'UTR.
D. Pie chart of RT stops mapped to all indexes included in the FAST-iCLIP pipeline.
Figure 2 provides coverage histograms of binding across each repeat RNA element, both sense and antisense strands.
Source data: PlotDataRepeatRNAHist*
RT stops mapping to the positive and negative strands are shown in blue and red, respectively.
Figure 3 provides coverage histograms of binding across the rRNA, highlighting mature rRNA regions.
Source data: PlotDataRepeatRNAHist*
RT stops mapping to the positive and negative strands are shown in blue and red, respectively.
Figures 4a and 4b provide a summary of snoRNA binding data.
Histograms display RT stop position within an average snoRNA gene body.
The pie chart provides a summary of reads per snoRNA type.
Figure 5 provides histograms of RT stop position within gene body for all remaining ncRNA types.
Figure 6 provides a pie chart composed of RT stops from the top 15 best bound endoVirus elements.
Total RT stop counts per element and percentage of the total endoVirus mapped reads are included for each element in the legend.
Figure 7 provides histograms of RT stop position across the genome for any exoViruses (DV, ZV, or HCV).
RT stops mapping to the positive and negative strands are shown in blue and red, respectively.
`rawdata` has all the PlotData files used to make the figures, as well as intermediate files that can be useful in generating other plots.
`todelete` has files that are unnecessary to keep.
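After a run finishes, the layout described above can be checked with a few lines of Python. The helper below is a hypothetical sketch: it relies only on the three subdirectory names and the pdf/png figure formats mentioned here, and it makes no assumption about individual file names.

```python
from pathlib import Path

EXPECTED_SUBDIRS = ("figures", "rawdata", "todelete")


def summarize_output(output_dir, name):
    """Report which expected subdirectories exist and list the figure files."""
    run_dir = Path(output_dir) / name
    present = {sub: (run_dir / sub).is_dir() for sub in EXPECTED_SUBDIRS}
    figures_dir = run_dir / "figures"
    figures = []
    if figures_dir.is_dir():
        figures = sorted(
            p.name for p in figures_dir.iterdir()
            if p.suffix.lower() in (".pdf", ".png")
        )
    return present, figures


# Example, matching the test command's -o results -n MMhur:
# present, figures = summarize_output("results", "MMhur")
```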
Prepare 2 FASTQ or FASTQ.gz files corresponding to the two replicates for an iCLIP/irCLIP experiment.
Remove duplicates, filter by quality, and trim the adapter from the 3' end of the reads.
After duplicate removal, remove the 5' barcode sequence. Default removes 13 nts.
We then map the reads to indexes in the following order:
After mapping, we isolate the 5' position (the RT stop) for both positive and negative strand reads.
For each replicate, we analyze the RT stop position and read length using iCLIPro.
For each strand, we merge RT stops between replicates.
Partition RT stops by gene type.
Quantification of reads per gene.
Partition protein coding reads by functional mRNA elements.
Partition reads by ncRNA binding region.
Partition repeat-mapped RT stops by region.
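The RT stop isolation and replicate merging steps above can be illustrated with a short sketch. It is written under stated assumptions and is not the pipeline's actual code: reads are given as 0-based, half-open `(chrom, start, end, strand)` tuples, the RT stop is taken to be the 5'-most base of the read, and replicates are merged per exact position using an `m,n` rule like the one accepted by `-tr`, `-tn`, and `-tv`. All function names are hypothetical.

```python
from collections import Counter


def rt_stop(chrom, start, end, strand):
    """Return the 5' base of a read given 0-based, half-open coordinates."""
    if strand == "+":
        return (chrom, start, strand)
    if strand == "-":
        return (chrom, end - 1, strand)
    raise ValueError("strand must be '+' or '-'")


def count_rt_stops(reads):
    """Count RT stops per (chrom, position, strand) for one replicate."""
    return Counter(rt_stop(*read) for read in reads)


def merge_replicates(replicate_counts, m, n):
    """Keep positions where at least m replicates have >= n RT stops; sum their counts."""
    merged = {}
    positions = set().union(*replicate_counts)
    for pos in positions:
        counts = [rep.get(pos, 0) for rep in replicate_counts]
        if sum(1 for c in counts if c >= n) >= m:
            merged[pos] = sum(counts)
    return merged


rep1 = count_rt_stops([("chr1", 100, 130, "+"), ("chr1", 100, 125, "+"), ("chr1", 50, 80, "-")])
rep2 = count_rt_stops([("chr1", 100, 140, "+"), ("chr1", 100, 120, "+"), ("chr1", 60, 80, "-")])
print(merge_replicates([rep1, rep2], m=2, n=2))
# {('chr1', 100, '+'): 4}
```

Keeping the strand in the key means positive and negative strand RT stops are merged separately, matching the per-strand merging described above.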