DCC: detect circRNAs from chimeric reads
DCC is a python package intended to detect and quantify circRNAs with high specificity. DCC works with the STAR (Dobin et al., 2013) chimeric.out.junction
files which contains chimerically aligned reads including circRNA junction spanning reads.
Installation
DCC depends on pysam, pandas, numpy, and HTSeq. The installation process of DCC will automatically check for the dependencies and install or update missing (Python) packages. Different installation options are available:
1) Download the latest stable DCC release <https://github.com/dieterich-lab/DCC/releases>
_
.. code-block:: bash
$ tar -xvf DCC-
$ cd DCC-
$ python setup.py install --user
2) git clone
.. code-block:: bash
$ git clone https://github.com/dieterich-lab/DCC.git
$ cd DCC
$ python setup.py install --user
Check the installation:
.. code-block:: bash
$ DCC --version
If the Python installation binary path [e.g. $HOME/.local/bin for Ubuntu] is not included in your path, it is also possible run DCC directly:
.. code-block:: bash
$ python
or even
$ python
Usage
The detection of circRNAs from RNAseq data through DCC can be summarised in three steps:
Mapping of RNAseq data from quality checked fastq files. For paired-end (PE) data it is recommended to map the pair jointly and separately. The reason is that STAR does not output reads or read pairs that contain more than one chimeric junction.
Prepare the input files required by DCC. In summary, only one file is mandatory: samplesheet
, which specifies the locations for the chimeric.out.junction
files (one relative or absolute path per line). [Command line flag: @samplesheet
]
A GTF format annotation of repetitive regions which is used to filter out circRNA candidates from repetitive regions. [Command line flag: -R your_repeat_file.gtf
]
For paired-end sequencing two files, e.g. mate1
and mate2
which contain the paths to the chimeric.out.junction
files originating from the separate mate mapping step. [Command line flags: -m1 mate1file
and -m2 mate2file
]
You may specify the location of the BAM files via the -B flag otherwise DCC tries to guess their location based on the supplied chimeric.out.junction
paths. [Command line flag: -B @bam_file_list
]
DCC requires the SJ.out.tab
files generated by STAR. DCC assumes they are located in the same folder as the BAM files and also retain their SJ.out.tab
.
DCC can be used to process circular RNA detection and host gene expression detection either in a one-pass strategy or might be used for only one part of the analysis:
-D
and -G
option-D
-G
Step by step tutorial with sample data set
In this tutorial, we use the data set from Westholm et al. 2014 <http://www.sciencedirect.com/science/article/pii/S2211124714009310>
as an example. The data are paired-end, stranded RiboMinus RNAseq data from Drosophila melanogaster, consisting of samples of 3 developmental stages (1 day, 4 days, and 20 days) collected from the heads. You can download the data from the NCBI SRA (accession number SRP001696 <http://www.ncbi.nlm.nih.gov/sra/?term=SRP001696>
).
STAR <https://github.com/alexdobin/STAR>
_Note: --alignSJoverhangMin
and --chimJunctionOverhangMin
should use the same value to make the circRNA expression and linear gene expression level comparable.
.. code-block:: bash
$ mkdir Sample1 $ cd Sample1 $ STAR --runThreadN 10 \ --genomeDir [genome] \ --outSAMtype BAM SortedByCoordinate \ --readFilesIn Sample1_1.fastq.gz Sample1_2.fastq.gz \ --readFilesCommand zcat \ --outFileNamePrefix [sample prefix] \ --outReadsUnmapped Fastx \ --outSJfilterOverhangMin 15 15 15 15 \ --alignSJoverhangMin 15 \ --alignSJDBoverhangMin 15 \ --outFilterMultimapNmax 20 \ --outFilterScoreMin 1 \ --outFilterMatchNmin 1 \ --outFilterMismatchNmax 2 \ --chimSegmentMin 15 \ --chimScoreMin 15 \ --chimScoreSeparation 10 \ --chimJunctionOverhangMin 15 \
.. code-block:: bash
$ mkdir mate1 $ cd mate1 $ STAR --runThreadN 10 \ --genomeDir [genome] \ --outSAMtype None \ --readFilesIn Sample1_1.fastq.gz \ --readFilesCommand zcat \ --outFileNamePrefix [sample prefix] \ --outReadsUnmapped Fastx \ --outSJfilterOverhangMin 15 15 15 15 \ --alignSJoverhangMin 15 \ --alignSJDBoverhangMin 15 \ --seedSearchStartLmax 30 \ --outFilterMultimapNmax 20 \ --outFilterScoreMin 1 \ --outFilterMatchNmin 1 \ --outFilterMismatchNmax 2 \ --chimSegmentMin 15 \ --chimScoreMin 15 \ --chimScoreSeparation 10 \ --chimJunctionOverhangMin 15 \
.. code-block:: bash
$ mkdir mate2 $ cd mate2 $ STAR --runThreadN 10 \ --genomeDir [genome] \ --outSAMtype None \ --readFilesIn Sample1_2.fastq.gz \ --readFilesCommand zcat \ --outFileNamePrefix [sample prefix] \ --outReadsUnmapped Fastx \ --outSJfilterOverhangMin 15 15 15 15 \ --alignSJoverhangMin 15 \ --alignSJDBoverhangMin 15 \ --seedSearchStartLmax 30 \ --outFilterMultimapNmax 20 \ --outFilterScoreMin 1 \ --outFilterMatchNmin 1 \ --outFilterMismatchNmax 2 \ --chimSegmentMin 15 \ --chimScoreMin 15 \ --chimScoreSeparation 10 \ --chimJunctionOverhangMin 15 \
chimeric.out.junction
files with DCCIt is strongly recommended to specify a repetitive region file in GTF format for filtering.
A suitable file can for example be obtained through the UCSC table browser <http://genome.ucsc.edu/cgi-bin/hgTables>
_ . After choosing the genome, a group like Repeats or Variation and Repeats has to be selected. For the track, we recommend to choose RepeatMasker together with Simple Repeats and combine the results afterwards.
Note: the output file needs to comply with the GTF format specification. Additionally it may be the case that the names of chromosomes from different databases differ, e.g. 1 for chromosome 1 from ENSEMBL compared to chr1 for chromosome 1 from UCSC. Since the chromosome names are important for the correct functionality of DCC a sample command for converting the identifiers may be:
.. code-block:: bash
$ sed -i 's/^chr//g' your_repeat_file.gtf
chimeric.out.junction
files samplesheet
file, containing the paths to the jointly mapped chimeric.out.junction
files
.. code-block:: bash
$ cat samplesheet /path/to/STAR/sample_1/joint_mapping/chimeric.out.junction /path/to/STAR/sample_2/joint_mapping/chimeric.out.junction /path/to/STAR/sample_3/joint_mapping/chimeric.out.junction
mate1
file, containing the paths to chimeric.out.junction
files of the separately mapped first read of paired-end data
.. code-block:: bash
$ cat mate2 /path/to/STAR/sample_1_mate1/joint_mapping/chimeric.out.junction /path/to/STAR/sample_2_mate1/joint_mapping/chimeric.out.junction /path/to/STAR/sample_3_mate1/joint_mapping/chimeric.out.junction
mate2
file, containing the paths to chimeric.out.junction
files of the separately mapped first read of paired-end data
.. code-block:: bash
$ cat mate2 /path/to/STAR/sample_1_mate2/joint_mapping/chimeric.out.junction /path/to/STAR/sample_2_mate2/joint_mapping/chimeric.out.junction /path/to/STAR/sample_3_mate2/joint_mapping/chimeric.out.junction
Pre-mapped chimeric.out.junction
files from Westholm et al. data set are part of the DCC distribution
.. code-block:: bash
$
After performing all preparation steps DCC can now be started:
.. code-block:: bash
$ DCC @samplesheet \ # @ is generally used to specify a file name -mt1 @mate1 \ # mate1 file containing the mate1 independently mapped chimeric.junction.out files -mt2 @mate2 \ # mate2 file containing the mate1 independently mapped chimeric.junction.out files -D \ # run in circular RNA detection mode -R [Repeats].gtf \ # regions in this GTF file are masked from circular RNA detection -an [Annotation].gtf \ # annotation is used to assign gene names to known transcripts -Pi \ # run in paired independent mode, i.e. use -mt1 and -mt2 -F \ # filter the circular RNA candidate regions -M \ # filter out candidates from mitochondrial chromosomes -Nr 5 6 \ minimum count in one replicate [1] and number of replicates the candidate has to be detected in [2] -fg \ # candidates are not allowed to span more than one gene -G \ # also run host gene expression -A [Reference].fa \ # name of the fasta genome reference file; must be indexed, i.e. a .fai file must be present
$ DCC @samplesheet -D -R [Repeats].gtf -an [Annotation].gtf -F -M -Nr 5 6 -fg -G -A [Reference].fa
$ DCC @samplesheet -mt1 @mate1 -mt2 @mate2 -D -S -R [Repeats].gtf -an [Annotation].gtf -Pi -F -M -Nr 5 6 -fg
$ DCC -h
Notes:
By default, DCC assumes that the data is stranded. For non-stranded data the -N
flag should be used.
Although not mandatory, we strongly recommend to the -F
filtering step
Output files generated by DCC
The output of DCC consists of the following four files: CircRNACount, CircCoordinates, LinearCount and CircSkipJunctions.
CircRNACount: a table containing read counts for circRNAs detected. First three columns are chr, circRNA start, circRNA end. From fourth column on are the circRNA read counts, one sample per column, shown in the order given in your samplesheet.
CircCoordinates: circular RNA annotations in BED format. The columns are chr, start, end, genename, junctiontype (based on STAR; 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT), strand, circRNA region (startregion-endregion), overall regions (the genomic features circRNA coordinates interval covers).
LinearCount: host gene expression count table, same setup with CircRNACount file.
CircSkipJunctions: circSkip junctions. The first three columns are the same as in LinearCount/CircRNACount, the following columns represent the circSkip junctions found for each sample. circSkip junctions are given as chr:start-end:count, e.g. chr1:1787-6949:10. It is possible that for one circRNA multiple circSkip junctions are found due to the fact the the circular RNA may arise from different isoforms. In this case, multiple circSkip junctions are delimited with semicolon. A 0 implies that no circSkip junctions have been found for this circRNA.
Test for host-independently regulated circRNAs with CircTest <https://github.com/dieterich-lab/CircTest>
_
Prerequisites:
CircTest <https://github.com/dieterich-lab/CircTest>
_ package must be installed.. code-block:: R
library(CircTest)
CircRNACount <- read.delim('CircRNACount',header=T) LinearCount <- read.delim('LinearCount',header=T) CircCoordinates <- read.delim('CircCoordinates',header=T)
CircRNACount_filtered <- Circ.filter(circ = CircRNACount, linear = LinearCount, Nreplicates = 6, filter.sample = 6, filter.count = 5, percentage = 0.1 )
CircCoordinates_filtered <- CircCoordinates[rownames(CircRNACount_filtered),] LinearCount_filtered <- LinearCount[rownames(CircRNACount_filtered),]
.. code-block:: R
library(CircTest)
data(Circ) CircRNACount_filtered <- Circ data(Coordinates) CircCoordinates_filtered <- Coordinates data(Linear) LinearCount_filtered <- Linear
.. code-block:: R
test = Circ.test(CircRNACount_filtered, LinearCount_filtered, CircCoordinates_filtered, group=c(rep(1,6),rep(2,6),rep(3,6)) )
View(test$summary_table)
.. code-block:: R
for (i in rownames(test$summary_table)) { Circ.ratioplot(CircRNACount_filtered, LinearCount_filtered, CircCoordinates_filtered, plotrow=i, groupindicator1=c(rep('1days',6),rep('4days',6),rep('20days',6)), lab_legend='Ages' ) }
For further details on the usage of CircTest please refer to the corresponding GitHub project.
Problems, issues, and errors
Create an issue <https://github.com/dieterich-lab/DCC/issues/new>
_-k
flag when reporting an error to keep temporary files important for debugging purposes