dieterich-lab / DCC

DCC uses output from the STAR read mapper to systematically detect back-splice junctions in next-generation sequencing data. DCC applies a series of filters and integrates data across replicate sets to arrive at a precise list of circRNA candidates.
https://dieterichlab.org/software/
GNU General Public License v3.0

DCC: detect circRNAs from chimeric reads


DCC is a Python package intended to detect and quantify circRNAs with high specificity. DCC works with the STAR (Dobin et al., 2013) chimeric.out.junction files, which contain chimerically aligned reads, including circRNA junction-spanning reads.


Installation


DCC depends on pysam, pandas, numpy, and HTSeq. The installation process of DCC will automatically check for the dependencies and install or update missing (Python) packages. Different installation options are available:

1) Download the `latest stable DCC release <https://github.com/dieterich-lab/DCC/releases>`_

.. code-block:: bash

  $ tar -xvf DCC-.tar.gz
  $ cd DCC-
  $ python setup.py install --user

2) git clone

.. code-block:: bash

  $ git clone https://github.com/dieterich-lab/DCC.git
  $ cd DCC
  $ python setup.py install --user

Check the installation:

.. code-block:: bash

  $ DCC --version

If the Python installation binary path (e.g. $HOME/.local/bin on Ubuntu) is not included in your PATH, it is also possible to run DCC directly:

.. code-block:: bash

  $ python /scripts/DCC

or even

.. code-block:: bash

  $ python /DCC/main.py
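
Before falling back to the direct invocation, it can help to check whether the launcher is actually missing from the search path. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of DCC, and the default directory is assumed from the Ubuntu example above.

.. code-block:: python

  import os
  import shutil


  def find_dcc(program="DCC"):
      """Return the full path of the launcher, or None if it is not on PATH."""
      return shutil.which(program)


  def user_bin_on_path(user_bin=None):
      """Report whether the per-user script directory is listed in PATH."""
      # Assumption: pip/setup.py --user installs scripts into ~/.local/bin.
      user_bin = user_bin or os.path.join(os.path.expanduser("~"), ".local", "bin")
      entries = os.environ.get("PATH", "").split(os.pathsep)
      return os.path.normpath(user_bin) in [os.path.normpath(e) for e in entries if e]


  if __name__ == "__main__":
      print("DCC launcher:", find_dcc() or "not found on PATH")
      print("~/.local/bin on PATH:", user_bin_on_path())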


Usage


The detection of circRNAs from RNAseq data through DCC can be summarised in three steps:


Step by step tutorial with sample data set


In this tutorial, we use the data set from `Westholm et al. 2014 <http://www.sciencedirect.com/science/article/pii/S2211124714009310>`_ as an example. The data are paired-end, stranded RiboMinus RNAseq data from *Drosophila melanogaster*, consisting of samples of 3 developmental stages (1 day, 4 days, and 20 days) collected from the heads. You can download the data from the NCBI SRA (accession number `SRP001696 <http://www.ncbi.nlm.nih.gov/sra/?term=SRP001696>`_).

Mapping of the fastq files with `STAR <https://github.com/alexdobin/STAR>`_

Note: --alignSJoverhangMin and --chimJunctionOverhangMin should use the same value to make the circRNA expression and linear gene expression level comparable.
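
The note above can be turned into a small pre-flight check. The sketch below parses a STAR command line given as a string and reports whether the two overhang options carry the same value. The helper names are hypothetical and the check only looks at these two flags.

.. code-block:: python

  import shlex

  FLAGS = ("--alignSJoverhangMin", "--chimJunctionOverhangMin")


  def overhang_values(star_command):
      """Extract the two overhang settings from a STAR command line string."""
      tokens = shlex.split(star_command)
      values = {}
      for flag in FLAGS:
          if flag in tokens:
              position = tokens.index(flag) + 1
              if position < len(tokens):
                  values[flag] = int(tokens[position])
      return values


  def overhangs_match(star_command):
      """True only if both flags are present and carry the same value."""
      values = overhang_values(star_command)
      return len(values) == len(FLAGS) and len(set(values.values())) == 1


  command = "STAR --alignSJoverhangMin 15 --chimSegmentMin 15 --chimJunctionOverhangMin 15"
  print(overhangs_match(command))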

.. code-block:: bash

  $ mkdir Sample1
  $ cd Sample1
  $ STAR --runThreadN 10 \
         --genomeDir [genome] \
         --outSAMtype BAM SortedByCoordinate \
         --readFilesIn Sample1_1.fastq.gz Sample1_2.fastq.gz \
         --readFilesCommand zcat \
         --outFileNamePrefix [sample prefix] \
         --outReadsUnmapped Fastx \
         --outSJfilterOverhangMin 15 15 15 15 \
         --alignSJoverhangMin 15 \
         --alignSJDBoverhangMin 15 \
         --outFilterMultimapNmax 20 \
         --outFilterScoreMin 1 \
         --outFilterMatchNmin 1 \
         --outFilterMismatchNmax 2 \
         --chimSegmentMin 15 \
         --chimScoreMin 15 \
         --chimScoreSeparation 10 \
         --chimJunctionOverhangMin 15

.. code-block:: bash

  # Create a directory for mate1

  $ mkdir mate1
  $ cd mate1
  $ STAR --runThreadN 10 \
         --genomeDir [genome] \
         --outSAMtype None \
         --readFilesIn Sample1_1.fastq.gz \
         --readFilesCommand zcat \
         --outFileNamePrefix [sample prefix] \
         --outReadsUnmapped Fastx \
         --outSJfilterOverhangMin 15 15 15 15 \
         --alignSJoverhangMin 15 \
         --alignSJDBoverhangMin 15 \
         --seedSearchStartLmax 30 \
         --outFilterMultimapNmax 20 \
         --outFilterScoreMin 1 \
         --outFilterMatchNmin 1 \
         --outFilterMismatchNmax 2 \
         --chimSegmentMin 15 \
         --chimScoreMin 15 \
         --chimScoreSeparation 10 \
         --chimJunctionOverhangMin 15

.. code-block:: bash

  # Create a directory for mate2

  $ mkdir mate2
  $ cd mate2
  $ STAR --runThreadN 10 \
         --genomeDir [genome] \
         --outSAMtype None \
         --readFilesIn Sample1_2.fastq.gz \
         --readFilesCommand zcat \
         --outFileNamePrefix [sample prefix] \
         --outReadsUnmapped Fastx \
         --outSJfilterOverhangMin 15 15 15 15 \
         --alignSJoverhangMin 15 \
         --alignSJDBoverhangMin 15 \
         --seedSearchStartLmax 30 \
         --outFilterMultimapNmax 20 \
         --outFilterScoreMin 1 \
         --outFilterMatchNmin 1 \
         --outFilterMismatchNmax 2 \
         --chimSegmentMin 15 \
         --chimScoreMin 15 \
         --chimScoreSeparation 10 \
         --chimJunctionOverhangMin 15
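
The three STAR runs differ only in their input files, the ``--outSAMtype`` setting, and the extra ``--seedSearchStartLmax 30`` option for the single-mate runs. As a sketch of how to keep the shared options in one place, the hypothetical helper below builds the three argument lists from the values shown above. The genome directory and output prefixes in the usage lines are placeholders, not values from the tutorial.

.. code-block:: python

  # Options that are identical in the joint run and the two single-mate runs.
  SHARED = [
      "--runThreadN", "10",
      "--readFilesCommand", "zcat",
      "--outReadsUnmapped", "Fastx",
      "--outSJfilterOverhangMin", "15", "15", "15", "15",
      "--alignSJDBoverhangMin", "15",
      "--outFilterMultimapNmax", "20",
      "--outFilterScoreMin", "1",
      "--outFilterMatchNmin", "1",
      "--outFilterMismatchNmax", "2",
      "--chimSegmentMin", "15",
      "--chimScoreMin", "15",
      "--chimScoreSeparation", "10",
  ]


  def star_command(genome_dir, fastq_files, prefix, joint, overhang=15):
      """Build the STAR argument list for a joint or a single-mate run."""
      cmd = ["STAR", "--genomeDir", genome_dir,
             "--readFilesIn", *fastq_files,
             "--outFileNamePrefix", prefix]
      if joint:
          cmd += ["--outSAMtype", "BAM", "SortedByCoordinate"]
      else:
          cmd += ["--outSAMtype", "None", "--seedSearchStartLmax", "30"]
      cmd += SHARED
      # One variable feeds both options, so the two values cannot drift apart.
      cmd += ["--alignSJoverhangMin", str(overhang),
              "--chimJunctionOverhangMin", str(overhang)]
      return cmd


  # Placeholder genome directory and prefixes, for illustration only.
  joint = star_command("genome", ["Sample1_1.fastq.gz", "Sample1_2.fastq.gz"], "joint_", joint=True)
  mate1 = star_command("genome", ["Sample1_1.fastq.gz"], "mate1_", joint=False)
  mate2 = star_command("genome", ["Sample1_2.fastq.gz"], "mate2_", joint=False)
  print(" ".join(mate1))

Each list can be passed to ``subprocess.run`` from the matching working directory once the placeholders are replaced.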

Detection of circular RNAs from chimeric.out.junction files with DCC

Acquiring suitable GTF files for repeat masking

.. code-block:: bash

  # Example: convert UCSC identifiers to the ENSEMBL standard

  $ sed -i 's/^chr//g' your_repeat_file.gtf
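
For readers who prefer not to edit the GTF file in place, the same substitution can be sketched in Python. The function names are illustrative. Like the ``sed`` command, this only removes a leading ``chr``; chromosome names that differ in other ways between the two naming schemes may need separate handling.

.. code-block:: python

  import re


  def strip_chr_prefix(lines):
      """Remove a leading 'chr' from each line, as sed 's/^chr//' would."""
      return [re.sub(r"^chr", "", line) for line in lines]


  def convert_file(src, dst):
      """Write a converted copy of src to dst and leave src untouched."""
      with open(src) as handle:
          converted = strip_chr_prefix(handle)
      with open(dst, "w") as out:
          out.writelines(converted)


  example = ["chr2L\trmsk\texon\t100\t200\n", "# a comment line\n"]
  print(strip_chr_prefix(example))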

Preparation of files containing the paths to required chimeric.out.junction files

Pre-mapped chimeric.out.junction files from the Westholm et al. data set are part of the DCC distribution.

.. code-block:: bash

  $ /DCC/data/samplesheet # jointly mapped chimeric.out.junction files
  $ /DCC/data/mate1 # mate1 independently mapped chimeric.out.junction files
  $ /DCC/data/mate2 # mate2 independently mapped chimeric.out.junction files
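
For your own data, the three path files can be generated rather than typed by hand. The sketch below rests on several assumptions that should be checked against your setup: each file lists one path per line, every sample directory follows the ``Sample1``, ``Sample1/mate1``, ``Sample1/mate2`` layout from the mapping step, and STAR was run without an additional file name prefix so that the junction file is called ``Chimeric.out.junction``. The function name is hypothetical.

.. code-block:: python

  from pathlib import Path


  def write_sheets(sample_dirs, out_dir, junction_name="Chimeric.out.junction"):
      """Write samplesheet, mate1 and mate2 with one junction file path per line."""
      out_dir = Path(out_dir)
      out_dir.mkdir(parents=True, exist_ok=True)
      # Sheet name -> subdirectory of each sample directory (assumed layout).
      layout = {"samplesheet": "", "mate1": "mate1", "mate2": "mate2"}
      written = {}
      for sheet, subdir in layout.items():
          paths = [str(Path(d) / subdir / junction_name) for d in sample_dirs]
          (out_dir / sheet).write_text("\n".join(paths) + "\n")
          written[sheet] = paths
      return written

A call such as ``write_sheets(["Sample1", "Sample2"], ".")`` writes the three files into the current directory; the sample order must be the same in all three files, which the shared loop guarantees.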

Running DCC

After performing all preparation steps DCC can now be started:

.. code-block:: bash

  # Run DCC to detect circRNAs, using the Westholm data as an example.
  # The trailing comments are annotations only: remove them before running,
  # because a backslash followed by a comment does not continue the line.

  $ DCC @samplesheet \ # @ is generally used to specify a file name
        -mt1 @mate1 \ # mate1 file containing the mate1 independently mapped chimeric.out.junction files
        -mt2 @mate2 \ # mate2 file containing the mate2 independently mapped chimeric.out.junction files
        -D \ # run in circular RNA detection mode
        -R [Repeats].gtf \ # regions in this GTF file are masked from circular RNA detection
        -an [Annotation].gtf \ # annotation is used to assign gene names to known transcripts
        -Pi \ # run in paired independent mode, i.e. use -mt1 and -mt2
        -F \ # filter the circular RNA candidate regions
        -M \ # filter out candidates from mitochondrial chromosomes
        -Nr 5 6 \ # minimum count in one replicate [1] and number of replicates the candidate has to be detected in [2]
        -fg \ # candidates are not allowed to span more than one gene
        -G \ # also run host gene expression
        -A [Reference].fa # name of the fasta genome reference file; must be indexed, i.e. a .fai file must be present

  # For single end, non-stranded data:

  $ DCC @samplesheet -D -R [Repeats].gtf -an [Annotation].gtf -F -M -Nr 5 6 -fg -G -A [Reference].fa

  $ DCC @samplesheet -mt1 @mate1 -mt2 @mate2 -D -S -R [Repeats].gtf -an [Annotation].gtf -Pi -F -M -Nr 5 6 -fg

  # For details on the parameters please refer to the help page of DCC:

  $ DCC -h
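
One way to keep a note next to every option and still end up with a command that can be executed is to hold the options in a Python list. The sketch below uses only the options from the paired-end example above. The bracketed file names are the same placeholders and must be replaced, and the helper name is hypothetical.

.. code-block:: python

  import shlex

  # (arguments, note) pairs; the notes never reach the command line.
  DCC_OPTIONS = [
      (["@samplesheet"], "jointly mapped junction files"),
      (["-mt1", "@mate1"], "mate1 independently mapped junction files"),
      (["-mt2", "@mate2"], "mate2 independently mapped junction files"),
      (["-D"], "circular RNA detection mode"),
      (["-R", "[Repeats].gtf"], "regions masked from detection"),
      (["-an", "[Annotation].gtf"], "annotation used to assign gene names"),
      (["-Pi"], "paired independent mode, uses -mt1 and -mt2"),
      (["-F"], "filter the candidate regions"),
      (["-M"], "filter out mitochondrial candidates"),
      (["-Nr", "5", "6"], "minimum count and number of replicates"),
      (["-fg"], "candidates may not span more than one gene"),
      (["-G"], "also run host gene expression"),
      (["-A", "[Reference].fa"], "indexed FASTA genome reference"),
  ]


  def dcc_command(options=DCC_OPTIONS):
      """Flatten the annotated option list into an argument list."""
      cmd = ["DCC"]
      for args, _note in options:
          cmd.extend(args)
      return cmd


  # Print a shell-safe version of the command for inspection.
  print(" ".join(shlex.quote(arg) for arg in dcc_command()))

The resulting list can be handed to ``subprocess.run`` directly, which avoids shell quoting issues altogether.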

Notes:


Output files generated by DCC


The output of DCC consists of the following four files: CircRNACount, CircCoordinates, LinearCount and CircSkipJunctions.
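
For a quick look at these files outside of R, a small standard-library reader may be enough. The sketch assumes the files are tab-delimited with a header row, which is how the R import further below treats CircRNACount, LinearCount and CircCoordinates. Whether CircSkipJunctions has the same layout is not stated here, so that file should be checked first. The function name is illustrative.

.. code-block:: python

  import csv


  def read_table(path):
      """Read a tab-delimited file with a header row into a list of dicts."""
      with open(path, newline="") as handle:
          return list(csv.DictReader(handle, delimiter="\t"))


  def summarise(path):
      """Return (number of rows, column names) for a quick sanity check."""
      rows = read_table(path)
      columns = list(rows[0].keys()) if rows else []
      return len(rows), columns

For example, ``summarise("CircRNACount")`` and ``summarise("LinearCount")`` should report the same number of rows if both tables describe the same set of candidates.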


Test for host-independently regulated circRNAs with `CircTest <https://github.com/dieterich-lab/CircTest>`_


Prerequisites:

Import of DCC output files into R:

Using user-generated data

.. code-block:: R

  library(CircTest)

  CircRNACount <- read.delim('CircRNACount',header=T)
  LinearCount <- read.delim('LinearCount',header=T)
  CircCoordinates <- read.delim('CircCoordinates',header=T)

  CircRNACount_filtered <- Circ.filter(circ = CircRNACount,
                                       linear = LinearCount,
                                       Nreplicates = 6,
                                       filter.sample = 6,
                                       filter.count = 5,
                                       percentage = 0.1
                                      )

  CircCoordinates_filtered <- CircCoordinates[rownames(CircRNACount_filtered),]
  LinearCount_filtered <- LinearCount[rownames(CircRNACount_filtered),]

Alternatively, the pre-processed Westholm et al. data from CircTest package may be used:

.. code-block:: R

  library(CircTest)

  data(Circ)
  CircRNACount_filtered <- Circ
  data(Coordinates)
  CircCoordinates_filtered <- Coordinates
  data(Linear)
  LinearCount_filtered <- Linear

Test for host-independently regulated circRNAs

Execute the test

.. code-block:: R

  test = Circ.test(CircRNACount_filtered,
                   LinearCount_filtered,
                   CircCoordinates_filtered,
                   group=c(rep(1,6),rep(2,6),rep(3,6))
                  )

  # Significant results can be shown in a summary table

  View(test$summary_table)

Visualisation of significantly, host-independently regulated circRNAs

.. code-block:: R

  for (i in rownames(test$summary_table)) {
    Circ.ratioplot(CircRNACount_filtered,
                   LinearCount_filtered,
                   CircCoordinates_filtered,
                   plotrow=i,
                   groupindicator1=c(rep('1days',6),rep('4days',6),rep('20days',6)),
                   lab_legend='Ages'
                  )
  }

For further details on the usage of CircTest please refer to the corresponding GitHub project.


Problems, issues, and errors