dieterich-lab / DCC

DCC uses output from the STAR read mapper to systematically detect back-splice junctions in next-generation sequencing data. DCC applies a series of filters and integrates data across replicate sets to arrive at a precise list of circRNA candidates.
https://dieterichlab.org/software/
GNU General Public License v3.0

duplicated circRNAs #83

Closed: Eteleeb closed this issue 2 years ago

Eteleeb commented 4 years ago

Hi,

I have run DCC successfully on my paired-end stranded data, but I noticed that some circRNAs appear twice: once annotated with the host gene and once as "not_annotated". Here is an example:

chr1 1223244 1223968 SDF4 2 - exon-exon transcript,gene,exon,CDS
chr1 1223244 1223968 not_annotated 1 + intergenic-intergenic not_annotated

I thought the problem might be that I had not enabled the "-ss" parameter. I included it, but the result is still the same. First, how should the "-ss" parameter be used for first-strand data? Second, why am I getting these duplicated results with the same coordinates?

Here is my command:

DCC @$path_to_samplesheet \
    -mt1 @$path_to_R1 \
    -mt2 @$path_to_R2 \
    -T 20 -D -Pi -F -M -Nr 1 1 -fg -G -ss -O $dcc_dir -t $dcc_dir/tmp \
    -R $ref_dir/GRCh38_Repeats_simpleRepeats_RepeatMasker.gtf \
    -an $ref_dir/gencode.v33.GRCh38.annotation.ERCC92.gtf \
    -A $ref_dir/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta \
    -B @$path_to_bams

Thank you.

-Abdallah

tjakobi commented 4 years ago

Dear @Eteleeb,

The -ss parameter is specific for RNA-seq data produced with second-strand libraries. See https://www.biostars.org/p/64250/ for some information about that topic.

The duplicate is actually not a duplicate. DCC found one circRNA for the annotated gene on the annotated strand - but also a possible circRNA candidate on the antisense strand. This happens regularly, but should not occur too often (i.e. if you have most of your circRNAs annotated as not_annotated there is something wrong with the DCC settings - that would be a case for the -ss option).

Cheers, Tobias
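The check described above (same coordinates, reported once per strand) can be sketched in a few lines of Python. This is a minimal illustration, not part of DCC: the function name `find_both_strand_candidates` is hypothetical, and the column order (chromosome, start, end, gene, junction type, strand) is assumed from the example row posted in this thread.

```python
from collections import defaultdict

def find_both_strand_candidates(lines):
    """Return coordinates that are reported on more than one strand.

    Assumes whitespace-separated rows laid out like the example in this
    thread: chrom, start, end, gene, junction type, strand, ...
    """
    by_coords = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Skip blank lines, malformed lines, and a possible header row.
        if len(fields) < 6 or not fields[1].isdigit():
            continue
        chrom, start, end, gene, _junction_type, strand = fields[:6]
        by_coords[(chrom, int(start), int(end))].append((gene, strand))
    # Keep only coordinates whose rows cover at least two distinct strands.
    return {
        coords: hits
        for coords, hits in by_coords.items()
        if len({strand for _, strand in hits}) > 1
    }

# The two rows quoted in the original question.
rows = [
    "chr1 1223244 1223968 SDF4 2 - exon-exon transcript,gene,exon,CDS",
    "chr1 1223244 1223968 not_annotated 1 + intergenic-intergenic not_annotated",
]
print(find_both_strand_candidates(rows))
```

Counting how many coordinates this returns, relative to the total number of candidates, gives a quick sense of whether sense/antisense pairs are occasional or widespread.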

Eteleeb commented 4 years ago

Thank you, Tobias, for the clarification. So it is possible for the same circRNA, with the exact same start and end positions, to be on both strands. I didn't check many, but I saw this situation in a few cases. The issue is that more than 50% of the detected circRNAs are classified as "not_annotated", which concerns me. Out of 9,158 circRNA candidates, only 3,720 were annotated. Any thoughts on why this happened? I tried "-ss" with only two samples and saw the same example as the one I included above, but I didn't run with "-ss" on everything. Do you think I should use the "-ss" parameter and run on everything? I am not sure whether my library was first-strand or second-strand, but it is certainly strand-specific. Thank you.

-Abdallah
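For reference, the share of "not_annotated" candidates implied by the two counts quoted above can be worked out directly. The helper name `not_annotated_share` is hypothetical; only the numbers 9,158 and 3,720 come from the thread.

```python
def not_annotated_share(total_candidates, annotated):
    """Return (count, fraction) of candidates lacking an annotation."""
    not_annotated = total_candidates - annotated
    return not_annotated, not_annotated / total_candidates

# Counts quoted in the comment above.
count, share = not_annotated_share(9158, 3720)
print(f"{count} of 9158 candidates are not annotated ({share:.1%})")
# prints "5438 of 9158 candidates are not annotated (59.4%)"
```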

tjakobi commented 4 years ago

Hi @Eteleeb,

I'd try rerunning everything with -ss.

If that does not work we will see what else can be done.

Cheers, Tobias

Eteleeb commented 4 years ago

I know that we used the TruSeq Stranded Total RNA Sample Prep with Ribo-Zero Gold kit (Illumina) for our library, and according to this:

The following list gives an overview of common sequencing kits and the respective parameter choice:

First-strand kits (default):

● All dUTP methods, NSR, NNSR
● TruSeq Stranded Total RNA Sample Prep Kit
● TruSeq Stranded mRNA Sample Prep Kit
● NEB Ultra Directional RNA Library Prep Kit
● Agilent SureSelect Strand-Specific

Second-strand kits (second-strand parameter -ss has to be used):

● Directional Illumina (Ligation), Standard SOLiD
● ScriptSeq v2 RNA-Seq Library Preparation Kit
● SMARTer Stranded Total RNA
● Encore Complete RNA-Seq Library Systems

Probably I shouldn't use "-ss", right?
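The kit list quoted above can be captured as a small lookup so that a pipeline decides about -ss from the library kit name. This is only a sketch: the names `FIRST_STRAND_KITS`, `SECOND_STRAND_KITS`, and `needs_ss` are hypothetical, the kit strings are copied from the list above, and exact-string matching is an assumption (real sample sheets may spell kit names differently).

```python
# Kit names copied from the list quoted above.
FIRST_STRAND_KITS = {
    "All dUTP methods, NSR, NNSR",
    "TruSeq Stranded Total RNA Sample Prep Kit",
    "TruSeq Stranded mRNA Sample Prep Kit",
    "NEB Ultra Directional RNA Library Prep Kit",
    "Agilent SureSelect Strand-Specific",
}
SECOND_STRAND_KITS = {
    "Directional Illumina (Ligation), Standard SOLiD",
    "ScriptSeq v2 RNA-Seq Library Preparation Kit",
    "SMARTer Stranded Total RNA",
    "Encore Complete RNA-Seq Library Systems",
}

def needs_ss(kit_name):
    """Return True if the kit is listed as second-strand (-ss required)."""
    if kit_name in SECOND_STRAND_KITS:
        return True
    if kit_name in FIRST_STRAND_KITS:
        return False
    # Refuse to guess for kits that are not in the quoted list.
    raise KeyError(f"Unknown kit, check the library protocol: {kit_name}")

print(needs_ss("TruSeq Stranded Total RNA Sample Prep Kit"))  # prints False
```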

tjakobi commented 4 years ago

Yeah, in that case, -ss should not be used. But it seems to be stranded data, and you are not using -N. So from the parameters everything looks okay. Anyway, I would still run -ss in a second run, just to be able to compare.

Eteleeb commented 4 years ago

Thank you, Tobias. One final question: we are planning to include DCC in our pipeline, and I was wondering whether it is possible to run DCC sample by sample (with -Nr 1 1) and then combine and filter the results. Our pipeline is sample-specific and runs sample by sample. This would give us two advantages: (1) DCC would run immediately on each sample we process within the pipeline, and (2) we think it would significantly reduce the time DCC takes compared with processing all samples in combined mode. Is this something that can be done? Thank you.

-A

tjakobi commented 4 years ago

While this is not directly supported, I have done it from time to time. There might be small differences between running N instances instead of 1, related to some filtering steps. But in general you should get a similar picture.

However, if you deploy this as a pipeline, it would be good to test it once to have a direct comparison between the N runs and the single run.

Then run a diff to see where the differences are.
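One way to do such a comparison is to reduce each result to a set of (chromosome, start, end, strand) keys and compare the union of the per-sample runs with the combined run. The sketch below is an illustration under assumptions, not a DCC feature: the function names are hypothetical, the column order is taken from the example row earlier in this thread, and the way the rows are distributed across runs in the usage example is invented for demonstration.

```python
def load_keys(lines):
    """Collect (chrom, start, end, strand) keys from coordinate rows."""
    keys = set()
    for line in lines:
        fields = line.split()
        # Skip blank lines, malformed lines, and a possible header row.
        if len(fields) < 6 or not fields[1].isdigit():
            continue
        keys.add((fields[0], int(fields[1]), int(fields[2]), fields[5]))
    return keys

def compare_runs(per_sample_runs, combined_run):
    """Compare the union of N per-sample runs against one combined run."""
    merged = set().union(*(load_keys(run) for run in per_sample_runs))
    combined = load_keys(combined_run)
    return {
        "only_per_sample": sorted(merged - combined),
        "only_combined": sorted(combined - merged),
        "shared": len(merged & combined),
    }

# Hypothetical distribution of the thread's two example rows across runs.
annotated = "chr1 1223244 1223968 SDF4 2 - exon-exon transcript,gene,exon,CDS"
antisense = "chr1 1223244 1223968 not_annotated 1 + intergenic-intergenic not_annotated"
report = compare_runs([[annotated], [annotated, antisense]], [annotated])
print(report)
```

Candidates that show up only on one side are the ones worth inspecting, since they are likely affected by the filtering steps mentioned above.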

Eteleeb commented 4 years ago

Thank you for the information. Yes, I have run DCC on five separate samples and then run it combined. I am trying to write my own scripts to combine the results, but I was wondering if I can use "CombineCounts.py" directly from your scripts. Thank you.

-A

tjakobi commented 4 years ago

I haven't used that script in a while, but you should give it a try.