dieterich-lab / DCC

DCC uses output from the STAR read mapper to systematically detect back-splice junctions in next-generation sequencing data. DCC applies a series of filters and integrates data across replicate sets to arrive at a precise list of circRNA candidates.
https://dieterichlab.org/software/
GNU General Public License v3.0

filter by non repetitive region #73

Closed yjm1992312 closed 2 years ago

yjm1992312 commented 4 years ago

Dear Sir, I downloaded the RepeatMasker and Simple_Repeats tracks from UCSC with a Windows browser and then concatenated them with cat RepeatMasker Simple_Repeats > my_Repeats. I counted the rows, and the total was consistent with the original files from UCSC, so I thought the GTF I downloaded was complete. But the output.log showed "Filter by non repetitive region". Could you help me fix it?

I ran DCC with 3 samples, each with 3 replicates: C1_1, C1_2, C1_3, D1_1, D1_2, D1_3, D2_1, D2_2, D2_3. From the output.log, I found that the Chimeric.out.junction files from the mapping that used both mates were analysed as input, for example:
2019-11-19 13:13:02,718 Collecting chimera information from mates-separate mapping
2019-11-19 13:13:55,253 started circRNA detection from file _tmp_DCC/C1_3_Chimeric.out.junction.7OG5R5

I also supplied my mate1 and mate2 files, which list the Chimeric.out.junction files from the separate mate1 and mate2 mappings. But in the output.log I can't find any lines such as "started circRNA detection from file _tmp_DCC/C1_1_1.Chimeric.out.junction" or "started circRNA detection from file _tmp_DCC/C1_1_2.Chimeric.out.junction". Is something wrong with my command? Or does DCC simply not log the analysis of the Chimeric.out.junction files from the separate mate1 and mate2 mappings? Could you help me fix it?

Here is my mate1 file:
C1_1_1.Chimeric.out.junction
C1_2_1.Chimeric.out.junction
C1_3_1.Chimeric.out.junction
D1_1_1.Chimeric.out.junction
D1_2_1.Chimeric.out.junction
D1_3_1.Chimeric.out.junction
D2_1_1.Chimeric.out.junction
D2_2_1.Chimeric.out.junction
D2_3_1.Chimeric.out.junction

Here is my mate2 file:
C1_1_2.Chimeric.out.junction
C1_2_2.Chimeric.out.junction
C1_3_2.Chimeric.out.junction
D1_1_2.Chimeric.out.junction
D1_2_2.Chimeric.out.junction
D1_3_2.Chimeric.out.junction
D2_1_2.Chimeric.out.junction
D2_2_2.Chimeric.out.junction
D2_3_2.Chimeric.out.junction
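Before running DCC it can help to confirm that the mate1 and mate2 lists pair up entry by entry. The following is a minimal sketch, not part of DCC: the helper names `sample_key` and `check_mates` are hypothetical, the `_1`/`_2` suffixes follow the file names shown in this post, and the assumption that both lists must be in the same sample order should be checked against the DCC documentation.

```python
from pathlib import PurePosixPath

# Suffixes taken from the file names in this thread (an assumption for other setups).
MATE1_SUFFIX = "_1.Chimeric.out.junction"
MATE2_SUFFIX = "_2.Chimeric.out.junction"


def sample_key(path, suffix):
    """Return the sample name of a junction file, or None if the suffix does not match."""
    name = PurePosixPath(path).name
    return name[: -len(suffix)] if name.endswith(suffix) else None


def check_mates(mate1, mate2):
    """Return a list of human-readable problems; an empty list means the lists pair up."""
    problems = []
    if len(mate1) != len(mate2):
        problems.append(f"length mismatch: {len(mate1)} vs {len(mate2)}")
    for a, b in zip(mate1, mate2):
        key_a = sample_key(a, MATE1_SUFFIX)
        key_b = sample_key(b, MATE2_SUFFIX)
        if key_a is None or key_b is None or key_a != key_b:
            problems.append(f"pair mismatch: {a} vs {b}")
    return problems


mate1 = ["C1_1_1.Chimeric.out.junction", "C1_2_1.Chimeric.out.junction"]
mate2 = ["C1_1_2.Chimeric.out.junction", "C1_2_2.Chimeric.out.junction"]
print(check_mates(mate1, mate2))  # an empty list means no problems were found
```

In practice the two lists would be read from the files passed to -mt1 and -mt2, one path per line.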

Here was my command:
DCC @/ifs1/User/yanjiamin/raw_data/samplesheet -mt1 @/ifs1/User/yanjiamin/raw_data/mate1 \
-mt2 @/ifs1/User/yanjiamin/raw_data/mate2 -B @/ifs1/User/yanjiamin/raw_data/bam_file_list -D -R /ifs1/User/yanjiamin/raw_data/my_Repeats.gtf \
-an /ifs1/User/yanjiamin/raw_data/gencode.v32.annotation.gtf -Pi -F -M -Nr 5 6 -fg -G \
-A /ifs1/User/yanjiamin/raw_data/GRCh38.primary_assembly.genome.fa -O /ifs1/User/yanjiamin/raw_data/DCC_result&

Here was the output.log:
2019-11-19 13:13:02,709 DCC 0.4.7 started
2019-11-19 13:13:02,709 DCC command line: /ifs1/User/yanjiamin/.local/bin/DCC @/ifs1/User/yanjiamin/raw_data/samplesheet -mt1 @/ifs1/User/yanjiamin/raw_data/mate1 -mt2 @/ifs1/User/yanjiamin/raw_data/mate2 -B @/ifs1/User/yanjiamin/raw_data/bam_file_list -D -R /ifs1/User/yanjiamin/raw_data/my_Repeats.gtf -an /ifs1/User/yanjiamin/raw_data/gencode.v32.annotation.gtf -Pi -F -M -Nr 5 6 -fg -G -A /ifs1/User/yanjiamin/raw_data/GRCh38.primary_assembly.genome.fa -O /ifs1/User/yanjiamin/raw_data/DCC_result1/
2019-11-19 13:13:02,717 Starting to detect circRNAs
2019-11-19 13:13:02,718 Stranded data mode
2019-11-19 13:13:02,718 Please make sure that the read pairs have been mapped both, combined and on a per mate basis
2019-11-19 13:13:02,718 Collecting chimera information from mates-separate mapping
2019-11-19 13:13:55,253 started circRNA detection from file _tmp_DCC/C1_3_Chimeric.out.junction.7OG5R5
2019-11-19 13:13:55,253 started circRNA detection from file _tmp_DCC/C1_1_Chimeric.out.junction.MHN3RK
2019-11-19 13:22:01,449 finished circRNA detection from file _tmp_DCC/C1_3_Chimeric.out.junction.7OG5R5
2019-11-19 13:22:01,449 started circRNA detection from file _tmp_DCC/D1_1_Chimeric.out.junction.MJ4BE6
2019-11-19 13:22:08,374 finished circRNA detection from file _tmp_DCC/C1_1_Chimeric.out.junction.MHN3RK
2019-11-19 13:22:08,374 started circRNA detection from file _tmp_DCC/C1_2_Chimeric.out.junction.CSN675
2019-11-19 13:25:26,690 finished circRNA detection from file _tmp_DCC/D1_1_Chimeric.out.junction.MJ4BE6
2019-11-19 13:25:26,692 started circRNA detection from file _tmp_DCC/D1_2_Chimeric.out.junction.81H46A
2019-11-19 13:26:19,651 finished circRNA detection from file _tmp_DCC/C1_2_Chimeric.out.junction.CSN675
2019-11-19 13:26:19,651 started circRNA detection from file _tmp_DCC/D2_1_Chimeric.out.junction.CW8ENC
2019-11-19 13:30:39,757 finished circRNA detection from file _tmp_DCC/D1_2_Chimeric.out.junction.81H46A
2019-11-19 13:30:39,757 started circRNA detection from file _tmp_DCC/D1_3_Chimeric.out.junction.QMMV6T
2019-11-19 13:31:33,762 finished circRNA detection from file _tmp_DCC/D2_1_Chimeric.out.junction.CW8ENC
2019-11-19 13:31:33,763 started circRNA detection from file _tmp_DCC/D2_2_Chimeric.out.junction.UUHEOG
2019-11-19 13:36:53,468 finished circRNA detection from file _tmp_DCC/D2_2_Chimeric.out.junction.UUHEOG
2019-11-19 13:36:53,468 started circRNA detection from file _tmp_DCC/D2_3_Chimeric.out.junction.BHY6DI
2019-11-19 13:37:15,969 finished circRNA detection from file _tmp_DCC/D1_3_Chimeric.out.junction.QMMV6T
2019-11-19 13:41:23,769 finished circRNA detection from file _tmp_DCC/D2_3_Chimeric.out.junction.BHY6DI
2019-11-19 13:41:23,770 Combining individual circRNA read counts
2019-11-19 13:42:21,255 Write in annotation
2019-11-19 13:42:21,255 Select gene features in Annotation file
2019-11-19 13:49:57,779 Filtering started
2019-11-19 13:49:57,780 Using files _tmp_DCC/tmp_circCount and _tmp_DCC/tmp_coordinates for filtering
2019-11-19 13:50:12,025 Filtering by read counts
2019-11-19 13:50:15,396 Filter by non repetitive region
2019-11-19 13:59:45,765 Deleting circRNA candidates from mitochondrial chromosome
2019-11-19 13:59:45,833 Filtering by gene annotation. CircRNA candidates from more than one genes are deleted
2019-11-19 13:59:46,148 Filtering finished
2019-11-19 14:13:05,398 Ended linear gene expression counting /ifs1/User/yanjiamin/raw_data/DCC_C1/C1_3_Aligned.sortedByCoord.out.bam
2019-11-19 14:13:05,407 Ended post processing /ifs1/User/yanjiamin/raw_data/DCC_C1/C1_3_Aligned.sortedByCoord.out.bam
2019-11-19 14:13:05,789 Ended linear gene expression counting /ifs1/User/yanjiamin/raw_data/DCC_C1/C1_1_Aligned.sortedByCoord.out.bam
2019-11-19 14:13:05,793 WARNING: circRNA start position ('chr15', '44767231') does not have mapped read counts, treated as 0
2019-11-19 14:13:05,794 WARNING: circRNA end position ('chr19', '44708187') does not have mapped read counts, treated as 0
2019-11-19 14:13:05,797 Ended post processing /ifs1/User/yanjiamin/raw_data/DCC_C1/C1_1_Aligned.sortedByCoord.out.bam
2019-11-19 14:24:29,150 Ended linear gene expression counting /ifs1/User/yanjiamin/raw_data/DCC_C1/C1_2_Aligned.sortedByCoord.out.bam
2019-11-19 14:24:29,155 WARNING: circRNA end position ('chr19', '44708187') does not have mapped read counts, treated as 0
2019-11-19 14:24:29,159 Ended post processing /ifs1/User/yanjiamin/raw_data/DCC_C1/C1_2_Aligned.sortedByCoord.out.bam
2019-11-19 14:25:42,002 Ended linear gene expression counting /ifs1/User/yanjiamin/raw_data/DCC_D1/D1_1_Aligned.sortedByCoord.out.bam
2019-11-19 14:25:42,006 WARNING: circRNA start position ('chr3', '56592970') does not have mapped read counts, treated as 0
2019-11-19 14:25:42,007 WARNING: circRNA end position ('chr1', '84866138') does not have mapped read counts, treated as 0
2019-11-19 14:25:42,011 Ended post processing /ifs1/User/yanjiamin/raw_data/DCC_D1/D1_1_Aligned.sortedByCoord.out.bam
2019-11-19 14:37:31,201 Ended linear gene expression counting /ifs1/User/yanjiamin/raw_data/DCC_D2/D2_1_Aligned.sortedByCoord.out.bam
2019-11-19 14:37:31,206 WARNING: circRNA end position ('chr5', '69311204') does not have mapped read counts, treated as 0
2019-11-19 14:37:31,210 Ended post processing /ifs1/User/yanjiamin/raw_data/DCC_D2/D2_1_Aligned.sortedByCoord.out.bam
2019-11-19 14:38:15,795 Ended linear gene expression counting /ifs1/User/yanjiamin/raw_data/DCC_D1/D1_2_Aligned.sortedByCoord.out.bam
2019-11-19 14:38:15,804 Ended post processing /ifs1/User/yanjiamin/raw_data/DCC_D1/D1_2_Aligned.sortedByCoord.out.bam
2019-11-19 14:49:32,700 Ended linear gene expression counting /ifs1/User/yanjiamin/raw_data/DCC_D2/D2_2_Aligned.sortedByCoord.out.bam
2019-11-19 14:49:32,705 WARNING: circRNA start position ('chr4', '177353308') does not have mapped read counts, treated as 0
2019-11-19 14:49:32,705 WARNING: circRNA start position ('chr4', '177353308') does not have mapped read counts, treated as 0
2019-11-19 14:49:32,709 Ended post processing /ifs1/User/yanjiamin/raw_data/DCC_D2/D2_2_Aligned.sortedByCoord.out.bam
2019-11-19 14:52:52,823 Ended linear gene expression counting /ifs1/User/yanjiamin/raw_data/DCC_D1/D1_3_Aligned.sortedByCoord.out.bam
2019-11-19 14:52:52,832 Ended post processing /ifs1/User/yanjiamin/raw_data/DCC_D1/D1_3_Aligned.sortedByCoord.out.bam
2019-11-19 15:00:47,635 Ended linear gene expression counting /ifs1/User/yanjiamin/raw_data/DCC_D2/D2_3_Aligned.sortedByCoord.out.bam
2019-11-19 15:00:47,643 Ended post processing /ifs1/User/yanjiamin/raw_data/DCC_D2/D2_3_Aligned.sortedByCoord.out.bam
2019-11-19 15:00:47,645 Finished linear gene expression counting, start to combine individual sample counts
2019-11-19 15:00:47,825 Finished combine individual linear gene expression counts

yjm1992312 commented 4 years ago

@tjakobi

tjakobi commented 4 years ago

Hi @yjm1992312,

do I understand correctly that you provided the RepeatMasker file and you are unsure if it was used?

The message "Filter by non repetitive region" is supposed to mean "only candidates in NON-repetitive regions are kept". Therefore the filtering should be fine.

On the other hand, you supplied -mt1 and -mt2, therefore those mappings should have been included in the run even if they are not explicitly listed in the log.

Please let me know if this answers your questions.

Cheers, Tobias

yjm1992312 commented 4 years ago

Hi @tjakobi Tobias, Thanks to your answer, I now understand the output.log much better. So DCC ran successfully with my command?

I am confused about the use of the -Nr parameter. I used -Nr 5 6, which is consistent with the example in https://github.com/dieterich-lab/DCC. But I don't understand why I should set -Nr 5 6. What -Nr values should I use in most situations?

DCC gave me the output files CircRNACount, CircCoordinates, LinearCount and CircSkipJunctions. How can I select differentially expressed circRNAs from this output? I also wonder how to normalize the circRNA counts in the DCC output.

Best regards, yjm1992312

tjakobi commented 4 years ago

Hi @yjm1992312,

-Nr 5 6 would imply that you require a count of at least 5 BSJ (back-splice junction) reads in each of at least 6 samples. You should definitely adapt that to your experiment, e.g. -Nr 2 3 for at least 2 BSJ reads in at least 3 samples.
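The threshold logic described here can be illustrated with a small sketch. This is not DCC's own code: `passes_nr` is a hypothetical helper, the counts are made-up toy values, and the use of "at least" (>=) for both thresholds is an assumption based on the wording in this thread. As clarified further down, the second value counts samples regardless of condition.

```python
def passes_nr(counts, min_reads, min_samples):
    """True if at least `min_samples` samples have at least `min_reads` BSJ reads."""
    return sum(1 for c in counts if c >= min_reads) >= min_samples


# Toy BSJ read counts for six samples (hypothetical numbers, not real data).
candidates = {
    "circA": [7, 9, 5, 6, 8, 5],  # well supported in every sample
    "circB": [3, 2, 4, 0, 0, 1],  # supported in three samples only
    "circC": [6, 0, 1, 0, 0, 0],  # supported in a single sample
}

for min_reads, min_samples in [(5, 6), (2, 3)]:
    kept = [name for name, counts in candidates.items()
            if passes_nr(counts, min_reads, min_samples)]
    print(f"-Nr {min_reads} {min_samples}: {kept}")
# -Nr 5 6: ['circA']
# -Nr 2 3: ['circA', 'circB']
```

Lowering either value keeps more candidates; the stricter setting keeps only circRNAs that are well supported in every sample.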

For differential expression you may either use our CircTest R package (https://github.com/dieterich-lab/CircTest) or, alternatively, use the count matrix with edgeR or DESeq2.
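On the normalization question, one simple option is to express each circRNA relative to the linear reads at the same junction. The sketch below is only an illustration of that ratio and is not the CircTest implementation: `circ_to_linear_ratio` is a hypothetical helper, the counts are toy values, and it assumes that CircRNACount and LinearCount list the same candidates and samples in the same order, which should be verified on your own output files.

```python
def circ_to_linear_ratio(circ_reads, linear_reads):
    """Fraction of junction reads that are circular, or None when there are no reads."""
    total = circ_reads + linear_reads
    if total == 0:
        return None
    return circ_reads / total


# Toy rows keyed by (chromosome, start, end); one value per sample (hypothetical).
circ_counts = {("chr1", 1000, 2000): [5, 8, 0]}
linear_counts = {("chr1", 1000, 2000): [15, 24, 0]}

for key, circ_row in circ_counts.items():
    ratios = [circ_to_linear_ratio(c, l)
              for c, l in zip(circ_row, linear_counts[key])]
    print(key, ratios)
# ('chr1', 1000, 2000) [0.25, 0.25, None]
```

For edgeR or DESeq2 the raw counts are normally passed in unchanged, since those packages apply their own library-size normalization.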

Cheers, Tobias

yjm1992312 commented 4 years ago

Hi @tjakobi Tobias, Thank you for your answer. Now I understand the first parameter of -Nr, which is the BSJ read count. I still don't understand the second parameter. I looked it up in the DCC manual, and it refers to the replicates. Does that mean the replicates of my samples? For example, I have three samples, each with three replicates. So should I set the second parameter to 3? Do I understand the second parameter of -Nr correctly? Best regards, yjm1992312

tjakobi commented 4 years ago

Hi @yjm1992312,

the second parameter refers to samples, since DCC doesn't know any sample<>replicate mappings. I should probably update the help text and the default values for this parameter. The parameter only refers to the number of samples with a count > X, independent of the type of sample they belong to.

Cheers, Tobias

yjm1992312 commented 4 years ago

Hi @tjakobi Tobias, So in my situation, my input samples were C1_1, C1_2, C1_3, D1_1, D1_2, D1_3. The second parameter refers to the number of samples. So should I set the second parameter to <= 6, or to exactly 6? I think I am still confused. Best regards, yjm1992312

tjakobi commented 4 years ago

Hi @yjm1992312,

6 would be really conservative since it would require that a circRNA is detected in all samples. I would go for 3 (i.e. all treated/control samples) or 2 (allows for one "error").

Cheers, Tobias

yjm1992312 commented 4 years ago

Hi @tjakobi Tobias, So should I set -Nr 2 3 to obtain more candidate circRNAs? When I set -Nr 5 6, the number of circRNAs in the output is about 800+. "-Nr 5 6 would imply that you require a count of at least 5 BSJ reads for each of the 6 replicates." Does 5 BSJ reads mean five possible circRNAs? Best regards, yjm1992312

yjm1992312 commented 4 years ago

Hi @tjakobi Tobias, And if I input all of my samples (C1_1, C1_2, C1_3, D1_1, D1_2, D1_3, D2_1, D2_2, D2_3), 9 samples in total, how should I set -Nr? Choosing -Nr for different sets of samples really confuses me. How should I set -Nr for different numbers of samples?

tjakobi commented 4 years ago

> Hi @tjakobi Tobias, So should I set -Nr 2 3 to obtain more candidate circRNAs? When I set -Nr 5 6, the number of circRNAs in the output is about 800+. "-Nr 5 6 would imply that you require a count of at least 5 BSJ reads for each of the 6 replicates." Does 5 BSJ reads mean five possible circRNAs? Best regards, yjm1992312

-Nr 5 6 would require 5 BSJ reads for 6 samples (any 6, there is no assignment to conditions here)

tjakobi commented 4 years ago

> Hi @tjakobi Tobias, And if I input all of my samples (C1_1, C1_2, C1_3, D1_1, D1_2, D1_3, D2_1, D2_2, D2_3), 9 samples in total, how should I set -Nr? Choosing -Nr for different sets of samples really confuses me. How should I set -Nr for different numbers of samples?

You may set something like -Nr 5 9 requiring 5 reads in all 9 samples (very conservative).

yjm1992312 commented 4 years ago

Hi @tjakobi Tobias, I think I now have a preliminary understanding of -Nr. But should the second parameter be larger when more samples are input? In my situation, is -Nr 5 9 so strict that fewer circRNAs are reported? So should I lower the -Nr values to get more candidate circRNAs, depending on how many circRNAs are output? And does that mean the default -Nr 2 5 does not suit all situations? Best regards, yjm1992312

tjakobi commented 4 years ago

Hi @yjm1992312,

the default will be applied whenever the user does not supply anything else. In your case I'd maybe go for something like -Nr 5 3, meaning 5 BSJ reads in at least three samples. This would allow a circRNA to be detected even if it's only lowly expressed in one of the three samples.

Cheers, Tobias

yjm1992312 commented 4 years ago

Hi @tjakobi Tobias, Thank you for your kind reply. DCC now runs through successfully. I also processed my data with circexplorer2, and I found that there is no intersection between the two results. The number of DCC output is about 2000 per sample by setting Nr 5 3. And the number of DCC output is about 10000 per sample. That's so weird. So I'm going to run my data through the find_circ program to figure out what the problem with the DCC output is. Best regards, yjm1992312

tjakobi commented 4 years ago

Hi @yjm1992312,

What is the difference between

> The number of DCC output is about 2000 per sample by setting Nr 5 3.

and

> And the number of DCC output is about 10000 per sample,

Was the latter run with different -Nr settings?

Cheers, Tobias

yjm1992312 commented 4 years ago

Hi @tjakobi Tobias, I'm sorry, I made a typo. The number of DCC output is about 2000 per sample with -Nr 5 3, and the number of circexplorer2 output is about 10000 per sample. That's so weird. I used the two programs in order to intersect the DCC output with the circexplorer2 output, because I think the circRNAs found by both programs will be more credible. But to my surprise, there is no intersection between the results of DCC and circexplorer2. My data is ribosome-depleted RNA-seq without RNase R digestion. So I intend to process my data with find_circ to see whether there is an intersection between DCC and find_circ, or between find_circ and circexplorer2. Here are the websites of find_circ and circexplorer2: https://github.com/marvin-jens/find_circ https://circexplorer2.readthedocs.io/en/latest/

best regards, yjm1992312
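When two tools share no candidates at all, it may be worth ruling out a coordinate or naming mismatch before comparing biology. The sketch below is a hedged illustration only: `normalize` is a hypothetical helper, the coordinates are invented, and whether a given tool reports BED-style 0-based starts or omits the "chr" prefix is an assumption that has to be checked against each tool's documentation.

```python
def normalize(chrom, start, end, zero_based_start=False, add_chr_prefix=False):
    """Map one circRNA record to a common (chrom, 1-based start, end) key."""
    if add_chr_prefix and not chrom.startswith("chr"):
        chrom = "chr" + chrom
    if zero_based_start:
        start += 1  # BED-style start to 1-based start
    return (chrom, start, end)


# Invented example: the same circRNA written in two conventions.
tool_a = {normalize("chr1", 101, 200)}                        # already 1-based
tool_b = {normalize("1", 100, 200, zero_based_start=True,
                    add_chr_prefix=True)}                     # 0-based start, no prefix

shared = tool_a & tool_b
print(shared)  # {('chr1', 101, 200)}
```

If the raw keys from both tools are intersected without such a conversion, an off-by-one start or a differing chromosome name is enough to produce an empty overlap.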