dieterich-lab / DCC

DCC uses output from the STAR read mapper to systematically detect back-splice junctions in next-generation sequencing data. DCC applies a series of filters and integrates data across replicate sets to arrive at a precise list of circRNA candidates.
https://dieterichlab.org/software/
GNU General Public License v3.0

Fail at 'Filter by non repetitive region' #53

Open bioysu opened 5 years ago

bioysu commented 5 years ago

Dear Sir, I tried to run DCC, but it fails every time at the step 'Filter by non repetitive region'. Could you help me fix it? Here is the command:

/data/suyao/tools/build/bin/DCC @/data/suyao/project/circRNA/output/DCC/samplesheet_test -mt1 @/data/suyao/project/circRNA/output/DCC/mate1_test -mt2 @/data/suyao/project/circRNA/output/DCC/mate2_test -D -R /data/suyao/data/UCSC/hg19/database/hg19_repeat.gtf -an /data/suyao/data/ensembl/grch37/release-92/Homo_sapiens.GRCh37.87.chr_patch_hapl_scaff_chr.gtf -k -Pi -F -M -Nr 5 3 -fg -G -A /data/suyao/data/UCSC/hg19/chromosomes/hg19.fa -O /data/suyao/project/circRNA/output/DCC/output_test

Here is the output on the screen:

Output folder /data/suyao/project/circRNA/output/DCC/output_test already exists, reusing
DCC 0.4.6 started
Input file names have duplicates, add number suffix in input order to output files for distinction
40 CPU cores available, using 2
Please make sure that the read pairs have been mapped both, combined and on a per mate basis
Collecting chimera information from mates-separate mapping
Combining individual circRNA read counts
Using files _tmp_DCC/tmp_circCount and _tmp_DCC/tmp_coordinates for filtering
Filtering by read counts
Traceback (most recent call last):
  File "/data/suyao/tools/build/bin/DCC", line 11, in <module>
    load_entry_point('DCC==0.4.6', 'console_scripts', 'DCC')()
  File "build/bdist.linux-x86_64/egg/DCC/main.py", line 349, in main
  File "build/bdist.linux-x86_64/egg/DCC/circFilter.py", line 92, in filter_nonrep
  File "build/bdist.linux-x86_64/egg/DCC/circFilter.py", line 85, in read_rep_region
  File "/data/suyao/tools/build/lib/python2.7/site-packages/HTSeq-0.10.0-py2.7-linux-x86_64.egg/HTSeq/__init__.py", line 210, in __iter__
    strand, frame, attributeStr) = line.split("\t", 8)
ValueError: need more than 1 value to unpack
started circRNA detection from file _tmp_DCC/Chimeric.out.junction.HF3HYK
=> separating duplicates [_tmp_DCC/Chimeric.out.junction.HF3HYK]
Read 113121040.+.113119607.ST-E00251:429:HCMF5CCXY:2:2202:28229:35801 has more than 2 count.
Read 113121040.+.113119607.ST-E00251:429:HCMF5CCXY:2:2202:28229:35801 has more than 2 count.
Read 113121040.+.113119607.ST-E00251:429:HCMF5CCXY:2:2202:28229:35801 has more than 2 count.
=> locating small circRNAs [_tmp_DCC/Chimeric.out.junction.HF3HYK]
=> locating circRNAs (stranded mode) [_tmp_DCC/Chimeric.out.junction.HF3HYK]
=> merging circRNAs [_tmp_DCC/Chimeric.out.junction.HF3HYK]
=> sorting circRNAs (stranded mode) [_tmp_DCC/Chimeric.out.junction.HF3HYK]
finished circRNA detection from file _tmp_DCC/Chimeric.out.junction.HF3HYK
started circRNA detection from file _tmp_DCC/Chimeric.out.junction.L546H9
=> separating duplicates [_tmp_DCC/Chimeric.out.junction.L546H9]
Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:23043:55614 has more than 2 count.
Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:23043:55614 has more than 2 count.
Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:22901:55860 has more than 2 count.
Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:22901:55860 has more than 2 count.
Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:23043:55614 has more than 2 count.
Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:22901:55860 has more than 2 count.
=> locating small circRNAs [_tmp_DCC/Chimeric.out.junction.L546H9]
=> locating circRNAs (stranded mode) [_tmp_DCC/Chimeric.out.junction.L546H9]
=> merging circRNAs [_tmp_DCC/Chimeric.out.junction.L546H9]
=> sorting circRNAs (stranded mode) [_tmp_DCC/Chimeric.out.junction.L546H9]
finished circRNA detection from file _tmp_DCC/Chimeric.out.junction.L546H9
started circRNA detection from file _tmp_DCC/Chimeric.out.junction.14SVFY
=> separating duplicates [_tmp_DCC/Chimeric.out.junction.14SVFY]
Read 14270933.-.14280546.ST-E00251:429:HCMF5CCXY:2:1219:13129:19346 has more than 2 count.
Read 14270933.-.14280546.ST-E00251:429:HCMF5CCXY:2:1219:13129:19346 has more than 2 count.
Read 135848567.-.135851264.ST-E00251:429:HCMF5CCXY:2:1221:20222:8675 has more than 2 count.
Read 135848567.-.135851264.ST-E00251:429:HCMF5CCXY:2:1221:20222:8675 has more than 2 count.
Read 30954186.-.30956927.ST-E00251:429:HCMF5CCXY:1:1221:30908:35133 has more than 2 count.
Read 30954186.-.30956927.ST-E00251:429:HCMF5CCXY:1:1221:30908:35133 has more than 2 count.
Read 30954186.-.30956927.ST-E00251:429:HCMF5CCXY:1:1221:30908:35133 has more than 2 count.
Read 98667021.-.98667505.ST-E00251:429:HCMF5CCXY:2:1206:13484:21087 has more than 2 count.
Read 98667021.-.98667505.ST-E00251:429:HCMF5CCXY:2:1206:13484:21087 has more than 2 count.
Read 30954186.-.30956927.ST-E00251:429:HCMF5CCXY:1:1221:30908:35133 has more than 2 count.
Read 98667021.-.98667505.ST-E00251:429:HCMF5CCXY:2:1206:13484:21087 has more than 2 count.
Read 135848567.-.135851264.ST-E00251:429:HCMF5CCXY:2:1221:20222:8675 has more than 2 count.
Read 14270933.-.14280546.ST-E00251:429:HCMF5CCXY:2:1219:13129:19346 has more than 2 count.
=> locating small circRNAs [_tmp_DCC/Chimeric.out.junction.14SVFY]
=> locating circRNAs (stranded mode) [_tmp_DCC/Chimeric.out.junction.14SVFY]
=> merging circRNAs [_tmp_DCC/Chimeric.out.junction.14SVFY]
=> sorting circRNAs (stranded mode) [_tmp_DCC/Chimeric.out.junction.14SVFY]
finished circRNA detection from file _tmp_DCC/Chimeric.out.junction.14SVFY

Here is the log file from the output folder:

2018-08-06 19:02:47,705 DCC 0.4.6 started
2018-08-06 19:02:47,705 DCC command line: /data/suyao/tools/build/bin/DCC @/data/suyao/project/circRNA/output/DCC/samplesheet_test -mt1 @/data/suyao/project/circRNA/output/DCC/mate1_test -mt2 @/data/suyao/project/circRNA/output/DCC/mate2_test -D -R /data/suyao/data/UCSC/hg19/database/hg19_repeat.gtf -an /data/suyao/data/ensembl/grch37/release-92/Homo_sapiens.GRCh37.87.chr_patch_hapl_scaff_chr.gtf -k -Pi -F -M -Nr 5 3 -fg -G -A /data/suyao/data/UCSC/hg19/chromosomes/hg19.fa -O /data/suyao/project/circRNA/output/DCC/output_test
2018-08-06 19:02:47,705 Input file names have duplicates, add number suffix in input order to output files for distinction
2018-08-06 19:02:47,713 Starting to detect circRNAs
2018-08-06 19:02:47,713 Stranded data mode
2018-08-06 19:02:47,713 Please make sure that the read pairs have been mapped both, combined and on a per mate basis
2018-08-06 19:02:47,713 Collecting chimera information from mates-separate mapping
2018-08-06 19:03:20,618 started circRNA detection from file _tmp_DCC/Chimeric.out.junction.HF3HYK
2018-08-06 19:03:20,618 started circRNA detection from file _tmp_DCC/Chimeric.out.junction.L546H9
2018-08-06 19:04:34,465 Read 113121040.+.113119607.ST-E00251:429:HCMF5CCXY:2:2202:28229:35801 has more than 2 count.
2018-08-06 19:04:34,469 Read 113121040.+.113119607.ST-E00251:429:HCMF5CCXY:2:2202:28229:35801 has more than 2 count.
2018-08-06 19:05:24,760 Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:23043:55614 has more than 2 count.
2018-08-06 19:05:24,764 Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:23043:55614 has more than 2 count.
2018-08-06 19:05:24,777 Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:22901:55860 has more than 2 count.
2018-08-06 19:05:24,781 Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:22901:55860 has more than 2 count.
2018-08-06 19:10:16,099 Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:23043:55614 has more than 2 count.
2018-08-06 19:10:16,110 Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:22901:55860 has more than 2 count.
2018-08-06 19:11:09,119 Read 113121040.+.113119607.ST-E00251:429:HCMF5CCXY:2:2202:28229:35801 has more than 2 count.
2018-08-06 19:11:56,154 finished circRNA detection from file _tmp_DCC/Chimeric.out.junction.L546H9
2018-08-06 19:11:56,154 started circRNA detection from file _tmp_DCC/Chimeric.out.junction.14SVFY
2018-08-06 19:12:24,298 Read 14270933.-.14280546.ST-E00251:429:HCMF5CCXY:2:1219:13129:19346 has more than 2 count.
2018-08-06 19:12:24,303 Read 14270933.-.14280546.ST-E00251:429:HCMF5CCXY:2:1219:13129:19346 has more than 2 count.
2018-08-06 19:12:56,735 Read 135848567.-.135851264.ST-E00251:429:HCMF5CCXY:2:1221:20222:8675 has more than 2 count.
2018-08-06 19:12:56,740 Read 135848567.-.135851264.ST-E00251:429:HCMF5CCXY:2:1221:20222:8675 has more than 2 count.
2018-08-06 19:13:22,486 finished circRNA detection from file _tmp_DCC/Chimeric.out.junction.HF3HYK
2018-08-06 19:16:28,007 Read 30954186.-.30956927.ST-E00251:429:HCMF5CCXY:1:1221:30908:35133 has more than 2 count.
2018-08-06 19:16:28,013 Read 30954186.-.30956927.ST-E00251:429:HCMF5CCXY:1:1221:30908:35133 has more than 2 count.
2018-08-06 19:18:20,494 Read 98667021.-.98667505.ST-E00251:429:HCMF5CCXY:2:1206:13484:21087 has more than 2 count.
2018-08-06 19:18:20,499 Read 98667021.-.98667505.ST-E00251:429:HCMF5CCXY:2:1206:13484:21087 has more than 2 count.
2018-08-06 19:21:44,266 Read 30954186.-.30956927.ST-E00251:429:HCMF5CCXY:1:1221:30908:35133 has more than 2 count.
2018-08-06 19:21:57,792 Read 98667021.-.98667505.ST-E00251:429:HCMF5CCXY:2:1206:13484:21087 has more than 2 count.
2018-08-06 19:23:01,357 Read 135848567.-.135851264.ST-E00251:429:HCMF5CCXY:2:1221:20222:8675 has more than 2 count.
2018-08-06 19:24:37,163 Read 14270933.-.14280546.ST-E00251:429:HCMF5CCXY:2:1219:13129:19346 has more than 2 count.
2018-08-06 19:25:01,135 finished circRNA detection from file _tmp_DCC/Chimeric.out.junction.14SVFY
2018-08-06 19:25:01,135 Combining individual circRNA read counts
2018-08-06 19:25:16,451 Write in annotation
2018-08-06 19:25:16,451 Select gene features in Annotation file
2018-08-06 19:30:00,390 Filtering started
2018-08-06 19:30:00,390 Using files _tmp_DCC/tmp_circCount and _tmp_DCC/tmp_coordinates for filtering
2018-08-06 19:30:02,196 Filtering by read counts
2018-08-06 19:30:02,862 Filter by non repetitive region

tjakobi commented 5 years ago

Dear @bioysu,

thank you for your feedback! Would it be possible for you to upload the repetitive region file you specified on the command line in some way? It seems that the HTSeq library, which parses the GTF file, has trouble with its syntax. I'd like to verify that the hg19_repeat.gtf file is valid.
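A quick local check may also help narrow this down. The traceback shows HTSeq unpacking the result of line.split("\t", 8), and "need more than 1 value to unpack" means some line yielded a single value, i.e. it contained no tab at all. The following is a minimal sketch, not part of DCC or HTSeq, that reports such lines; the function name find_malformed_gtf_lines and the sample records are hypothetical and only serve as illustration.

```python
import io


def find_malformed_gtf_lines(handle, max_report=10):
    """Return (line_number, field_count) for lines without 9 tab-separated fields.

    Hypothetical helper for illustration; it mirrors the split("\t", 8)
    call visible in the traceback but is not HTSeq's own parser.
    """
    bad = []
    for number, line in enumerate(handle, start=1):
        stripped = line.rstrip("\n")
        # Skipping empty lines and '#' comments is a choice made in this sketch.
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split("\t", 8)
        if len(fields) != 9:
            bad.append((number, len(fields)))
            if len(bad) >= max_report:
                break
    return bad


# Illustrative records: the first is tab-separated, the second uses spaces.
sample = io.StringIO(
    'chr1\thg19_rmsk\texon\t100\t200\t1.0\t+\t.\tgene_id "A"; transcript_id "A";\n'
    'chr1 hg19_rmsk exon 300 400 1.0 - . gene_id "B"; transcript_id "B";\n'
)
print(find_malformed_gtf_lines(sample))
```

To run it against the real file, pass an open file handle instead of the in-memory sample, for example `open("hg19_repeat.gtf")`. A reported field count of 1 would match the error message above.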

Cheers, Tobias

bioysu commented 5 years ago

hg19_repeat.gtf was downloaded from the UCSC Genome Browser. Here is the head of hg19_repeat.gtf:

chr1 hg19_rmsk exon 16777161 16777470 2147.000000 + . gene_id "AluSp"; transcript_id "AluSp";
chr1 hg19_rmsk exon 25165801 25166089 2626.000000 - . gene_id "AluY"; transcript_id "AluY";
chr1 hg19_rmsk exon 33553607 33554646 626.000000 + . gene_id "L2b"; transcript_id "L2b";
chr1 hg19_rmsk exon 50330064 50332153 12545.000000 + . gene_id "L1PA10"; transcript_id "L1PA10";
chr1 hg19_rmsk exon 58720068 58720973 8050.000000 - . gene_id "L1PA2"; transcript_id "L1PA2";
chr1 hg19_rmsk exon 75496181 75498100 10586.000000 + . gene_id "L1MB7"; transcript_id "L1MB7";

tjakobi commented 5 years ago

Dear @bioysu,

could you count the lines of the file and generate an md5sum? Using the same file, DCC does not produce the error here. Maybe the download was incomplete?
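If the command-line tools are not at hand, the same two numbers can be computed in Python. This is a minimal sketch under the assumption that counting newline bytes (as `wc -l` does) and an MD5 hex digest of the raw bytes (as `md5sum` prints) are what is wanted; the function name line_count_and_md5 is hypothetical, and the temporary file only stands in for hg19_repeat.gtf.

```python
import hashlib
import os
import tempfile


def line_count_and_md5(path, chunk_size=1 << 20):
    """Return (newline_count, md5_hex_digest) for the file at `path`.

    Reads in binary chunks so large annotation files need not fit in memory.
    """
    digest = hashlib.md5()
    newlines = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            newlines += chunk.count(b"\n")
    return newlines, digest.hexdigest()


# Demonstration on a small temporary file standing in for the real GTF.
with tempfile.NamedTemporaryFile("wb", suffix=".gtf", delete=False) as tmp:
    tmp.write(b"line one\nline two\n")
try:
    lines, checksum = line_count_and_md5(tmp.name)
    print(lines, checksum)
finally:
    os.remove(tmp.name)
```

Comparing both values between two copies of the file shows whether they are byte-identical; a lower line count on one side would point to a truncated download.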

Cheers, Tobias