dieterich-lab / DCC

DCC uses output from the STAR read mapper to systematically detect back-splice junctions in next-generation sequencing data. DCC applies a series of filters and integrates data across replicate sets to arrive at a precise list of circRNA candidates.
https://dieterichlab.org/software/
GNU General Public License v3.0

IndexError: list index out of range #31

Closed cmonger closed 7 years ago

cmonger commented 7 years ago

Hi all,

I am getting the following error message when trying to run DCC for use in FUCHS (with the suggested parameters):

DCC 0.4.4 started
Output folder ./ already exists, reusing
Temporary folder _tmp_DCC/ already exists, reusing
32 CPU cores available, using 2
Please make sure that the read pairs have been mapped both, combined and on a per mate basis
Collecting chimera information from mates-separate mapping
Traceback (most recent call last):
  File "/mnt/work/craig/.local/bin/DCC", line 9, in <module>
    load_entry_point('DCC==0.4.4', 'console_scripts', 'DCC')()
  File "build/bdist.linux-x86_64/egg/DCC/main.py", line 215, in main
  File "build/bdist.linux-x86_64/egg/DCC/main.py", line 489, in fixall
  File "build/bdist.linux-x86_64/egg/DCC/fix2chimera.py", line 80, in fixchimerics
  File "build/bdist.linux-x86_64/egg/DCC/fix2chimera.py", line 51, in fixmate2
IndexError: list index out of range

This seems to be a similar issue to https://github.com/dieterich-lab/DCC/issues/7

The command line used was: ~/.local/bin/DCC samplesheet.txt -mt1 mate1.txt -mt2 mate2.txt -D -R ../FUCHS/repeatsUCSC.gtf -an /mnt/work/index/cGriseus/ucsc_C_griseus_v1.0/ucsc_allmRNA_genenames.gtf -Pi -F -M -Nr 5 6 -fg -G -A /mnt/work/index/cGriseus/ucsc_C_griseus_v1.0/criGri1.fa

I have also tried calling the main script directly with:

python ../FUCHS/DCC-0.4.4/DCC/main.py samplesheet.txt -mt1 mate1.txt -mt2 mate2.txt -D -R ../FUCHS/repeatsUCSC.gtf -an /mnt/work/index/cGriseus/ucsc_C_griseus_v1.0/ucsc_allmRNA_genenames.gtf  -Pi -F -M -Nr 5 6 -fg -G -A /mnt/work/index/cGriseus/ucsc_C_griseus_v1.0/criGri1.fa
DCC 0.4.4 started

and get the same error.

I am running the software in a python environment (2.7.10) with

DCC==0.4.4
HTSeq==0.6.1
numpy==1.11.1
pandas==0.18.1
pysam==0.9.1.4
python-dateutil==2.5.3
pytz==2016.6.1
six==1.10.0

tjakobi commented 7 years ago

Thanks for the feedback @cmonger, I'll have a look at the error asap.

tjakobi commented 7 years ago

I obtained an internal data set producing the same error and will now try to fix the error.

tjakobi commented 7 years ago

Dear @cmonger, could you please check out the latest commit (89762f5f96fa3df7647366f0123eab6daff1f995) and try again? The commit includes a check for junction files, which were the cause of the problem with our in-house data set.

cmonger commented 7 years ago

git show
commit b64079b3a4d4964876c156fabd3a582c058a60c7
Merge: 89762f5 cb12793
Author: Tobias Jakobi <tobias.jakobi@med.uni-heidelberg.de>
Date:   Wed Sep 28 16:16:01 2016 +0200

    Merge branch 'master' of github.com:dieterich-lab/DCC

Using command:

~/.local/bin/DCC samplesheet.txt -mt1 mate1.txt -mt2 mate2.txt -D -R ../FUCHS/repeatsUCSC.gtf -an /mnt/work/index/cGriseus/ucsc_C_griseus_v1.0/ucsc_allmRNA_genenames.gtf -Pi -F -M -Nr 5 6 -fg -G -A /mnt/work/index/cGriseus/ucsc_C_griseus_v1.0/criGri1.fa

I get the same error:

Output folder ./ already exists, reusing
Temporary folder _tmp_DCC/ already exists, reusing
DCC 0.4.4 started
32 CPU cores available, using 2
Please make sure that the read pairs have been mapped both, combined and on a per mate basis
Collecting chimera information from mates-separate mapping
Traceback (most recent call last):
  File "/mnt/work/craig/.local/bin/DCC", line 9, in <module>
    load_entry_point('DCC==0.4.4', 'console_scripts', 'DCC')()
  File "build/bdist.linux-x86_64/egg/DCC/main.py", line 218, in main
  File "build/bdist.linux-x86_64/egg/DCC/main.py", line 492, in fixall
  File "build/bdist.linux-x86_64/egg/DCC/fix2chimera.py", line 80, in fixchimerics
  File "build/bdist.linux-x86_64/egg/DCC/fix2chimera.py", line 51, in fixmate2
IndexError: list index out of range

I also inspected each chimeric junction file to make sure they are not corrupt and the paths specified in the samplesheet/mate files were correct.

tjakobi commented 7 years ago

Thank you very much for your response. I'll look further into this issue and try to provide a patch soon.

tjakobi commented 7 years ago

Would you mind checking out commit c7f822b92e5fd45e18ad13dfcf4de2e082713962? It contains a simple check to print out the line in case the parsing fails. This should help to track down the error.

cmonger commented 7 years ago

Besides changes to the line numbers in the error message, I still get the same error!

Output folder ./ already exists, reusing
Temporary folder _tmp_DCC/ already exists, reusing
DCC 0.4.4 started
32 CPU cores available, using 2
Please make sure that the read pairs have been mapped both, combined and on a per mate basis
Collecting chimera information from mates-separate mapping
Traceback (most recent call last):
  File "/mnt/work/craig/.local/bin/DCC", line 9, in <module>
    load_entry_point('DCC==0.4.4', 'console_scripts', 'DCC')()
  File "build/bdist.linux-x86_64/egg/DCC/main.py", line 220, in main
  File "build/bdist.linux-x86_64/egg/DCC/main.py", line 494, in fixall
  File "build/bdist.linux-x86_64/egg/DCC/fix2chimera.py", line 91, in fixchimerics
  File "build/bdist.linux-x86_64/egg/DCC/fix2chimera.py", line 62, in fixmate2
IndexError: list index out of range

tjakobi commented 7 years ago

Argh! Maybe it makes more sense if I check the correct variable... Please try again with commit 215548334d8d3a12dab9d017432b92008da2a6e4. I'm sorry for the unnecessary run.

cmonger commented 7 years ago

WARNING: File mate2.txt, line 1 does not contain all features.
WARNING: mate2.txt is probably corrupt.
WARNING: Offending line: /mnt/work/craig/FUCHS/analysis/37DegreesRep1/mate2/37DegreesRep1_2Chimeric.out.junction

I am unsure why this error is occurring, as the file does have 14 fields! I will tar up the chimeric junction files etc. so you can investigate further. I'll be in touch by email shortly with a download link.

tjakobi commented 7 years ago

Hehe. After looking closer at the command line you supplied I spotted the error:

-mt1 mate1.txt -mt2 mate2.txt should be -mt1 @mate1.txt -mt2 @mate2.txt. With the @ prefix, Python reads the file and returns its lines directly as a list, which DCC then uses as the junction file paths. In your case DCC tries to parse the mate file itself as a junction file, which - of course - fails.
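For readers who have not met the @ prefix before: Python's standard argparse module supports it through the `fromfile_prefix_chars` option, which replaces an argument such as @mate1.txt with the lines of that file. The following is a minimal sketch of that behaviour, assuming a parser configured this way; the parser below is a hypothetical stand-in with a single `-mt1` option, not DCC's actual code, and the junction paths are made up.

```python
import argparse
import os
import tempfile

# Hypothetical stand-in parser with one option; NOT DCC's real parser.
parser = argparse.ArgumentParser(fromfile_prefix_chars="@")
parser.add_argument("-mt1", nargs="+", dest="mate1")

# Write a small list file with one (made-up) junction path per line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("rep1/mate1/Chimeric.out.junction\n")
    handle.write("rep2/mate1/Chimeric.out.junction\n")
    list_file = handle.name

# Without "@": the name of the list file itself is the only value.
plain = parser.parse_args(["-mt1", list_file]).mate1

# With "@": argparse reads the file and treats each line as one argument.
expanded = parser.parse_args(["-mt1", "@" + list_file]).mate1

os.remove(list_file)
print(plain)
print(expanded)
```

In the first call the program receives the list file as if it were a junction file, which matches the failure described in this thread; in the second it receives the two paths listed inside the file.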

I'll think about some way to catch this illegal command line before the main program starts up.
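One possible shape for such a pre-flight check, as a rough sketch only: before any parsing starts, look at the first non-empty line of every input and flag files that do not have enough tab-separated columns. The function names and the threshold of 14 columns (the field count mentioned earlier in this thread) are assumptions for illustration, not part of DCC.

```python
import os
import tempfile

MIN_FIELDS = 14  # assumption: field count mentioned earlier in this thread


def looks_like_junction_file(path, min_fields=MIN_FIELDS):
    """Check the first non-empty line for at least min_fields tab-separated columns."""
    with open(path) as handle:
        for line in handle:
            if line.strip():
                return len(line.rstrip("\n").split("\t")) >= min_fields
    return False  # this sketch also flags empty files


def find_suspicious_inputs(paths):
    """Return the inputs that do not look like junction files."""
    return [path for path in paths if not looks_like_junction_file(path)]


# Demo with a junction-like file and a list file that only contains a path.
workdir = tempfile.mkdtemp()
junction = os.path.join(workdir, "Chimeric.out.junction")
with open(junction, "w") as handle:
    handle.write("\t".join(["x"] * MIN_FIELDS) + "\n")

list_file = os.path.join(workdir, "mate1.txt")
with open(list_file, "w") as handle:
    handle.write(junction + "\n")

suspicious = find_suspicious_inputs([junction, list_file])
print(suspicious)
```

A list file passed without the @ prefix contains a single path per line and no tabs, so it would be flagged here, and the program could exit with a hint about the @ syntax instead of a traceback.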

cmonger commented 7 years ago

Haha of course it was something silly. I have not come across this syntax before and assumed the @ was used as notation for the user to specify their file name!

Apologies for the unnecessary testing! I look forward to seeing the results.