algbioi / snowball

GNU General Public License v3.0

Parallel processing issue? #2

Closed elasekness closed 8 years ago

elasekness commented 8 years ago

I am trying to run snowball with the command:

python algbioi/ga/run.py -f /local-homes/bioinformatics/erica/metagenome_stool/both_allseqs_R1.fastq.gz -s /local-homes/bioinformatics/erica/metagenome_stool/both_allseqs_R2.fastq.gz -m /local-homes/bioinformatics/erica/metagenome_stool/pfam-snowball.hmm -o both_contigs.fna.gz -i 338 -r 250

but it looks like there is an issue with the multiprocessing module. I receive this on stdout:

This hmmsearch binary will be used: /usr/bin/hmmsearch
Using temporary directory: /tmp/snowball_vxNG2N
Running on: Ubuntu 12.04 precise (linux2)
Using 32 processors
Settings:
Read length: 250
Insert size: 338
Min. overlap probability: 0.8
Min. overlap length: 0.5
Min. HMM score: 40
Joining paired-end reads into consensus reads, loading reads from:
/local-homes/bioinformatics/erica/metagenome_stool/both_allseqs_R1.fastq.gz
/local-homes/bioinformatics/erica/metagenome_stool/both_allseqs_R2.fastq.gz
Traceback (most recent call last):
  File "algbioi/ga/run.py", line 493, in <module>
    _main()
  File "algbioi/ga/run.py", line 468, in _main
    outAnnot=outAnnot, cleanUp=cleanUp, processors=processors)
  File "algbioi/ga/run.py", line 175, in mainSnowball
    maxCpu=comh.MAX_PROC)
  File "/local-homes/bioinformatics/erica/metagenome_stool/snowball_1_2/algbioi/com/fq.py", line 150, in joinPairEnd
    retList = parallel.runThreadParallel(taskList, maxCpu)
  File "/local-homes/bioinformatics/erica/metagenome_stool/snowball_1_2/algbioi/com/parallel.py", line 101, in runThreadParallel
    retValList.append(taskHandler.get())
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 528, in get
    raise self._value
IndexError: string index out of range

Any ideas on how to resolve this problem? Thank you in advance for your help, Erica
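
A note on reading the traceback: it ends in `multiprocessing/pool.py`, but `get()` there only re-raises an exception that was raised inside a worker task. The `IndexError` therefore most likely originates in the code that processes the reads, not in the multiprocessing module itself. A minimal sketch of that behaviour follows. It is written for Python 3 and uses `ThreadPool` for brevity; `base_at` and `run_tasks` are hypothetical names for illustration and are not part of Snowball.

```python
from multiprocessing.pool import ThreadPool


def base_at(read, position):
    # Hypothetical worker: assumes the read has at least `position` bases.
    return read[position - 1]


def run_tasks(reads, position):
    """Run the worker on every read in a pool and collect results or errors."""
    results = []
    with ThreadPool(2) as pool:
        handlers = [pool.apply_async(base_at, (read, position)) for read in reads]
        for handler in handlers:
            try:
                # get() re-raises any exception raised inside the worker.
                results.append(handler.get())
            except IndexError as err:
                results.append("worker failed: %s" % err)
    return results


if __name__ == "__main__":
    # The second read is shorter than the requested position.
    print(run_tasks(["ACGT", "AC"], 4))
```

The short read fails inside the worker, and the error only surfaces when the result is collected, which is why the last frame in the traceback points at `pool.py`.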

algbioi commented 8 years ago

Thanks for your interest in using Snowball. There is something wrong with reading the input files. Could you please copy and paste the first 5 reads of each input file here and check that all reads have the same length of 250? I would just like to check that they are in the right format (e.g. that the read names end with "/1" and "/2"; for example, a paired-end read is named @NZ_AIEZ01000035.1-2/1 and @NZ_AIEZ01000035.1-2/2).
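
The two checks requested above (read-name suffix and uniform read length) can be scripted. The following is a minimal sketch for Python 3, assuming plain four-line FASTQ records in a gzipped file; the function names `iter_fastq`, `check_reads`, and `check_file` are hypothetical and not part of Snowball.

```python
import gzip
from itertools import islice


def iter_fastq(lines):
    """Yield (name, sequence, quality) from FASTQ lines, four lines per record."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return  # end of input (an incomplete trailing record is ignored)
        yield record[0], record[1], record[3]


def check_reads(lines, expected_len, suffix, limit=5):
    """Return a list of problems found in the first `limit` reads."""
    problems = []
    for name, seq, qual in islice(iter_fastq(lines), limit):
        if not name.endswith(suffix):
            problems.append("%s: name does not end with %s" % (name, suffix))
        if len(seq) != expected_len:
            problems.append("%s: length %d, expected %d" % (name, len(seq), expected_len))
        if len(seq) != len(qual):
            problems.append("%s: sequence and quality lengths differ" % name)
    return problems


def check_file(path, expected_len, suffix, limit=5):
    """Apply check_reads to a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        return check_reads(handle, expected_len, suffix, limit)
```

For example, `check_file("both_allseqs_R1.fastq.gz", 250, "/1")` would list any of the first five reads whose name or length does not match; pass `limit=None` to scan the whole file.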

elasekness commented 8 years ago

Thank you for your rapid reply. I modified all sequence names to have a \1 or \2 ending but I did not trim all reads to be 250 bp. For some reason I was assuming that 250 was an average. I will trim to 250 bp and try to rerun. In the meantime, here are the first five reads from my F and R files:

R1:
@M70114:17:000000000-ALHVJ:1:1101:16175:1339\1
CTACAATCGTTGTTGCCATGTAGGCCGGATAAGGCGTTTACGCCGCATCCGGCATTTGCTTAACGTAGAGTGAAGATTAACCTATCTGCCCATTCCCCAAATGTTTAACGCGGCGCTTAACTTCTTCCTGCTTACCGGCGTTAAATGGACGTGCATCCGGGCTACCTAAATATCCGCACACGCGGCGAGTCACCGACACACGGGAGGCGTCATGGTTACCACATTTCGGGCAAGTGAAGCCTTTGCTGGG
+
BBBCCFFFCAAFGGCGGFGGGGHHHHGGAFGHGHHGGGGHHGGGGGGGGHGGGGGHHHFHHGFHGHHHGFGGFGGHHHHHHHHHHHHGGHHHHGHFHHHCFHHHHHFFHGGGGGGGGGGHHHHHGHHHHHECHHHHGCCCFCGCGHHHHHFAEHHFHGGCGGGFFGGGFFGGGGGGGGGFFA;BBFAFFBFFFFFFFFFBABF=?DCFFFFFFFEFFFFFFFE;BB/FFFFF?ABFBBF/.BFFFFB/B.
@M70114:17:000000000-ALHVJ:1:1101:15194:1346\1
CTGCACTCACTGCGTGACGTAAGCGGATGGAGTGGCCGGAAACCTCATAGTGACCGCCCACCAGTTGGCCTGCATCGCTTTGTAGCGTACGCGCGGCATTGGCA
+
CCCCCFFFFFFFGGGGGGGGGGHHGGGGGGHHHHGGHGGGGGHHHHHHHHHHHHHGGGGGGGHHHHHHHHHGHHHHGGHHGHHHHHGGGGGGGGGGGGGHHGHH
@M70114:17:000000000-ALHVJ:1:1101:14942:1361\1
CCTTAACGCCATTATATTTATTTAATTGATGACATTAGCATAATCATTCACTAAGTTAATTTATATAGTATCTGCCAAGACACTTATTTATAGTTATTAAAGGCGCGTCCGATTGGTTCACCGGACGCACCTTAAGTACGTTTCCTTGTGTTATAAGAACAGAAGGATCAGCTGTAAAACAGCAATGATGATTTTGATGACCCGTTTAATCAGGTATCGGCAATCAGTCATTCGTTTTTCCTTAAACAAG
+
AAABBFFB3DAAGGGGGGGGGGHHFHHHHHHHHHHHHHGHHHHHHHHHHHHHGCGHHHFHHHHHFHHGHHHHEHHECHHFHGHHHHHHHHHGHHGFHHHHEHBGGCGGGGGGGGGHHHHHHHGC/@ECGGGHHHGHHHHDHGHHFHHHHHHHEHGB1F1FFBFBGEFFHHHHFHHHFFHFHC0GHFH0=GBGHFHHEHHHDGH-CGBGCGFFHHFHGH.BC?@FGGFBFGFGFFFFGGG/9CFG;00BEB
@M70114:17:000000000-ALHVJ:1:1101:14768:1363\1
GCGCACGAGTGGCGATGATCTTTCAGGAACCGATGACCGCCCTCAATCCGACACGTCGAATAGGTCTTCAGATGATGGACGTGATCCGCCATCATCAACCAATAAGTCGTCGGGAAGCCAGAGCTAAAGCGATTGCCCTGCTGGAAGAGATGCAAATCCCGGATGCCGTGGAAGTTATGTCGCGCTATCCGTTTGAGCTTTCAGGTGGTATGCGCC
+
BABCCCCCCCCCGGGGGGGGGGHHHHHHHHGGGGGHHHGGGGGHHGHHHGGGGGGHHGGGHGHHHHHHHHHHHHHHHHHHGGGHHHHHGGGGHHHHGHHHGHHHHHGGGGGGGGGGGHHGHHHHHHHHHGGGHHGHHHGHHHHHHHHHHHHHHHGHHHHHFGGGGGHGEDGGGGFGGGGGFGGGGGFFFFFFFFFFFFFEBFFFF:FEAFFFFFFF
@M70114:17:000000000-ALHVJ:1:1101:15715:1368\1
ATATGAAACCCTTTTATGCCCAGGCTCTGTGTGACTACGCCGGAGGTCTCACCGCCTGCGACAATAAAGCGTGTCACGCCTTCCGCTGCTAACCGCGCCGCTAGTTGAGAAAACAGAGTTTCTACTGCCTGACTGGCTTTTTGTGCACCGTATTGCTGTTGAATTGCTGCCAATGCGTCAGTGCTGGCGGTGGCAAAAACCAGTGGAGCAAGTAC
+
CCCCCFFFFFCFGGGGGGGGGGHGHHHHGHHHHGHHHHGGGGGGGGGHHHHHHGGGDHFGGGGGHHHHHGGGGGHHHGGGHGHHGGGGGHHHHHGGGGCFGGGGHHHHHHHGGHGHHHHHHHHHGHHHGHHHHHHFHGHGHHGHHHHHHGHHHHHHGHHHHGHGHHGHHHHHGHHHGGGGGGGGGGGGGGABFAFFFFE.;AEBBBBEBEFFFFE

R2:
@M70114:17:000000000-ALHVJ:1:1101:16175:1339\2
CCAATACACCGATTGATGAGTGCTACGAGTGTGGCTTTACCGGTGAGTTCGAGTGCACCAGCAAAGGCTTCACTTGCCCGAAATGTGGTAACCATGACGCCTCCCGTGTGTCGGTGACTCGCCGCGTGTGCGGATATTTAGGTAGCCCGGATGCACGTCCATTTAACGCCGGTAAGCAGGAAGAAGTTAAGCGCCGCGTTAAACATTTGGGGAATGGGCAGATAGGTTAATCTTCACTCTACGTTAAGCA
+
BAB@BFFFFFBBGGGGGGCFGFCHHHGGGGHHBEHFHGHHHGGCEGHFHHFHHHHHEGGHHGHHHHEHHF2EFGFHHHGGGGGHHHFGGHHHHHHHGHCFEFGHGGHGHHEHGG /EGFFFFGGGGC@CAG1 ?A>CDGGHHGFFHHHHHF?CFCHFGGHFGCGHFF0GDC?DGGAEFGF/99BFFBFFFFFF/BC9BC;@ =F?9B/BBFFFF-.9.;9...9AB/BFF///BBF9/BB/;99/.9BBEF/9
@M70114:17:000000000-ALHVJ:1:1101:15194:1346\2
TGCCAATGCCGCGCGTACGCTACAAAGCGATGCAGGCCAACTGGTGGGCGGTCACTATGAGGTTTCCGGCCACTCCATCCGCTTACGTCACGCAGTGAGTGCAG
+
CCCCCFFFFFCDGGGGGGGGGGGGGHHHGGHGGHHHGGHGFHHHHHGGGGGGGGHHHHHHHHGGHHHGGGGGGHHHHHHHGGGGGHGHHGGGHGGGHHHHHHHH
@M70114:17:000000000-ALHVJ:1:1101:14942:1361\2
ATCTTTTACAAGCAACTTGCAATCTTTAGCATAAAAACTCGAGCCTTTACGAAGAAAGCAATATTGATGGAAAGATTAACGTGACCGCCAATTCGTAAGTACATTAAAATTGGCTTCGTTATTGAAGATTTTGCTGTGCTTTACACCATGCCACAGAATTCCCCCATTGAAACGAGTGGTGTCGTCACAGCTCTGGTGTGGAGTGCAGCATGCACCCTCAATCACTCGCACGTTCAGTTTTGGGGAAGTT
+
ABAAAFFFFFFFGGGGGFGGGGFHHHHHHGFHHEGCHGHHGHEFGGHHBGGGHG?GHHHHHHHHHEAGHHGFFCHHHHHHGHHHGHGEGFFGHFGHHGHHHHHHHGFFHHBD3BCGHCGHHHHHGDFHHFHBDGGHHGEFGHHFDDGADG2@FHFHFGHHFBD2 //?GFHG1?C/.<G.=<<1<<FG.0=.CGGHBGB<EA0CCGFHB/:CFBGB/:;9AC00;/CBE-A9@FFGEFF09BA ?.;...9/
@M70114:17:000000000-ALHVJ:1:1101:14768:1363\2
GGCGCATACCACCTGAAAGCTCAAACGGATAGCGCGACATAACTTCCACGGCATCCGGGATTTGCATCTCTTCCAGCAGGGCAATCGCTTTAGCTCTGGCTTCCCGACGACTTATTGGTTGATGATGGCGGATCACGTCCATCATCTGAAGACCTATTCGACGTGTCGGATTGAGGGCGGTCATCGGTTCCTGAAAGATCATCGCCCCTCGTGCGC
+
ABCCCCCCFFFFGGGGGGGGGGHHGHGGGGGHHGGGGGGGHHHHHHHHHGGGGGHHGGGGGHHHHHHFHHHHHHHHHHHGGGGGHFGGHHGHHHFHHHGHHHHHHGDGGCGGDFFHHHHGGHHHHHHHFGCGGGGHGHHGFHGGGG0=DGEGHHHHGGHBBEGDA//EGGBEGEBFFGGFABDFFBDADEFFFFFFFFFFFFFEF..-9:-AE.9=
@M70114:17:000000000-ALHVJ:1:1101:15715:1368\2
GTACTTGCTCCACTGGTTTTTGCCACCGCCAGCACTGACGCATTGGCAGCAATTCAACAGCAATACGGTGCACAAAAAGCCAGTCAGGCAGTAGAAACTCTGTTTTCTCAACTAGCGGCGCGGTTAGCAGCGGAAGGCGTGACACGCTTTATTGTCGCAGGCGGTGAGACCTCCGGCGTAGTCACACAGAGCCTGGGCATAAAAGGGTTTCATATCTTTCTTCTATACACATCTGACGCTGCCGAAGAAA
+
BCBBBFFFFFFFGGGGGGGGGGHHHHHGGGGGGHHHHHHGGGGGHFGGGGFFHHHHHHGHHHHGHHHGHHGHHHBFHEEHHHHHEHHHHHGHHHFBDFFHFHBFGHEHHHFFHGGGCG/BCGGD/CGFHHHGGGFADHCEGGGF<DFGGFHHHHHGG.ADCCDC--..09.F0C??-@---A//;F/..B//9/AE..;;/:F/B.:..;B/99:////////;/;///://;9///......---9.//

Erica Lasek-Nesselquist Assistant Professor University of Scranton Scranton, PA 18510 email: erica.lasek-nesselquist@scranton.edu, elasekness@gmail.com

algbioi commented 8 years ago

Thanks! The read names look good. Yes, unfortunately, one has to trim all the reads to the same size (and probably filter out short reads). Nevertheless, the longer the reads, the better: the algorithm can handle bases with low quality scores, which typically occur at the end of the reads.
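
One possible way to do the trimming and filtering described above is sketched below for Python 3. It assumes gzipped four-line FASTQ files with mates in the same order in both files, trims from the end of each read, and drops a pair when either mate is shorter than the target length so the two files stay in sync. The function names `read_records`, `trim_pairs`, and `trim_files` are hypothetical and not part of Snowball.

```python
import gzip
from itertools import islice


def read_records(lines):
    """Yield FASTQ records as [name, sequence, separator, quality] lists."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return  # end of input
        yield record


def trim_record(record, length):
    """Cut the sequence and quality string of one record to `length` characters."""
    name, seq, sep, qual = record
    return [name, seq[:length], sep, qual[:length]]


def trim_pairs(records1, records2, length):
    """Yield trimmed mate pairs, skipping pairs where either mate is too short."""
    for rec1, rec2 in zip(records1, records2):
        if len(rec1[1]) < length or len(rec2[1]) < length:
            continue  # drop both mates so the two files stay paired
        yield trim_record(rec1, length), trim_record(rec2, length)


def trim_files(in1, in2, out1, out2, length):
    """Trim two paired gzipped FASTQ files and return the number of pairs kept."""
    kept = 0
    with gzip.open(in1, "rt") as f1, gzip.open(in2, "rt") as f2, \
            gzip.open(out1, "wt") as o1, gzip.open(out2, "wt") as o2:
        for rec1, rec2 in trim_pairs(read_records(f1), read_records(f2), length):
            o1.write("\n".join(rec1) + "\n")
            o2.write("\n".join(rec2) + "\n")
            kept += 1
    return kept
```

With the reads shown earlier in this thread, a target length of 250 would discard the pairs whose reads are around 104 or 215 bases long, so a shorter target length may be a reasonable trade-off between read length and the number of reads retained.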

elasekness commented 8 years ago

Thanks for the clarification. I'll let you know how it goes.
