Closed elasekness closed 8 years ago
Thanks for your interest in using Snowball. Something is going wrong while reading the input files. Could you please paste the first 5 reads of each input file here and check that all reads have the same length (250)? I would just like to check that they are in the right format, e.g. that the read names end with "/1" and "/2", so that a paired-end read is named @NZ_AIEZ01000035.1-2/1 and @NZ_AIEZ01000035.1-2/2.
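The naming check requested above can be scripted. The following is only a rough sketch, not part of Snowball: the function names `read_names` and `check_pair_names` are made up for illustration, and it assumes plain 4-line FASTQ records whose header line contains nothing but the read name.

```python
from itertools import islice


def read_names(lines):
    """Yield the read name (header without the leading '@') of each 4-line record."""
    for header in islice(lines, 0, None, 4):
        yield header.strip()[1:]


def check_pair_names(r1_lines, r2_lines, suffixes=("/1", "/2")):
    """Return (index, name1, name2) for every mate pair whose names do not match.

    A pair matches when the first name ends with suffixes[0], the second with
    suffixes[1], and the names are identical once the suffixes are removed.
    """
    bad = []
    names = zip(read_names(r1_lines), read_names(r2_lines))
    for i, (n1, n2) in enumerate(names):
        ok = (
            n1.endswith(suffixes[0])
            and n2.endswith(suffixes[1])
            and n1[: -len(suffixes[0])] == n2[: -len(suffixes[1])]
        )
        if not ok:
            bad.append((i, n1, n2))
    return bad
```

Passing two open file handles (or lists of lines) returns an empty list when every pair is named consistently.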
Thank you for your rapid reply. I modified all sequence names to have a \1 or \2 ending but I did not trim all reads to be 250 bp. For some reason I was assuming that 250 was an average. I will trim to 250 bp and try to rerun. In the meantime, here are the first five reads from my F and R files:
R1: @M70114:17:000000000-ALHVJ:1:1101:16175:1339\1 CTACAATCGTTGTTGCCATGTAGGCCGGATAAGGCGTTTACGCCGCATCCGGCATTTGCTTAACGTAGAGTGAAGATTAACCTATCTGCCCATTCCCCAAATGTTTAACGCGGCGCTTAACTTCTTCCTGCTTACCGGCGTTAAATGGACGTGCATCCGGGCTACCTAAATATCCGCACACGCGGCGAGTCACCGACACACGGGAGGCGTCATGGTTACCACATTTCGGGCAAGTGAAGCCTTTGCTGGG + BBBCCFFFCAAFGGCGGFGGGGHHHHGGAFGHGHHGGGGHHGGGGGGGGHGGGGGHHHFHHGFHGHHHGFGGFGGHHHHHHHHHHHHGGHHHHGHFHHHCFHHHHHFFHGGGGGGGGGGHHHHHGHHHHHECHHHHGCCCFCGCGHHHHHFAEHHFHGGCGGGFFGGGFFGGGGGGGGGFFA;BBFAFFBFFFFFFFFFBABF=?DCFFFFFFFEFFFFFFFE;BB/FFFFF?ABFBBF/.BFFFFB/B. @M70114:17:000000000-ALHVJ:1:1101:15194:1346\1 CTGCACTCACTGCGTGACGTAAGCGGATGGAGTGGCCGGAAACCTCATAGTGACCGCCCACCAGTTGGCCTGCATCGCTTTGTAGCGTACGCGCGGCATTGGCA + CCCCCFFFFFFFGGGGGGGGGGHHGGGGGGHHHHGGHGGGGGHHHHHHHHHHHHHGGGGGGGHHHHHHHHHGHHHHGGHHGHHHHHGGGGGGGGGGGGGHHGHH @M70114:17:000000000-ALHVJ:1:1101:14942:1361\1 CCTTAACGCCATTATATTTATTTAATTGATGACATTAGCATAATCATTCACTAAGTTAATTTATATAGTATCTGCCAAGACACTTATTTATAGTTATTAAAGGCGCGTCCGATTGGTTCACCGGACGCACCTTAAGTACGTTTCCTTGTGTTATAAGAACAGAAGGATCAGCTGTAAAACAGCAATGATGATTTTGATGACCCGTTTAATCAGGTATCGGCAATCAGTCATTCGTTTTTCCTTAAACAAG + AAABBFFB3DAAGGGGGGGGGGHHFHHHHHHHHHHHHHGHHHHHHHHHHHHHGCGHHHFHHHHHFHHGHHHHEHHECHHFHGHHHHHHHHHGHHGFHHHHEHBGGCGGGGGGGGGHHHHHHHGC/@ECGGGHHHGHHHHDHGHHFHHHHHHHEHGB1F1FFBFBGEFFHHHHFHHHFFHFHC0GHFH0=GBGHFHHEHHHDGH-CGBGCGFFHHFHGH.BC?@FGGFBFGFGFFFFGGG/9CFG;00BEB @M70114:17:000000000-ALHVJ:1:1101:14768:1363\1 GCGCACGAGTGGCGATGATCTTTCAGGAACCGATGACCGCCCTCAATCCGACACGTCGAATAGGTCTTCAGATGATGGACGTGATCCGCCATCATCAACCAATAAGTCGTCGGGAAGCCAGAGCTAAAGCGATTGCCCTGCTGGAAGAGATGCAAATCCCGGATGCCGTGGAAGTTATGTCGCGCTATCCGTTTGAGCTTTCAGGTGGTATGCGCC + BABCCCCCCCCCGGGGGGGGGGHHHHHHHHGGGGGHHHGGGGGHHGHHHGGGGGGHHGGGHGHHHHHHHHHHHHHHHHHHGGGHHHHHGGGGHHHHGHHHGHHHHHGGGGGGGGGGGHHGHHHHHHHHHGGGHHGHHHGHHHHHHHHHHHHHHHGHHHHHFGGGGGHGEDGGGGFGGGGGFGGGGGFFFFFFFFFFFFFEBFFFF:FEAFFFFFFF @M70114:17:000000000-ALHVJ:1:1101:15715:1368\1 
ATATGAAACCCTTTTATGCCCAGGCTCTGTGTGACTACGCCGGAGGTCTCACCGCCTGCGACAATAAAGCGTGTCACGCCTTCCGCTGCTAACCGCGCCGCTAGTTGAGAAAACAGAGTTTCTACTGCCTGACTGGCTTTTTGTGCACCGTATTGCTGTTGAATTGCTGCCAATGCGTCAGTGCTGGCGGTGGCAAAAACCAGTGGAGCAAGTAC + CCCCCFFFFFCFGGGGGGGGGGHGHHHHGHHHHGHHHHGGGGGGGGGHHHHHHGGGDHFGGGGGHHHHHGGGGGHHHGGGHGHHGGGGGHHHHHGGGGCFGGGGHHHHHHHGGHGHHHHHHHHHGHHHGHHHHHHFHGHGHHGHHHHHHGHHHHHHGHHHHGHGHHGHHHHHGHHHGGGGGGGGGGGGGGABFAFFFFE.;AEBBBBEBEFFFFE
R2: @M70114:17:000000000-ALHVJ:1:1101:16175:1339\2 CCAATACACCGATTGATGAGTGCTACGAGTGTGGCTTTACCGGTGAGTTCGAGTGCACCAGCAAAGGCTTCACTTGCCCGAAATGTGGTAACCATGACGCCTCCCGTGTGTCGGTGACTCGCCGCGTGTGCGGATATTTAGGTAGCCCGGATGCACGTCCATTTAACGCCGGTAAGCAGGAAGAAGTTAAGCGCCGCGTTAAACATTTGGGGAATGGGCAGATAGGTTAATCTTCACTCTACGTTAAGCA + BAB@BFFFFFBBGGGGGGCFGFCHHHGGGGHHBEHFHGHHHGGCEGHFHHFHHHHHEGGHHGHHHHEHHF2EFGFHHHGGGGGHHHFGGHHHHHHHGHCFEFGHGGHGHHEHGG /EGFFFFGGGGC@CAG1 ?A>CDGGHHGFFHHHHHF?CFCHFGGHFGCGHFF0GDC?DGGAEFGF/99BFFBFFFFFF/BC9BC;@ =F?9B/BBFFFF-.9.;9...9AB/BFF///BBF9/BB/;99/.9BBEF/9 @M70114:17:000000000-ALHVJ:1:1101:15194:1346\2 TGCCAATGCCGCGCGTACGCTACAAAGCGATGCAGGCCAACTGGTGGGCGGTCACTATGAGGTTTCCGGCCACTCCATCCGCTTACGTCACGCAGTGAGTGCAG + CCCCCFFFFFCDGGGGGGGGGGGGGHHHGGHGGHHHGGHGFHHHHHGGGGGGGGHHHHHHHHGGHHHGGGGGGHHHHHHHGGGGGHGHHGGGHGGGHHHHHHHH @M70114:17:000000000-ALHVJ:1:1101:14942:1361\2 ATCTTTTACAAGCAACTTGCAATCTTTAGCATAAAAACTCGAGCCTTTACGAAGAAAGCAATATTGATGGAAAGATTAACGTGACCGCCAATTCGTAAGTACATTAAAATTGGCTTCGTTATTGAAGATTTTGCTGTGCTTTACACCATGCCACAGAATTCCCCCATTGAAACGAGTGGTGTCGTCACAGCTCTGGTGTGGAGTGCAGCATGCACCCTCAATCACTCGCACGTTCAGTTTTGGGGAAGTT + ABAAAFFFFFFFGGGGGFGGGGFHHHHHHGFHHEGCHGHHGHEFGGHHBGGGHG?GHHHHHHHHHEAGHHGFFCHHHHHHGHHHGHGEGFFGHFGHHGHHHHHHHGFFHHBD3BCGHCGHHHHHGDFHHFHBDGGHHGEFGHHFDDGADG2@FHFHFGHHFBD2 //?GFHG1?C/.<G.=<<1<<FG.0=.CGGHBGB<EA0CCGFHB/:CFBGB/:;9AC00;/CBE-A9@FFGEFF09BA ?.;...9/ @M70114:17:000000000-ALHVJ:1:1101:14768:1363\2 GGCGCATACCACCTGAAAGCTCAAACGGATAGCGCGACATAACTTCCACGGCATCCGGGATTTGCATCTCTTCCAGCAGGGCAATCGCTTTAGCTCTGGCTTCCCGACGACTTATTGGTTGATGATGGCGGATCACGTCCATCATCTGAAGACCTATTCGACGTGTCGGATTGAGGGCGGTCATCGGTTCCTGAAAGATCATCGCCCCTCGTGCGC + ABCCCCCCFFFFGGGGGGGGGGHHGHGGGGGHHGGGGGGGHHHHHHHHHGGGGGHHGGGGGHHHHHHFHHHHHHHHHHHGGGGGHFGGHHGHHHFHHHGHHHHHHGDGGCGGDFFHHHHGGHHHHHHHFGCGGGGHGHHGFHGGGG0=DGEGHHHHGGHBBEGDA//EGGBEGEBFFGGFABDFFBDADEFFFFFFFFFFFFFEF..-9:-AE.9= @M70114:17:000000000-ALHVJ:1:1101:15715:1368\2 
GTACTTGCTCCACTGGTTTTTGCCACCGCCAGCACTGACGCATTGGCAGCAATTCAACAGCAATACGGTGCACAAAAAGCCAGTCAGGCAGTAGAAACTCTGTTTTCTCAACTAGCGGCGCGGTTAGCAGCGGAAGGCGTGACACGCTTTATTGTCGCAGGCGGTGAGACCTCCGGCGTAGTCACACAGAGCCTGGGCATAAAAGGGTTTCATATCTTTCTTCTATACACATCTGACGCTGCCGAAGAAA + BCBBBFFFFFFFGGGGGGGGGGHHHHHGGGGGGHHHHHHGGGGGHFGGGGFFHHHHHHGHHHHGHHHGHHGHHHBFHEEHHHHHEHHHHHGHHHFBDFFHFHBFGHEHHHFFHGGGCG/BCGGD/CGFHHHGGGFADHCEGGGF<DFGGFHHHHHGG.ADCCDC--..09.F0C??-@---A//;F/..B//9/AE..;;/:F/B.:..;B/99:////////;/;///://;9///......---9.//
Erica Lasek-Nesselquist Assistant Professor University of Scranton Scranton, PA 18510 email: erica.lasek-nesselquist@scranton.edu, elasekness@gmail.com
Thanks! The read names look good. Yes, unfortunately, one has to trim all the reads to the same size (and probably filter out short reads). Nevertheless, the longer the reads, the better: the algorithm can handle bases with low quality scores, which typically occur at the ends of the reads.
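As a rough illustration of the trimming and filtering described above, here is a minimal sketch. It is not Snowball code: `trim_pairs` is a hypothetical helper, and it assumes the reads have already been parsed into `(name, sequence, quality)` tuples, with both files in the same order.

```python
def trim_pairs(r1_records, r2_records, length=250):
    """Yield (rec1, rec2) with both mates cut to exactly `length` bases.

    Each record is a (name, sequence, quality) tuple. A pair is skipped
    entirely if either mate is shorter than `length`, so the two output
    files stay in sync.
    """
    for (n1, s1, q1), (n2, s2, q2) in zip(r1_records, r2_records):
        if len(s1) < length or len(s2) < length:
            continue  # drop the whole pair, not just the short mate
        yield (n1, s1[:length], q1[:length]), (n2, s2[:length], q2[:length])
```

Choosing a shorter `length` keeps more pairs but discards bases from the longer reads; whichever value is used would presumably have to match the read length given to Snowball.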
Thanks for the clarification. I'll let you know how it goes.
I am trying to run Snowball with the command:
python algbioi/ga/run.py -f /local-homes/bioinformatics/erica/metagenome_stool/both_allseqs_R1.fastq.gz -s /local-homes/bioinformatics/erica/metagenome_stool/both_allseqs_R2.fastq.gz -m /local-homes/bioinformatics/erica/metagenome_stool/pfam-snowball.hmm -o both_contigs.fna.gz -i 338 -r 250
but it looks like there is an issue with the multiprocessing module. I receive this on stdout:
This hmmsearch binary will be used: /usr/bin/hmmsearch
Using temporary directory: /tmp/snowball_vxNG2N
Running on: Ubuntu 12.04 precise (linux2)
Using 32 processors
Settings:
Read length: 250
Insert size: 338
Min. overlap probability: 0.8
Min. overlap length: 0.5
Min. HMM score: 40
Joining paired-end reads into consensus reads, loading reads from:
/local-homes/bioinformatics/erica/metagenome_stool/both_allseqs_R1.fastq.gz
/local-homes/bioinformatics/erica/metagenome_stool/both_allseqs_R2.fastq.gz
Traceback (most recent call last):
File "algbioi/ga/run.py", line 493, in
_main()
File "algbioi/ga/run.py", line 468, in _main
outAnnot=outAnnot, cleanUp=cleanUp, processors=processors)
File "algbioi/ga/run.py", line 175, in mainSnowball
maxCpu=comh.MAX_PROC)
File "/local-homes/bioinformatics/erica/metagenome_stool/snowball_1_2/algbioi/com/fq.py", line 150, in joinPairEnd
retList = parallel.runThreadParallel(taskList, maxCpu)
File "/local-homes/bioinformatics/erica/metagenome_stool/snowball_1_2/algbioi/com/parallel.py", line 101, in runThreadParallel
retValList.append(taskHandler.get())
File "/usr/lib/python2.7/multiprocessing/pool.py", line 528, in get
raise self._value
IndexError: string index out of range
Any ideas on how to resolve this problem? Thank you in advance for your help, Erica
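The traceback does not show which read triggers the IndexError. Since the maintainer's replies in this thread ask for all reads to have the same length, one thing worth ruling out is a mix of read lengths in the input. The sketch below is a hypothetical helper (`read_length_counts` is not part of Snowball) that tallies sequence lengths in a FASTQ file, assuming plain 4-line records.

```python
import gzip
from collections import Counter


def read_length_counts(path):
    """Count sequence lengths in a FASTQ file (read via gzip if it ends in .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of each 4-line record is the sequence
                counts[len(line.rstrip("\r\n"))] += 1
    return counts
```

If the result for either input file has more than one key, or a key other than the configured read length, the reads are not uniform.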