IndexError: list index out of range

apastore commented 9 years ago

I downloaded SRR1613972.sra from http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?run=SRR1613972

After running fastq-dump -I --split-files SRR1613972.sra, I ran tag_to_header.py and got the following error.

Thanks for your support!

python tag_to_header.py --infile1 SRR1613972_1.fastq --infile2 SRR1613972_2.fastq --outfile1 read_1.fq.smi --outfile2 read_2.fq.smi --barcode_length 12 --spacer_length 5

Traceback (most recent call last):
  File "tag_to_header.py", line 196, in <module>
    main()
  File "tag_to_header.py", line 163, in main
    read1.name = hdrRenameFxn(read1.name, tag1, tag2)
  File "tag_to_header.py", line 121, in hdrRenameFxn
    return("%s|%s%s/%s" % (x.split("/")[0], y, z, x.split("/")[1]))

pkMyt1 commented 9 years ago

I had the same problem. I traced it to the version of Casava used on the sequencer. The read names no longer contain a "/", so x.split("/") returns a single element and index [1] does not exist; that is why you get the error. You can no longer split on "/". I fixed it by completely changing that line, but that requires changes to other lines in that file and in other files as well.
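To make the failure concrete, here is a minimal sketch that reproduces it outside the pipeline. `hdr_rename_original` copies the expression quoted in the traceback; `hdr_rename_tolerant`, its `default_read` parameter, and the two sample names are hypothetical and only for illustration, not the change that was actually made.

```python
def hdr_rename_original(x, y, z):
    # Expression from line 121 in the traceback: assumes the read name
    # ends in "/<read number>".
    return "%s|%s%s/%s" % (x.split("/")[0], y, z, x.split("/")[1])


def hdr_rename_tolerant(x, y, z, default_read="1"):
    # Hypothetical variant: if the name has no "/" suffix, fall back to
    # default_read instead of raising IndexError.
    base, sep, read_num = x.partition("/")
    if not sep:
        read_num = default_read
    return "%s|%s%s/%s" % (base, y, z, read_num)


old_style = "HWI-7001239F_017:1:1101:1226:2127/1"  # name with a "/1" suffix (illustrative)
sra_style = "SRR1613972.1"                         # name without any "/" (illustrative)

print(hdr_rename_original(old_style, "AAAA", "CCCC"))
try:
    hdr_rename_original(sra_style, "AAAA", "CCCC")
except IndexError as err:
    print("IndexError:", err)
print(hdr_rename_tolerant(sra_style, "AAAA", "CCCC"))
```

The fallback only hides the crash; whether a default read number is acceptable depends on how the rest of the pipeline uses the header.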

apastore commented 9 years ago

So you have modified tag_to_header.py line 121 from

return("%s|%s%s/%s" % (x.split("/")[0], y, z, x.split("/")[1]))

to

return("%s|%s%s/%s" % (x.split(".")[0], y, z, x.split(".")[1]))

thanks!

pkMyt1 commented 9 years ago

I actually rearranged the whole thing and put the barcode at the beginning of the header, but as I said, that requires other code changes. Hopefully one of the Loeb lab members will weigh in on this soon.


bkohrn commented 9 years ago

I looked at the file you referenced, and part of the issue is the way the SRA labels reads; I'll see what I can do about it. The other issue is that we at the Loeb lab traditionally start from qseq files and demultiplex them ourselves, which means we can place things like the read number wherever we want. I have written a short script that should reformat SRA reads to a format that won't break the rest of the pipeline.
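As an illustration of what such a reformatting step could look like, here is a minimal sketch. It is not the contents of SRAFixer.py: the function names, the `read_num` argument, and the target layout `@<instrument name>/<read number>` are all assumptions made for the example.

```python
def sra_header_to_legacy(line, read_num):
    """Turn '@SRR... <instrument name> length=N' into '@<instrument name>/<read_num>'."""
    fields = line.rstrip("\n").split(" ")
    if len(fields) < 2 or not fields[0]:
        raise ValueError("unexpected SRA header: %r" % line)
    marker = fields[0][0]  # '@' on header lines
    return "%s%s/%d" % (marker, fields[1], read_num)


def fix_fastq_lines(lines, read_num):
    """Rewrite 4-line FASTQ records; read_num is supplied by the caller (assumption)."""
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            yield sra_header_to_legacy(line, read_num)
        elif i % 4 == 2:
            yield "+"  # a bare '+' separator is valid FASTQ
        else:
            yield line


record = [
    "@SRR1613972.1 HWI-7001239F_017:1:1101:1226:2127 length=101\n",
    "NAATGACTTAAATG\n",
    "+SRR1613972.1 HWI-7001239F_017:1:1101:1226:2127 length=101\n",
    "BBBBBBBBBBBBBB\n",
]
for out in fix_fastq_lines(record, read_num=1):
    print(out)
```

Passing the read number in explicitly avoids guessing it from the header, at the cost of running the step once per mate file.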

apastore commented 9 years ago

Cool! thanks!

bkohrn commented 9 years ago

The script should be in the TestData folder (as it mostly pertains to the test data set). I'll get to work on the other problem (Illumina changing their output format) soon.

SRAFixer.py: https://github.com/loeblab/Duplex-Sequencing/blob/master/TestData/SRAFixer.py

apastore commented 9 years ago

I have tried the script, but the read names are still not split correctly. When I run head SRR1613972_1.fastq, these are the first reads I get:

@SRR1613972.1 HWI-7001239F_017:1:1101:1226:2127 length=101
NAATGACTTAAATGNCTTACACCACATGAAACACTGTCTCTTCTATAGGATCATTTATTTCACTAACAGCTGTTCTCATCATGATCTTTATAATTTGAGAN
+SRR1613972.1 HWI-7001239F_017:1:1101:1226:2127 length=101
BOV]^^^^WQBQQ\^^^^^^^^^]^^]]]^^]]^^^^^[^]^^^^^^]\^__]^]^_]____^^^^^^^^^^^]^^^[^B
@SRR1613972.2 HWI-7001239F_017:1:1101:1472:2086 length=101
NACCCTTTCGTGTCNCTGTGAAGAAGCATTCGGAAGCATCTTTGCAGGATTTGTCATCTCATATAATATTCCACCAACCAGCATTTCAGTCCTCACAATAC
+SRR1613972.2 HWI-7001239F_017:1:1101:1472:2086 length=101
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@SRR1613972.3 HWI-7001239F_017:1:1101:1374:2162 length=101
CGTCTCCGGAGGTGNCTGTAGAAGATGAATCCTAGTAGCCAACCTACTTATCTTAACCTGAATTGGGGGCCAACCAGTAGAACACCCATTTATTATCATTG

python TestData/SRAFixer.py --infile SRR1613972_1.fastq --outfile ~/Documents/forschung/MSKCC/Duplex_seq/SRR1613972_1.fastq.fix

Traceback (most recent call last):
  File "TestData/SRAFixer.py", line 43, in <module>
    readNum = line.split(' ')[0].split('.')[2]
IndexError: list index out of range

I have modified this line, replacing

readNum = line.split(' ')[0].split('.')[2]

with

readNum = line.split(' ')[0].split('.')[1]
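For anyone checking which index to use, here is a small sketch of what the two splits return on the headers shown in the head output. The helper names and the `fallback` argument are hypothetical, and the three-field form `accession.spot.read` is only inferred from the `[2]` index in the traceback. On these headers index [1] is the number after the accession, which increases with every record (1, 2, 3, ...), so it does not look like a mate number.

```python
def split_sra_id(header):
    """Return the dot-separated fields of the first token, without the leading '@'."""
    return header.split(" ")[0].lstrip("@").split(".")


def read_number(header, fallback):
    # Hypothetical helper: use the third field when present, otherwise
    # return a caller-supplied fallback instead of raising IndexError.
    fields = split_sra_id(header)
    return fields[2] if len(fields) > 2 else fallback


headers = [
    "@SRR1613972.1 HWI-7001239F_017:1:1101:1226:2127 length=101",
    "@SRR1613972.2 HWI-7001239F_017:1:1101:1472:2086 length=101",
    "@SRR1613972.3 HWI-7001239F_017:1:1101:1374:2162 length=101",
]
for h in headers:
    fields = split_sra_id(h)
    # Only two fields exist here, so fields[2] would raise IndexError.
    print(fields, "-> read number:", read_number(h, fallback="1"))
```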

tsibley commented 9 years ago

I believe the remaining issue here with the CASAVA header format is resolved by #5.

scottrk commented 4 years ago

Closing as resolved.
