alexstaj / cutadapt

Automatically exported from code.google.com/p/cutadapt
0 stars 0 forks source link

On occasions will produce corrupt gzip files #79

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
After running cutadat in paired end mode the fastq file for read 2 is corrupt 
with gzip reporting an unexpected end of file. This does not occur with the 
same files as either uncompressed or bzip 2.

$ cutadapt --version
1.5

-rw-rw-r-- 1 jellul jellul 116176792 Sep  8 15:06 
1516_AGGCTAAC_L001_R1_001.fastq.gz
-rw-rw-r-- 1 jellul jellul 119799808 Sep  8 15:06 
1516_AGGCTAAC_L001_R2_001.fastq.gz

$ gzip -dc Halo_Fix/1516/1516_AGGCTAAC_L001_R1_001.fastq.gz | grep -n -A 3 
"D81P8DQ1:225:HAN1FADXX:1:2216:16087:99121"
4947753:@D81P8DQ1:225:HAN1FADXX:1:2216:16087:99121 1:N:0:AGGCTAAC
4947754-TTTTGTATTCATAAAGATGAAACGGACTTGCTATTTACTGATCAGCACAACATATGTCTTAAATTATCTGGC
CAGTTTATGAAGGAGGGAAACACTCAGATTAAAGAAGATTTGTCAGATTTAACTTTTTTGGAAGTTGCGAAAGCTCAA
4947755-+
4947756-=B@DFFFFHHHHHJJJJJJJJJJHIJJIJJJJJJJJIJJJJJJJJIIJJJIJJJJJHIJJJJJIIJIGIJIJ
JJIHIHHHHHHFFFFDDDDDDDDDDDDDDCCDDDDDDDDDDDEDECCACDDEDCDDDDDBDDDC@CCDDDBD<CCDDD

$ gzip -dc Halo_Fix/1516/1516_AGGCTAAC_L001_R2_001.fastq.gz | grep -n -A 3 
"D81P8DQ1:225:HAN1FADXX:1:2216:16087:99121"
4947753:@D81P8DQ1:225:HAN1FADXX:1:2216:16087:99121 2:N:0:AGGCTAAC

gzip: Halo_Fix/1516/1516_AGGCTAAC_L001_R2_001.fastq.gz: unexpected end of file
4947754-TGTTCTTTATTTGAAGTATTACCATGACATGCTTCTTGAGCTTTCGCAACTTCCAAAAAAGTTAAATCTGAC
AAATCTTCTTTAATCTGAGTGTTTCCCTCCTTCATAAACTGGCCAGATAATTTAAGACATATGTTGTGCTGATCAGT

Thanks Jason

Original issue reported on code.google.com by jas.pete...@gmail.com on 8 Sep 2014 at 5:13

GoogleCodeExporter commented 9 years ago
Is the output file always corrupt when run cutadapt with the same arguments 
again? How many lines does the output file have? (I'd like to know whether line 
4947754 is close to the end.)

Original comment by marcel.m...@tu-dortmund.de on 8 Sep 2014 at 6:32

GoogleCodeExporter commented 9 years ago
Hi Marcel,

Thanks for getting back.

FYI
$ gzip -dc Halo_Fix/1516/1516_AGGCTAAC_L001_R1_001.fastq.gz | wc -l
5068912

The commands used on all occasions are:
cutadapt \
-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
-m 35 -q 15 \
-o /home/jellul/Halo_Fix/1516/1516_AGGCTAAC_L001_R1_001.temp.fastq.gz \
-p /home/jellul/Halo_Fix/1516/1516_AGGCTAAC_L001_R2_001.temp.fastq.gz \
/home/jellul/Halo_Fix/1516/1516_AGGCTAAC_L001_R1_001.orig.fastq.gz \
/home/jellul/Halo_Fix/1516/1516_AGGCTAAC_L001_R2_001.orig.fastq.gz

cutadapt \
-a AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT \
-m 35 -q 15 \
-o /home/jellul/Halo_Fix/1516/1516_AGGCTAAC_L001_R2_001.fastq.gz \
-p /home/jellul/Halo_Fix/1516/1516_AGGCTAAC_L001_R1_001.fastq.gz \
/home/jellul/Halo_Fix/1516/1516_AGGCTAAC_L001_R2_001.temp.fastq.gz \
/home/jellul/Halo_Fix/1516/1516_AGGCTAAC_L001_R1_001.temp.fastq.gz

I have not tried with different params and 18 of 92 fastq pairs
experienced this error since I moved from 1.1 to 1.5

Original comment by jas.pete...@gmail.com on 8 Sep 2014 at 6:52

GoogleCodeExporter commented 9 years ago
Thanks for the info. So far, I haven't been able to reproduce the problem. 
Since no one else has reported this, I'd first like to exclude the possibility 
that it's a problem with your system. A simple explanation would be a disk 
space or quota problem. Another possibility is that your batch system (if you 
are using one) kills the job early. Did you check the logs that cutadapt 
produces? Were they written correctly, even when the output is corrupt?

In order to find out whether it's a bug in cutadapt, I need to know whether the 
corrupt output is reproducible. That is, if you get a corrupt output file from 
a given input file, and you then run cutadapt again, will you again get a 
corrupt output file?

By the way, you can use 'zgrep' to grep in gzipped files. For checking whether 
a file is corrupt, you can use 'zcat file.gz > /dev/null' . If the file is 
truncated, it'll print the 'unexpected end of file' message.

Original comment by marcel.m...@tu-dortmund.de on 8 Sep 2014 at 8:23

GoogleCodeExporter commented 9 years ago
Hi again Marcel,

I have done a fair bit of investigating and the problem only manifests when
I use a perl wrapper I had written to incorporate cutadapt into our
pipeline. It is reproducible with the particular fastq files and when using
the perl wrapper. The logs show no indication the command failed or
terminated early, for all intensive purposes it looks like the command
finished normally.

Had not experienced this with the older version 1.1 has anything changed to
the gzip side of things??

I have a feeling the environment perl is creating is causing gzip to
prematurely close both files while it is still in fact writing the 2nd file.

For now I am going to use a modified perl script to generate the cutadapt
commands (insert adapters etc according to fastqc output) and run the
commands in bash.

Thanks for looking into this,

Jason

Original comment by jas.pete...@gmail.com on 9 Sep 2014 at 2:58

GoogleCodeExporter commented 9 years ago
There actually was a change in cutadapt 1.2. Before that version, it used a 
Python library function to compress output or decompress input. But in 1.2, I 
switched to opening external gzip processes for both tasks. That is faster (at 
least for older Python versions) and also moves the (de-)compression into 
another process, so we get some degree of parallelism for free.

I just tried to reproduce the behavior you observed by killing one of the gzip 
processes spawned by cutadapt, but then the main cutadapt process notices it 
immediately and stops with an error message.

Maybe it's cutadapt's fault after all: I'm relying on the behavior that all 
open files are closed when a process ends. I'm not closing the output files 
explicitly. Let's see if I can fix this.

Original comment by marcel.m...@tu-dortmund.de on 9 Sep 2014 at 12:36

GoogleCodeExporter commented 9 years ago
If you have some time, I'd appreciate it if you could try out the newest 
cutadapt version available on Github: https://github.com/marcelm/cutadapt/ . 
I've tried to fix the bug (if it is one), but as I'm unable to trigger the 
problem myself, I cannot test it. You need to compile the checked-out cutadapt 
with 'python setup.py build_ext -i --cython'.

If you are not comfortable using Git or don't have Cython installed, I can 
provide you with a pre-compiled package.

Original comment by marcel.m...@tu-dortmund.de on 9 Sep 2014 at 12:57

GoogleCodeExporter commented 9 years ago
Hi Marcel,

Yes that seems to have fixed the problem. I will test it more tomorrow on
the other samples.

Thanks for all your help.

Kindest regards,

Jason

Original comment by jas.pete...@gmail.com on 10 Sep 2014 at 8:08

GoogleCodeExporter commented 9 years ago
Cool :-) Hopefully the other samples confirm the fix. If that was really it, 
then sorry for the bug, it is due to a (tiny) bit of laziness on my part.

Original comment by marcel.m...@tu-dortmund.de on 10 Sep 2014 at 8:18

GoogleCodeExporter commented 9 years ago
Hi Marcus,

Yes that has fixed the issue on all the samples which experienced this
problem thanks so much.

Oh and just to let you know previously I would place dummy adapters into
the command to just do bwa trimming but with this new version it errors out:

Traceback (most recent call last):
  File "/usr/local/cluster/all_arch/cutadapt/users/cutadapt", line 10, in
<module>
    cutadapt.main()
  File
"/usr/local/cluster/all_arch/cutadapt/users_python/cutadapt/scripts/cutadapt.py"
,
line 931, in main
    options.error_rate, readfilter.too_short, readfilter.too_long,
cmdlineargs, file=stat_file)
  File
"/usr/local/cluster/all_arch/cutadapt/users_python/cutadapt/scripts/cutadapt.py"
,
line 224, in print_statistics
    warning = warning or print_adjacent_bases(adapter.adjacent_bases,
adapter.sequence)
  File
"/usr/local/cluster/all_arch/cutadapt/users_python/cutadapt/scripts/cutadapt.py"
,
line 141, in print_adjacent_bases
    fraction = 1.0 * bases[base] / total
ZeroDivisionError: float division by zero

I am now testing without the dummy adapter but I thought I should let you
know. I am guessing you might want to do a check that total is not zero.

Original comment by jas.pete...@gmail.com on 11 Sep 2014 at 11:14

GoogleCodeExporter commented 9 years ago
Great to hear it's fixed! Regarding the bug you saw: That's due to a different 
change in the development version I introduced a few days ago, thanks for 
noticing. I just added a check for total == 0.

Regarding the quality trimming: You actually don't need to provide a dummy 
adapter! If you use just the quality trimming option -q, then cutadapt will 
happily do quality trimming only. It won't complain that you should have used 
-a or so. I'll update the documentation soon and will include a note about that.

Original comment by marcel.m...@tu-dortmund.de on 12 Sep 2014 at 5:44