BioJulia / Libz.jl

Fast, flexible zlib bindings.
Other
27 stars 17 forks source link

lines getting lost #58

Open jnmaloof opened 7 years ago

jnmaloof commented 7 years ago

I have tried writing a julia script to subset lines out of a gzipped file. The file (a fastq file) has records that are 4 lines long. So I take 4 lines from the input file, and a fraction of the time write them to an output file. Thus the number of lines in the output file should always be a multiple of 4 (assuming that the input file has complete records...)

Unfortunately this is not the case, lines are getting lost. If I use a similar script to process uncompressed files and don't use Libz, then it works fine.

I am using Libz 0.2.4

Am I doing something wrong or is this a bug?

Thanks,

Julin

#! /usr/local/bin/julia

using Libz

length(ARGS)

if length(ARGS) == 0 || ARGS[1] == "-h" || ARGS[1] == "--help"
  println("The script will subset a gzipped fastq file to keep a percentage of reads")
  println()
  println("USAGE: Subset_fastq.jl KEEP FILE1 FILE2")
  println("where KEEP is a number betweeen 0 and 1 indicating the proportion of reads to keep")
  println("new files will be created in the directory where this script is invoked with the prefix `subset_`")
  exit()
end

keep=float(ARGS[1]) # fraction of reads to keep

for f in ARGS[2:length(ARGS)]
  if ismatch(r"\.fq\.gz$", f)
    println("processing $f")
    newname="subset_$f"
    fs=open(f) |> ZlibInflateInputStream
    outfs=open(newname,"w") |> ZlibDeflateOutputStream
    while !eof(fs)
      chunk=[readline(fs, chomp=false) for i=1:4]
      if rand() <= keep
        write(outfs,chunk);
      end # if rand
    end #while
    close(fs)
    close(outfs)
  end #if ismatch
end # for

demo

~/g/B/2/ForSubset (master) $ ../Subset_fastq.jl 0.1 wyo_leaf_R500_02_055.R1.fq.gz
processing wyo_leaf_R500_02_055.R1.fq.gz
~/g/B/2/ForSubset (master) $ gzcat subset_wyo_leaf_R500_02_055.R1.fq.gz | wc -l
 5181253 # output file lines not divisible by 4
~/g/B/2/ForSubset (master) $ gzcat wyo_leaf_R500_02_055.R1.fq.gz | wc -l
 51884848 # input file lines are divisible by 4
bicycle1885 commented 7 years ago

That looks a bug of Libz.jl. I'll take a look as soon as possible.

What happens if you use CodecZlib.jl instead of Libz.jl? CodecZlib.jl is a replacement of Libz.jl and now I'm focusing on the development of it. You can use fs = GzipDecompressionStream(open(f)); outfs = GzipCompressionStream(open(newname, "w")) in place of Libz.jl functions.

jnmaloof commented 7 years ago

It works correctly when I use CodecZlib.jl . Thanks!

bicycle1885 commented 7 years ago

CodecZlib.jl is still young but has a simpler design. So I think it is less buggy (but people still use Libz.jl so I'll need to fix it).

jnmaloof commented 7 years ago

good luck, and thanks for making these tools available!

Julin Maloof Professor Department of Plant Biology 1 Shields Ave University of California, Davis Davis, CA 95616 (530) 752 8077

On Fri, Aug 11, 2017 at 6:47 AM, Kenta Sato (佐藤 建太) < notifications@github.com> wrote:

CodecZlib.jl is still young but has a simpler design. So I think it is less buggy (but people still use Libz.jl so I'll need to fix it).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/BioJulia/Libz.jl/issues/58#issuecomment-321816596, or mute the thread https://github.com/notifications/unsubscribe-auth/ABgfmOGx9LI4OYvel-hO7ENYV6_wZNo-ks5sXFuGgaJpZM4Oy14G .