JuliaIO / GZip.jl

A Julia interface for gzip functions in zlib
https://juliaio.github.io/GZip.jl/dev
MIT License

eachline() reports extra line in GZip file but not in unzipped file #18

Open slundberg opened 10 years ago

slundberg commented 10 years ago

I found an issue where eachline() returned an extra empty line "" after the end of a gz file I was reading. The file ends in a single newline and has 171 lines in total. Reading the uncompressed file works fine, but as the output below shows, reading from the GZip stream produces a spurious blank line.

This only happens for this file (thousands of other such files worked fine), and if I change the file by more than a character or two, the bug goes away. Unfortunately, this is medical data, so I can't attach the file, but see the output below (using the current version of GZip):

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.3.0-rc4 (2014-08-15 04:01 UTC)
 _/ |\__'_|_|_|\__'_|  |  
|__/                   |  x86_64-redhat-linux
> using GZip
> open(f->length(readlines(f)), "/tmp/orig")
171
> GZip.open(fout->write(fout, open(readall, "/tmp/orig")), "/tmp/orig.gz", "w");
> GZip.open(f->length(readlines(f)), "/tmp/orig.gz")
172
> GZip.open(f->readlines(f), "/tmp/orig.gz")[end]
""
> open(fout->write(fout, GZip.open(readall, "/tmp/orig.gz")), "/tmp/orig2", "w");
> open(f->length(readlines(f)), "/tmp/orig2")
171

kmsquire commented 10 years ago

@slundberg, thanks for the report.

Can you give the zlib version you're using? You can get it with GZip.zlib_version.

Also, can you try installing the Zlib package and running the same test using Zlib.reader(open("/tmp/orig"))?

kmsquire commented 10 years ago

Sorry, that's Zlib.Reader(open("/tmp/orig")).

slundberg commented 10 years ago

Same issue with Zlib:

> GZip.zlib_version
"1.2.3"
> length(readlines(Zlib.Reader(open("/tmp/orig.gz"))))
172

I should also note that when I compress the file with gzip from the command line and then read it, everything is fine (at least for this file), so it only happens during a full read-write cycle.
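
The read-write cycle described here can be sketched as a small helper that writes a set of lines through an opener and counts what comes back. This is only a sketch: `roundtrip_line_count` is a hypothetical name, and `opener` is assumed to accept the `open(f, path, mode)` form that the session above uses for both `open` and `GZip.open`. The demo passes Base `open` so that it runs without GZip installed; passing `GZip.open` instead should exercise the compressed path.

```julia
# Hypothetical helper: write `lines` to `path` through `opener`, then read
# the file back with the same opener and return the number of lines found.
function roundtrip_line_count(opener, path::AbstractString, lines)
    opener(path, "w") do io
        for line in lines
            println(io, line)   # one newline-terminated line per entry
        end
    end
    # Read back with the function-first form, as in the session above.
    return opener(io -> length(readlines(io)), path)
end

path = tempname()
lines = ["alpha", "beta", "gamma"]
found = roundtrip_line_count(open, path, lines)
rm(path; force=true)

println(found == length(lines) ? "line counts match" :
        "mismatch: wrote $(length(lines)), read $found")
```

A mismatch between `found` and `length(lines)` would be the symptom reported in this issue.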

slundberg commented 10 years ago

One further update: a read-write-read cycle with Zlib works, but I don't know whether that is only because I may have chosen a different compression level than GZip uses by default.

> f = open("/tmp/orig.gz", "w")
> zf = Zlib.Writer(f, 9)
> write(zf, open(readall, "/tmp/orig"))
> length(readlines(Zlib.Reader(open("/tmp/orig.gz"))))
171

kmsquire commented 10 years ago

I was just going to suggest doing that. :-) At least that gives you a workaround.

You can set the compression level for gzip by appending the number to the file mode, e.g.,

f = GZip.open("/tmp/orig.gz", "w9");

Can you try that? Also, is it possible for you to try with a later version of zlib?

slundberg commented 10 years ago

With the compression level set to 6 in both packages, the issue appears with GZip but not with Zlib.

It also looks like it could be the zlib version. I can't change that on the server very easily, but on my MacBook with zlib 1.2.5 I don't see the issue.

Perhaps I can get a newer zlib sometime soon on the server and see if that resolves it there as well. For now I can just check for empty lines.
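
A minimal sketch of that "check for empty lines" workaround, assuming the lines are read with their terminators kept (the behavior in the session above; on Julia 1.x that would mean `readlines(io; keep=true)`). Under that assumption a genuine blank line is "\n", so only the spurious end-of-file entry is the zero-length string "". `drop_spurious_eof_line` is a hypothetical helper name and is not part of GZip.jl.

```julia
# Hypothetical helper (not part of GZip.jl): drop a zero-length final entry.
# Assumes line terminators were kept when reading, so a real blank line
# is "\n" and only the spurious end-of-file entry is "".
function drop_spurious_eof_line(lines::AbstractVector{<:AbstractString})
    if !isempty(lines) && isempty(last(lines))
        return lines[1:end-1]   # copy without the trailing "" entry
    end
    return copy(lines)          # nothing to drop; return an unchanged copy
end

# Hand-built inputs standing in for what readlines might return.
with_bug = ["first\n", "\n", "last\n", ""]
without_bug = ["first\n", "\n", "last\n"]

cleaned = drop_spurious_eof_line(with_bug)
untouched = drop_spurious_eof_line(without_bug)
println(length(cleaned), " ", length(untouched))
```

The helper only inspects the final entry, so blank lines in the middle of a file are left alone.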

kmsquire commented 10 years ago

GZip calls gzwrite, and Zlib doesn't, so that probably explains the difference. If you increase the buffer size for the write in Zlib, it might even be faster, if that matters. In the past, I've thought about merging those packages, since they're somewhat redundant, but I doubt I'll get to it anytime soon.

The zlib changelog shows a few fixes in gzwrite after version 1.2.3, so perhaps one of them fixed the issue.

Unfortunately, I'm not sure how we could detect this issue in GZip.jl, especially without a test example. If you have any thoughts, let me know.

slundberg commented 10 years ago

Thanks for being responsive on this! I ran a bunch of random tests and found a random file that had the same error after about 20k tries.

https://www.dropbox.com/s/qppddaryvgcmenl/test?dl=0

Perhaps it will give you the same error. If not it might be restricted to the zlib version I have.

slundberg commented 10 years ago

I also ran this script on my MacBook and found the same issue in a different random file, so I don't think the version is the issue. Perhaps you can run this and see if you find one on your setup? (You may need to increase the number of runs past 10k.)

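using GZip
using StatsBase  # provides sample()

# Write random text files through GZip and stop at the first one whose
# line count changes after reading it back.
for i in 1:10000
    f = GZip.open("/tmp/test.gz", "w")
    numLines = sample(100:800)  # random number of lines for this file
    for j in 1:numLines
        # each line is 10 to 300 randomly chosen tokens, concatenated
        println(f, join(sample(["aasdf", "dfs", "ds", "q", " ", " ", "s", "t", "b", "e", "hdffda sdf", "sdf", "xjkd", "df:0.1"], sample(10:300)), ""))
    end
    close(f)
    # read the file back and count the lines that come out
    foundLines = GZip.open(f->length(readlines(f)), "/tmp/test.gz")
    if foundLines != numLines
        println("Found example! $foundLines $numLines")
        break
    end
end

realzhang commented 5 years ago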

Same problem with GZip.zlib_version "1.2.7" and Julia version 1.1.0: any gz file on my CentOS machine yields an extra line when read with GZip.jl, but not when read as an unzipped plain text file. Any suggestions? Thanks a lot!