fhs / ZipFile.jl

Read/Write ZIP archives in Julia

SystemError seek: Bad file descriptor #14

Open andrewcooke opened 9 years ago

andrewcooke commented 9 years ago
julia> using ZipFile

julia> io = ZipFile.Reader("test/gml/polblogs.zip").files[1]
ZipFile.ReadableFile(name=polblogs.gml, method=Deflate, uncompresssedsize=977839, compressedsize=93369, mtime=1.156468828e9)

julia> readline(io)
"Creator \"Lada Adamic on Tue Aug 15 2006\"\n"

julia> readline(io)
ERROR: SystemError: seek: Bad file descriptor
 in seek at ./iostream.jl:49
 in read at /home/andrew/.julia/v0.4/ZipFile/src/ZipFile.jl:410
 in readuntil at io.jl:174
 in readuntil at io.jl:156
 in readline at io.jl:217

the file is from http://www-personal.umich.edu/%7Emejn/netdata/polblogs.zip

if i do zmore or similar at the command line it has plenty more lines.

am i doing something dumb or is this an issue in your library? i was hoping it would be a simple IO instance i could treat like a file (including rewind).

thanks.

edit:

   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+5928 (2015-07-12 04:57 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit a9e0dd2 (5 days old master)
|__/                   |  x86_64-suse-linux

ZipFile was latest (Pkg.update()) at time of posting.

[edit2: cut + paste header from wrong julia - this was with 0.4 trunk, as updated above]

andrewcooke commented 9 years ago

works fine on 0.3 with the same file / machine by the way.

andrewcooke commented 9 years ago

well, updating julia to the latest from git seemed to fix this, so i guess it was a problem with trunk 5 days ago!

andrewcooke commented 9 years ago

spoke too soon! it now occurs after some random number of readlines, around 2000 or 3000.

(and so does 0.3 if you wait long enough!)

(that's 0.3 from git, not a released 0.3)

jonathanBieler commented 5 years ago

I'm getting the same error with this, from Immerse.jl:

const testdir = splitdir(@__FILE__)[1]
const facesdir = joinpath(testdir, "orl_faces")
const orl_url = "http://www.cl.cam.ac.uk/Research/DTG/attarchive/pub/data/att_faces.zip"

function unzip(inputfilename, outputpath=pwd())
    r = ZipFile.Reader(inputfilename)
    for f in r.files
        outpath = joinpath(outputpath, f.name)
        if isdirpath(outpath)
            mkpath(outpath)
        else
            open(outpath, "w") do io
                write(io, read(f))
            end
        end
    end
    nothing
end

julia> unzip(fn, facesdir)
ERROR: SystemError: seek: Bad file descriptor
Stacktrace:
 [1] #systemerror#39(::Nothing, ::Function, ::String, ::Bool) at ./error.jl:106
 [2] systemerror at ./error.jl:106 [inlined]
 [3] seek(::IOStream, ::Int64) at ./iostream.jl:101
 [4] read(::ZipFile.ReadableFile, ::Array{UInt8,1}) at /Users/jbieler/.julia/packages/ZipFile/02Psc/src/ZipFile.jl:452
 [5] read at /Users/jbieler/.julia/packages/ZipFile/02Psc/src/iojunk.jl:11 [inlined]
 [6] readbytes!(::ZipFile.ReadableFile, ::Array{UInt8,1}, ::Int64) at ./io.jl:813
 [7] read at ./io.jl:836 [inlined]
 [8] read(::ZipFile.ReadableFile) at ./io.jl:835
 [9] (::getfield(Main, Symbol("##23#24")))(::IOStream) at ./REPL[56]:9
 [10] #open#298(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::getfield(Main, Symbol("##23#24")), ::String, ::Vararg{String,N} where N) at ./iostream.jl:369
 [11] open at ./iostream.jl:367 [inlined]
 [12] unzip(::String, ::String) at ./REPL[56]:8
 [13] top-level scope at none:0

Is anything wrong here?

beniacm commented 5 years ago

I have the same problem with v0.8.1. It seems to be related to the finalizer in ZipFile.Reader: once the Reader reference goes out of scope, the finalizer closes the zip file's IO stream and seek fails.

this workaround works for me in a similar function: global r = ZipFile.Reader(inputfilename)
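
To illustrate the mechanism described above, here is a minimal sketch that uses only Base. `ToyReader` and `leak_stream` are made-up names standing in for ZipFile.Reader; this is not ZipFile's actual code. Whether the first stream is already closed after `GC.gc()` depends on when the collector runs, so the sketch only prints that state instead of relying on it.

```julia
# Toy stand-in for ZipFile.Reader (hypothetical type, not the library's code):
# it owns a stream and closes it from a finalizer.
mutable struct ToyReader
    io::IOStream
    function ToyReader(path)
        r = new(open(path, "r"))
        finalizer(x -> isopen(x.io) && close(x.io), r)
        return r
    end
end

# Returns only the stream; the ToyReader itself becomes unreachable.
leak_stream(path) = ToyReader(path).io

path, tmp = mktemp()
write(tmp, "line 1\nline 2\n")
close(tmp)

io = leak_stream(path)
GC.gc()   # the finalizer may run here and close `io`
println("stream open after losing the owner: ", isopen(io))

# Keeping the owner referenced prevents the finalizer from running.
r = ToyReader(path)
still_open = GC.@preserve r begin
    GC.gc()
    isopen(r.io)
end
println("stream open while the owner is preserved: ", still_open)
close(r.io)
```

This is also why a global `r` helps: a global binding keeps the Reader reachable for as long as the binding exists.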

sylvaticus commented 5 years ago

It seems we have to explicitly close the file. Would it be possible for ZipFile to support the do block like a normal file IO operation? E.g.:

ZipFile.Writer("example.zip") do w
  f1 = ZipFile.addfile(w, "file1.txt");
  write(f1, "hello world!\n");
end
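
Until something like that exists in the package, a do-block form can be sketched in user code. `with_zip_reader` below is a hypothetical helper, not part of ZipFile.jl's API; it assumes only `ZipFile.Reader`, `ZipFile.Writer`, `ZipFile.addfile` and `close`, and writes a small archive to a temporary directory so the example is self-contained.

```julia
using ZipFile

# Hypothetical helper (not provided by ZipFile.jl): open a Reader, pass it to
# the callback, and always close it afterwards.
function with_zip_reader(f, path)
    r = ZipFile.Reader(path)
    try
        return f(r)
    finally
        close(r)   # `r` is used here, so it stays reachable during f(r)
    end
end

# Build a small archive to read back.
path = joinpath(mktempdir(), "example.zip")
w = ZipFile.Writer(path)
f1 = ZipFile.addfile(w, "file1.txt")
write(f1, "hello world!\n")
close(w)

text = with_zip_reader(path) do r
    read(r.files[1], String)
end
print(text)
```

Anything read inside the block should be copied out (as `text` is here), since the file objects are not usable once the Reader is closed.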

kafisatz commented 4 years ago

I just ran into this too. I was about to compare runtimes of unzipping versus reading from the zip. The error occurs especially when using @btime.

The global approach works, but seems suboptimal.

MWE

using BenchmarkTools  # provides @btime
using CSV
using DataFrames
using ZipFile

src = raw"https://www.stats.govt.nz/assets/Uploads/Electronic-card-transactions/Electronic-card-transactions-June-2020/Download-data/electronic-card-transactions-june-2020-csv-tables.zip"

function readfromzip(zipFile,csvSep)
    z = ZipFile.Reader(zipFile)
    zippedcsv = filter(x->splitext(x.name)[2]==".csv",z.files)[1]
    aDf = CSV.read(read(zippedcsv),DataFrame,delim=csvSep,copycols=true,pool=false,lazystrings=true);
    return aDf
end

function unzipandread(zipFile,csvSep)
    outputFolder = mktempdir()
    cmd=`7z e $(zipFile) \*.csv -o$(outputFolder)`
    read(cmd)
    fi=readdir(outputFolder,join=true)[1]
    aDf = CSV.read(fi,DataFrame,delim=csvSep,copycols=true,pool=false,lazystrings=true);
    return aDf
end

zipFile = download(src);
csvSep=','

@time d1 = unzipandread(zipFile,csvSep);
@time d2 = readfromzip(zipFile,csvSep);
@assert isequal(d1,d2)

@btime unzipandread(zipFile,',');
@btime readfromzip(zipFile,',');

#either @time or @btime of readfromzip throws this error

ERROR: SystemError: seek: Bad file descriptor
Stacktrace:
 [1] systemerror(::String, ::Int32; extrainfo::Nothing) at .\error.jl:168
 [2] #systemerror#50 at .\error.jl:167 [inlined]
 [3] systemerror at .\error.jl:167 [inlined]
 [4] seek(::IOStream, ::Int64) at .\iostream.jl:108
 [5] read(::ZipFile.ReadableFile, ::Type{UInt8}) at C:\Users\me\.julia\packages\ZipFile\AwgTV\src\ZipFile.jl:488
 [6] readbytes!(::ZipFile.ReadableFile, ::Array{UInt8,1}, ::Int64) at .\io.jl:889
 [7] read at .\io.jl:912 [inlined]
 [8] read at .\io.jl:911 [inlined]
 [9] readfromzip(::String, ::Char) at .\REPL[29]:4
 [10] top-level scope at .\util.jl:175

julia> versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 16
  JULIA_EDITOR = "C:\Program Files\Microsoft VS Code\Code.exe"

nlw0 commented 4 years ago

I believe I'm facing this issue as well. First, you need to keep the dir object assigned before you do the "read"; you can't just keep one of the file objects. Furthermore, this dir object apparently has to be in the global scope.

nilshg commented 2 years ago

I just ran into this - has there ever been a reliable workaround? I'm doing this:

for zf ∈ zip_files
    rf = only(ZipFile.Reader(zf).files)
    if rf.name ∉ readdir(zip_folder_path) # Check whether file exists already to avoid duplication
        read_file = read(rf)
        target_dir = normpath(zip_folder_path, rf.name)
        write(target_dir, read_file)
    end
end

This randomly errors for some files, but when I run it multiple times I'll eventually get through the whole list of files (I've got 86 zip files in the directory), i.e. there aren't any "bad" files in there.

I could of course just add a try/catch and then wrap the whole loop in a for _ in 1:50; ...; end loop which hopefully successfully unzips all files, but that seems a bit brittle...

kafisatz commented 2 years ago

your suggested try/catch loop could be improved with a while loop where you check for success for each file to be unzipped. Still brittle, but no need to loop to 50 :)

mattwigway commented 2 years ago

Just ran into this. I think the problem is the underlying file gets closed if Julia garbage-collects the reader. To ensure that doesn't happen you can use a pattern like this:

rdr = ZipFile.Reader(filename)
# ... do stuff with rdr.files
close(rdr)

The last call to close is very important as referencing the reader here prevents the compiler/garbage collector from trashing the reader before we get here, as it knows we will still need it at this point for the close call. If you are opening a ZipFile with an existing io object, close() will be a no-op but I think it should still prevent gc.

mattwigway commented 2 years ago

I think this could be fixed on the ZipFile side by having a reference to the original Reader inside ReadableFile (and similarly for Writer/WritableFile). That way the Reader can't be gc'ed while there are still ReadableFile instances in scope.
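
A rough sketch of that idea, using toy types and only Base: `ToyArchive` and `ToyEntry` are invented for illustration and are not ZipFile.jl's real definitions. The point is only that a back-reference from the entry to its archive keeps the archive, and therefore its stream, reachable.

```julia
# Toy owner of the underlying stream (hypothetical, stands in for a Reader).
mutable struct ToyArchive
    io::IOStream
    function ToyArchive(path)
        a = new(open(path, "r"))
        finalizer(x -> isopen(x.io) && close(x.io), a)
        return a
    end
end

# Toy entry (hypothetical, stands in for a ReadableFile).
struct ToyEntry
    parent::ToyArchive   # back-reference: keeps the archive reachable
    offset::Int
end

# Read `n` bytes of the entry through the parent's stream.
function Base.read(e::ToyEntry, n::Integer)
    seek(e.parent.io, e.offset)
    return read(e.parent.io, n)
end

# Only the entry escapes, yet the archive survives via `entry.parent`.
first_entry(path) = ToyEntry(ToyArchive(path), 0)

path, tmp = mktemp()
write(tmp, "hello world")
close(tmp)

entry = first_entry(path)
GC.gc()                    # archive is still reachable through `entry`
bytes = read(entry, 5)
println(String(copy(bytes)))   # prints "hello"
```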

ryofurue commented 1 year ago

So, is this because the garbage collector throws away the original ZipFile.Reader object? If that's the case, referring to the object in the global scope will be a workaround.

I'm using Julia 1.9.2 and ZipFile v0.10.1.

Here is my sample code that crashes with "seek: Bad file descriptor" while reading from a large-ish file in the zip archive:

using ZipFile

function openzipstream()
  r = ZipFile.Reader("tmp.zip")
  display(r.files)
  return r.files[2]
end

function printout()
  is = openzipstream()
  println(readline(is))
  cnt = 1
  for line in eachline(is)
    println("$(cnt): $(line)")
    cnt += 1
  end
end

printout()
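
If the garbage-collection explanation given earlier in the thread is right, one way to restructure this sample without a global is to return the Reader together with the stream and close it when done. This is a sketch under that assumption; the `path` argument, the returned line count, and the two-file test archive are additions for illustration and are not part of the original sample.

```julia
using ZipFile

# Return the Reader along with the stream so the caller can keep it alive.
function openzipstream(path)
    r = ZipFile.Reader(path)
    return r, r.files[2]
end

function printout(path)
    r, is = openzipstream(path)
    cnt = 0
    try
        println(readline(is))
        for line in eachline(is)
            cnt += 1
            println("$(cnt): $(line)")
        end
    finally
        close(r)   # `r` stays reachable until reading has finished
    end
    return cnt     # number of numbered lines printed
end

# Build a two-file archive in a temporary directory to exercise the code.
path = joinpath(mktempdir(), "tmp.zip")
w = ZipFile.Writer(path)
write(ZipFile.addfile(w, "a.txt"), "first file\n")
write(ZipFile.addfile(w, "b.txt"), "header\nline one\nline two\n")
close(w)

nlines = printout(path)
```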