JuliaIO / Tar.jl

TAR files: create, list, extract them in pure Julia
MIT License

option to ignore invalid checksums #91

Closed dpo closed 1 year ago

dpo commented 3 years ago

What am I doing wrong?

julia> url
"https://sparse.tamu.edu/MM/Oberwolfach/LF10.tar.gz"

julia> download(url, "LF10.tar.gz");

julia> Tar.list("LF10.tar.gz")
ERROR: invalid octal digit: 'V'
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] read_header_int(::SubArray{UInt8,1,Array{UInt8,1},Tuple{UnitRange{Int64}},true}, ::Int64, ::Int64) at /Users/dpo/.julia/packages/Tar/DQaSa/src/extract.jl:534
 [3] read_standard_header(::IOStream; buf::Array{UInt8,1}, tee::Base.DevNull) at /Users/dpo/.julia/packages/Tar/DQaSa/src/extract.jl:489
 [4] iterate_headers(::Tar.var"#71#72"{Array{Tar.Header,1}}, ::IOStream; raw::Bool, strict::Bool, buf::Array{UInt8,1}) at /Users/dpo/.julia/packages/Tar/DQaSa/src/extract.jl:17
 [5] #68 at /Users/dpo/.julia/packages/Tar/DQaSa/src/Tar.jl:131 [inlined]
 [6] open(::Tar.var"#68#69"{Bool,Bool,Tar.var"#71#72"{Array{Tar.Header,1}}}, ::String; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./io.jl:325
 [7] open at ./io.jl:323 [inlined]
 [8] arg_read at /Users/dpo/.julia/packages/ArgTools/4vlk9/src/ArgTools.jl:42 [inlined]
 [9] #list#67 at /Users/dpo/.julia/packages/Tar/DQaSa/src/Tar.jl:130 [inlined]
 [10] #list#70 at /Users/dpo/.julia/packages/Tar/DQaSa/src/Tar.jl:141 [inlined]
 [11] list(::String) at /Users/dpo/.julia/packages/Tar/DQaSa/src/Tar.jl:140
 [12] top-level scope at REPL[23]:1

julia> Tar.extract("LF10.tar.gz")
ERROR: invalid octal digit: 'V'  # same error

shell> tar ztvf LF10.tar.gz
-rw-------  0 davis  0        1662 Jan 30  2007 LF10/LF10.mtx
-rw-------  0 davis  0         367 Jan 30  2007 LF10/LF10_B.mtx
-rw-------  0 davis  0         367 Jan 30  2007 LF10/LF10_C.mtx
-rw-------  0 davis  0        1550 Jan 30  2007 LF10/LF10_E.mtx
-rw-------  0 davis  0        1651 Jan 30  2007 LF10/LF10_M.mtx

julia> VERSION
v"1.5.3"

macOS 10.15.7.

giordano commented 3 years ago

Tar.jl reads only uncompressed tarballs, so you need to gunzip it first. See, for example, the Compression section of the README.
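One quick way to tell whether a file still needs decompressing is to look at its first two bytes: gzip streams start with 0x1f 0x8b. The sketch below uses only Base; `looks_gzipped` is a hypothetical helper name, not part of Tar.jl.

```julia
# Hypothetical helper (not a Tar.jl function): check for the gzip magic bytes.
function looks_gzipped(path::AbstractString)
    open(path) do io
        magic = read(io, 2)              # reads at most two bytes
        return magic == UInt8[0x1f, 0x8b]
    end
end

# Demo on two throwaway files: one starting with the gzip magic, one plain.
gz_path = tempname()
write(gz_path, UInt8[0x1f, 0x8b, 0x08, 0x00])
plain_path = tempname()
write(plain_path, "LF10/LF10.mtx")

println(looks_gzipped(gz_path))     # gzip magic present
println(looks_gzipped(plain_path))  # plain bytes, no gzip magic
```

If the check returns true, the stream has to go through a decompressor (for instance a `GzipDecompressorStream`, as tried later in this thread) before it is handed to `Tar.list` or `Tar.extract`.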

dpo commented 3 years ago

Indeed, thanks. I'm not having much luck with that either though:

julia> f = GzipDecompressorStream(open("LF10.tar.gz"))
TranscodingStreams.TranscodingStream{GzipDecompressor,IOStream}(<mode=idle>)

julia> Tar.extract(f)
ERROR: incorrect header checksum = 0; should be 4674
"LF10/LF10.mtx\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x00100600 \0     0 \0     0 \0       3176 10557604363         0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0ustar\0\0\0davis\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0     0 \0     0 \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] read_standard_header(::TranscodingStreams.TranscodingStream{GzipDecompressor,IOStream}; buf::Array{UInt8,1}, tee::Base.DevNull) at /Users/dpo/.julia/packages/Tar/DQaSa/src/extract.jl:501
 [3] #read_header#47 at /Users/dpo/.julia/packages/Tar/DQaSa/src/extract.jl:370 [inlined]
 [4] read_tarball(::Tar.var"#25#27"{Array{UInt8,1},Bool,TranscodingStreams.TranscodingStream{GzipDecompressor,IOStream},String}, ::Tar.var"#1#2", ::TranscodingStreams.TranscodingStream{GzipDecompressor,IOStream}; buf::Array{UInt8,1}, skeleton::Base.DevNull) at /Users/dpo/.julia/packages/Tar/DQaSa/src/extract.jl:331
 [5] extract_tarball(::Function, ::TranscodingStreams.TranscodingStream{GzipDecompressor,IOStream}, ::String; buf::Array{UInt8,1}, skeleton::Base.DevNull, copy_symlinks::Bool) at /Users/dpo/.julia/packages/Tar/DQaSa/src/extract.jl:57
 [6] (::Tar.var"#76#79"{String,TranscodingStreams.TranscodingStream{GzipDecompressor,IOStream},Tar.var"#1#2"})(::Base.DevNull) at /Users/dpo/.julia/packages/Tar/DQaSa/src/Tar.jl:201
 [7] arg_write(::Tar.var"#76#79"{String,TranscodingStreams.TranscodingStream{GzipDecompressor,IOStream},Tar.var"#1#2"}, ::Base.DevNull) at /Users/dpo/.julia/packages/ArgTools/4vlk9/src/ArgTools.jl:94
 [8] (::Tar.var"#75#78"{TranscodingStreams.TranscodingStream{GzipDecompressor,IOStream},Tar.var"#1#2"})(::String) at /Users/dpo/.julia/packages/Tar/DQaSa/src/Tar.jl:200
 [9] arg_mkdir(::Tar.var"#75#78"{TranscodingStreams.TranscodingStream{GzipDecompressor,IOStream},Tar.var"#1#2"}, ::Nothing) at /Users/dpo/.julia/packages/ArgTools/4vlk9/src/ArgTools.jl:145
 [10] #74 at /Users/dpo/.julia/packages/Tar/DQaSa/src/Tar.jl:196 [inlined]
 [11] arg_read(::Tar.var"#74#77"{Tar.var"#1#2",Nothing}, ::TranscodingStreams.TranscodingStream{GzipDecompressor,IOStream}) at /Users/dpo/.julia/packages/ArgTools/4vlk9/src/ArgTools.jl:43
 [12] extract(::Function, ::TranscodingStreams.TranscodingStream{GzipDecompressor,IOStream}, ::Nothing; skeleton::Nothing, copy_symlinks::Nothing) at /Users/dpo/.julia/packages/Tar/DQaSa/src/Tar.jl:195
 [13] #extract#80 at /Users/dpo/.julia/packages/Tar/DQaSa/src/Tar.jl:217 [inlined]
 [14] extract at /Users/dpo/.julia/packages/Tar/DQaSa/src/Tar.jl:217 [inlined] (repeats 2 times)
 [15] top-level scope at REPL[15]:1

Zlib also gives an error. tar zxf happily extracts the archive but I'm trying to not rely on shell tools. Sorry if this is a basic question.

Edit: gunzipping the archive and then running Tar.extract fails with the same error.

giordano commented 3 years ago

It does work for me with other tarballs. I suspect that specific tarball might use non-standard extensions. Stefan probably knows more about that.

StefanKarpinski commented 3 years ago

Looks like whatever software produced this tarball didn't set the checksum field: it's all spaces, which is what the field is supposed to contain while the checksum is being computed. The sum of the bytes in the header block is then supposed to be written over those spaces, which doesn't seem to have been done here. So it fails the checksum integrity check, as indicated by the error message. We could potentially add an option to ignore checksums when extracting tarballs, but it definitely shouldn't be the default.
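The rule described above can be sketched in a few lines. This is a minimal illustration of the ustar checksum convention, not Tar.jl's actual code: `header_checksum` is a hypothetical helper, and the toy header only sets the name field.

```julia
# Hypothetical helper (not a Tar.jl function): sum the 512 header bytes,
# counting the 8-byte checksum field as if it were filled with spaces.
function header_checksum(block::AbstractVector{UInt8})
    length(block) == 512 || throw(ArgumentError("a tar header block is 512 bytes"))
    total = 0
    for (i, byte) in enumerate(block)
        # Bytes 149..156 (1-based) hold the checksum field.
        total += 149 <= i <= 156 ? Int(UInt8(' ')) : Int(byte)
    end
    return total
end

# Toy header: only the name is set, every other byte is NUL.
block = zeros(UInt8, 512)
name = codeunits("LF10/LF10.mtx")
copyto!(block, 1, name, 1, length(name))

checksum = header_checksum(block)

# Store the result as six octal digits followed by NUL and a space.
field = codeunits(string(checksum, base = 8, pad = 6) * "\0 ")
copyto!(block, 149, field, 1, 8)

# Recomputing gives the same value, because the field is counted as spaces.
println(header_checksum(block) == checksum)
```

A reader validates a header by doing the same computation and comparing it with the number stored in the field, which is why a field that was never filled in (or cannot be parsed) fails the check.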

dpo commented 3 years ago

Thanks for the info. I started noticing issues when moving from AppVeyor to GitHub Actions. The Windows workers apparently don't use the same version of tar, and the tar on Actions constantly errors out. I thought this package might help but, as you say, the issue may be with the archives in the first place.

Do artifacts use a more sophisticated extraction method that would allow me to sidestep this problem, or will I have the same issue if I try to access those tarballs as artifacts?

And finally, how does the tar executable on my Mac handle it? Ignore the checksum?

StefanKarpinski commented 3 years ago

There's no question of a "more sophisticated extraction method": the tarballs are invalid. Some tar implementations may skip verifying the checksum, but they should not. On Julia ≤ 1.5, Pkg uses whatever system tar happens to exist, which may or may not work, depending on whether that implementation verifies checksums. Julia ≥ 1.6 uses Tar.jl, which does verify checksums. My recommendation is to download these files, extract them with a tar implementation that ignores checksums, and then repackage them with a correct tar implementation (possibly Tar.create). I could add an option to ignore bad checksums in Tar.extract etc., which would not be on by default, but would allow using Tar.rewrite to read and recreate a correct tarball.
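The repackaging step could look roughly like the sketch below. It assumes the archive has already been extracted by some other tool; the temporary directory and its placeholder file merely stand in for that extracted tree, and the output name `LF10-fixed.tar` is made up for illustration.

```julia
using Tar

# Stand-in for the directory produced by extracting the original archive
# with a tar implementation that tolerates the bad checksum.
src = mktempdir()
mkdir(joinpath(src, "LF10"))
write(joinpath(src, "LF10", "LF10.mtx"), "placeholder contents\n")

# Repackage the tree with Tar.create, which writes its own header checksums.
fixed = joinpath(mktempdir(), "LF10-fixed.tar")
Tar.create(src, fixed)

# The new tarball can be listed by Tar.jl.
paths = [header.path for header in Tar.list(fixed)]
println("LF10/LF10.mtx" in paths)
```

The result is an uncompressed tarball; if a `.tar.gz` is needed, it would have to be compressed separately.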

dpo commented 3 years ago

I think it would make sense to at least have options to extract archives that other tar implementations out there do extract, even if those options are marked as "use with caution".

Thanks. I'll investigate repackaging those tarballs.

StefanKarpinski commented 3 years ago

Changed to a feature request for an option to ignore invalid checksums.

StefanKarpinski commented 1 year ago

I was looking at old issues and checked the example tarball here. It's the same issue as #111: the checksum field has non-standard leading space, which was causing the checksum to be interpreted as zero. This has since been fixed by allowing leading spaces in the checksum field.
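To illustrate what tolerating leading space in the field means, here is a minimal sketch of a lenient octal field parser. It is an assumption about the general idea only, not the code that went into Tar.jl, and `parse_octal_field` is a hypothetical name.

```julia
# Hypothetical helper (not Tar.jl's implementation): parse an octal header
# field, ignoring NUL and space padding on either side of the digits.
function parse_octal_field(field::AbstractVector{UInt8})
    text = strip(String(copy(field)), ['\0', ' '])
    isempty(text) && return 0          # a field with no digits is treated as 0 here
    return parse(Int, text; base = 8)  # throws on non-octal characters
end

standard = codeunits("011102\0 ")   # six digits, NUL, space
padded   = codeunits("  11102\0")   # same value with leading spaces

println(parse_octal_field(standard))
println(parse_octal_field(padded))
```

Both layouts decode to the same number, so a header written with leading spaces in the checksum field can still be compared against the computed sum.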