Open suhastech opened 3 years ago
We stumbled upon a tricky bug where no matter what, the ZIP produced by zipline was corrupt. Tracking it down was quite difficult.
It seems that the download of one of the S3 files failed here:
https://github.com/fringd/zipline/blob/master/lib/zipline/zip_generator.rb#L96
Possible solutions:
yeah we should handle this better.
I'm not sure how, but if we could abort the whole download that's probably the best option...
i googled a bit for some possible solutions here... https://stackoverflow.com/questions/15305203/what-to-do-with-errors-when-streaming-the-body-of-an-http-request
these two options seem like the best leads:
the trailer seems pretty robust, but i'm not sure how to do that. https://www.rubydoc.info/gems/rack/Rack/Chunked is the only mention of trailers in rails i could find.
a malformed chunk? not sure how to make rails do that either.
if anybody else has any clues, let me know
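For what it's worth, a minimal sketch (assuming a plain Rack streaming body in a Rails app, not zipline's actual internals) of what "aborting the whole download" could look like: if the body raises mid-stream, the server closes the connection before writing the terminating zero-length chunk of the chunked response, and most HTTP clients then report the download as failed instead of keeping a silently truncated ZIP.

# Illustrative only: AbortableBody is a made-up wrapper, not part of zipline or Rack.
class AbortableBody
  def initialize(&streaming_block)
    @streaming_block = streaming_block
  end

  # Rack calls #each and sends every yielded chunk to the client.
  def each(&write_chunk)
    @streaming_block.call(write_chunk)
  rescue => e
    # Log, then re-raise so the server drops the socket mid-response; the client
    # never receives the final zero-length chunk and treats the download as broken.
    Rails.logger.error("aborting zip stream: #{e.class}: #{e.message}")
    raise
  end
end

# Usage sketch: stream chunks through the proc, raise to abort.
# AbortableBody.new { |out| out.call(zip_bytes); raise "upstream fetch failed" }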
Hey, just an idea, it may be stupid, but may be not, so...
If zipline fails to fetch a file, or any other error occurs:
- log this
- if it was a failure to fetch a file, add the file URL to some kind of internal @skipped list
- if possible, do NOT add the failed file to the archive. If not, better to have a working archive with a few broken files than a wholly broken archive, especially if it's multiple GBs in size.
- move on to the next item on the list
- in the end, add a text file named, say, "failures.log", in which list the @skipped files with corresponding HTTP errors
I don't know the internal mechanism of zipline, so if skipping is not possible, then write the closing sequence for the broken file (if needed/possible), write the "failures.log", and finish up with a proper ZIP header (well, footer, technically) - to allow the client to unarchive at least what has been fetched so far.
Optionally, add a configuration/options/settings/initialiser to customise the name and content (using proc) of the failures log file (or disable it completely).
WDYT?...
this might be possible if the failure happened at the beginning. if it happened in the middle of the file that data has already been streamed out to the user and it's not possible to walk it back as far as i know. i think i'd rather abort than have a chance that the user thinks this file is valid.
that said, an errors.log would be better than nothing. i'll keep this in mind if we try the other options and can't get them working.
if it happened in the middle of the file that data has already been streamed out to the user and it's not possible to walk it back as far as i know
But there's no need to walk back.
Before switching to zipline, for many years we used a "classic" approach of downloading the files to the worker machine, zipping them, and then sending the archive to the client. From that experience, the vast majority of errors related to remote files happen when trying to fetch a file - for example, a 404 or a 5XX, or any other type of error. Something failing while a remote file is in the middle of a transfer is not something I can say I have ever encountered. So if it's possible to just log it and skip it - that would be best.
But even in the case where we've fetched, say, 20mb out of a 25mb file, and in the middle of the transfer the connection to the remote host went down or timed out — is it possible to stream to the client "ok, that's the end of this file, here's the next one" and just jump to the next file in the list? It is still much preferable to have a log + a corrupted 20mb file + 1999 complete files, rather than a broken 15Gb archive that you can't even examine, let alone salvage.
yeah actually that makes some sense to me... this gem is mostly for those giant downloads, so maybe just getting partial success is generally better. my worry is the user will not know that the 1 file in there is actually corrupt.
my worry is the user will not know that the 1 file in there is actually corrupt.
Hence the error.log ;) Better 1 corrupt file, than 15gb corrupt archive...
The problem is that a ZIP file which has been "chopped off" at the end is likely not to unarchive well, because the list of files inside the ZIP comes at the end of the file. You can read the ZIP "straight-ahead", but in the presence of data descriptors this is likely to fail in some apps, and it is very inefficient. It can also be faulty - for example, if entries are themselves ZIP files. What you need to make this work is a resumable download, similar to what we did at WeTransfer back in the day. You might be able to put something together using raw zip_tricks and interval_response so that when you address the download again you will be able to resume it. It is not trivial to do by any means, but doable.
An error.log would be a nice feature, although by the time your download gets interrupted you can no longer add anything to the response.
But even in the case where we've fetched, say, 20mb out of 25mb file, and in the middle of transfer the connection to remote host went down or timed out — is it possible to stream to the client "ok, that's the end of this file, here's the next one" and just jump to the next file in the list?
If your concern is the "upstream" (where you are pulling the file from) timing out on you or similar, you can pull data from it in chunks and resume where it interrupts. HTTP Range headers can be used for that.
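To make that concrete, here is a rough sketch of ranged fetching with resume (fetch_with_resume and the 5 MB slice size are made-up names, not zipline or zip_tricks API; a real implementation would also need a retry limit):

require "net/http"

def fetch_with_resume(url, sink, chunk_size: 5 * 1024 * 1024)
  uri = URI(url)
  offset = 0
  loop do
    begin
      Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
        request = Net::HTTP::Get.new(uri)
        # Ask only for the next slice; if the connection drops we resume from `offset`.
        request["Range"] = "bytes=#{offset}-#{offset + chunk_size - 1}"
        response = http.request(request)
        return if response.code == "416" # asked past the end of the file: we are done
        raise "unexpected response #{response.code}" unless %w[200 206].include?(response.code)
        sink << response.body
        return if response.code == "200" # server ignored Range and sent the whole file
        offset += response.body.bytesize
      end
    rescue Net::OpenTimeout, Net::ReadTimeout, Errno::ECONNRESET
      retry # reconnect and continue from the current offset
    end
  end
end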
@julik I'm not talking about "chopping off" zip files, I'm talking about chopped-off files that are being requested. Or, which is actually more frequent, non-existent 404 files that zipline is fetching from the URL. If I specify a list of 500+ URLs (S3, for example), and one of them in the middle of the list does not exist, then the whole 50gb archive is botched. And the frustration of downloading a 30Gb file which turns out to be broken, then downloading it again just to find out that it's still broken, is immense.
I'm not talking about chopping the zip file. I'm talking about at least skipping the failed archivable file. If the file being fetched throws an error, any error, then either skip writing it altogether or, if it's a partially fetched file, write what has been fetched so far (say, 50mb of 150), and move on to the next file. And at the very end, in the list of the zipped files, say "ok, this file is 50mb in size, next".
That way when the user downloads the archive, only 1 file in it will be "broken" (zero-sized, partial, or not present at all), not the whole archive itself.
That should actually work if you surround your call to write_deflated_file with a block with a rescue. If you do that, zip_tricks will ~move on to the next file and not add the file you have written to the central directory.~ The likelihood is high that the unarchiving software opening the file will be able to unarchive correctly based on the central directory, which gets output at the end. Try adding a rescue surrounding ZipGenerator#write_file and try with some large-ish outputs, interrupting them halfway?
Update - I was wrong, because even though ZipTricks will not finalize the file it will still add it to the central directory - albeit in a "broken" form. You would need to muck about with the internals a bit:
module SaferZipWrite
  # Removes a file that does not finish writing from the
  # @files array in the Streamer, so that the file will not
  # be present in the ZIP central directory. Some unarchivers
  # might then still be able to unarchive files which do complete writing.
  class StreamerWrapper < SimpleDelegator
    def write_deflated_file(...)
      files = __getobj__.instance_variable_get("@files")
      n_files_before_writing = files.size
      super
    rescue => e
      # do something with the exception
      files.pop if files.size > n_files_before_writing
    end

    def write_stored_file(...)
      files = __getobj__.instance_variable_get("@files")
      n_files_before_writing = files.size
      super
    rescue => e
      # do something with the exception
      files.pop if files.size > n_files_before_writing
    end
  end

  def write_file(streamer, file, name, options)
    super(StreamerWrapper.new(streamer), file, name, options)
  end
end

Zipline::ZipGenerator.prepend(SaferZipWrite)
Well... I actually did something similar: I've patched the write_file method to first attempt the download of the target file, with up to 3 retries; in case of failure after 3 retries, it writes a text file with the same name and the error message, and then proceeds with that file to streamer.write_stored_file (stored because we're serving mostly pictures, no BMPs, so it doesn't really make much sense to compress them even further. A "compression" option for the zipline method, maybe?). Sure, it's slower, since it has to download the file first, but it's more reliable. Still, not something that would fit everyone.
I'll try your suggestion, though, thanks! 👍 Maybe I don't have to download the whole file after all...
Still, would be nice to have this feature built-in ;)
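Roughly, that patch could look something like the sketch below (illustrative only, not the actual patch; fetch_body is an assumed helper that downloads the whole remote file and raises on HTTP errors):

module PrefetchingWrite
  MAX_ATTEMPTS = 3

  def write_file(streamer, file, name, options)
    attempts = 0
    begin
      attempts += 1
      body = fetch_body(file) # assumed helper: fully downloads the blob/URL, raises on failure
      streamer.write_stored_file(name) { |sink| sink << body }
    rescue => e
      retry if attempts < MAX_ATTEMPTS
      # After the last attempt, put a small text placeholder into the ZIP instead of the real file.
      streamer.write_stored_file("#{name}.error.txt") do |sink|
        sink << "Could not fetch #{name}: #{e.class}: #{e.message}"
      end
    end
  end
end

Zipline::ZipGenerator.prepend(PrefetchingWrite)

Because the download completes before anything is written to the streamer, a failed attempt never leaves a partially written entry behind, which is what makes the retries safe.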
even though ZipTricks will not finalize the file it will still add it to the central directory - albeit in a "broken" form
That's still much more preferable than a broken archive! 👍 😉
"compression" option for
zipline
method, maybe?
that is for @fringd to ponder 😆 The feature would be nice, indeed - I'd wager it would have to be added in zip_tricks though, and I haven't yet sorted out the maintenance there. There could be a heuristic in zipline where it picks stored or deflated depending on the filename extension, for instance. Note that adding an Excel or Word file as stored can lead to interesting bugs, because if a program reads a ZIP "straight ahead" it will sometimes think that parts of those Office files are themselves entries of the "outer" ZIP 🤡 I remember the macOS Archive Utility used to do this, but I don't know at which version they fixed it.
PRs welcome of course. I'm assuming there isn't enough local storage for files generally though, so anything that waits for a whole file before continuing is a non-starter 😕
For those pondering "heuristic" choice between using stored or deflated modes, here is something that could work nicely. It will monitor how well the input compresses (you could also pick a ratio or something), and buffer roughly that amount of data. Then it will pick the more efficient option and open a file for you in the Streamer.
require "zip_tricks"
require "zlib"
require "stringio"

module HeuristicWrite
  class DeflateHeuristic
    BYTES_WRITTEN_THRESHOLD = 65 * 1024

    def initialize(filename, streamer)
      @filename = filename
      @streamer = streamer

      @buf = StringIO.new.binmode
      @deflater = ::Zlib::Deflate.new(Zlib::DEFAULT_COMPRESSION, -::Zlib::MAX_WBITS)
      @bytes_deflated = 0

      @winner = nil
    end

    def <<(bytes)
      if @winner
        @winner << bytes
      else
        @buf << bytes
        @deflater.deflate(bytes) { |chunk| @bytes_deflated += chunk.bytesize }
        decide if @buf.size > BYTES_WRITTEN_THRESHOLD
      end
      self
    end

    def close
      decide unless @winner
      @winner.close
    end

    private def decide
      # Finish and then close the deflater - it has likely buffered some data
      @bytes_deflated += @deflater.finish.bytesize until @deflater.finished?
      # If the deflated version is smaller than the stored one - use deflate, otherwise stored
      ratio = @bytes_deflated / @buf.size.to_f
      if ratio < 0.75
        warn "using flate"
        @winner = @streamer.write_deflated_file(@filename)
      else
        warn "using stored"
        @winner = @streamer.write_stored_file(@filename)
      end
      @buf.rewind
      IO.copy_stream(@buf, @winner)
    end
  end

  def write_file(name)
    writable = DeflateHeuristic.new(name, self)
    yield(writable)
    writable.close
  end
end

ZipTricks::Streamer.include(HeuristicWrite)

io = StringIO.new
ZipTricks::Streamer.open(io) do |zip|
  zip.write_file("this will be stored.bin") do |io|
    90_000.times do
      io << Random.bytes(1024)
    end
  end

  zip.write_file("this will be deflated.bin") do |io|
    90_000.times do
      io << "many many delicious, compressible words"
    end
  end
end

warn io.size
@Urist-McUristurister @suhastech @fringd Now that the zip_tricks situation has been resolved - maybe we should think about an API for this? I've added a feature to zip_kit which excludes the failed file from the ZIP central directory and removes it from the path set - so you can add the same file multiple times if it fails halfway.
The issue is that you need to specify a retry policy - or have some code for it in some way. For example, you could have a retry policy where, if one of the files does not get included correctly or fails downloading from the remote, zipline continues to the next. Or one where it does 5 attempts and then aborts. Or one where it does 3 attempts with exponential backoff and then moves on to the next file if output still fails.
With "naked" zip_kit you could do that with something like https://github.com/kamui/retriable by doing
Retriable.retriable(on: TimeoutError) do
  zip.write_file("my_stuff.bin") do |sink|
    HTTP.get(the_url) { |response_body| IO.copy_stream(response_body, sink) }
  end
end
The configuration for Retriable could be passed into zipline as a keyword argument, but then retriable would become a dependency for zipline. It also does not allow specifying a policy such as "continue to the next file on that error", and the errors that can be raised by cloud services when fetching from remote are many.
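One purely hypothetical shape for a zipline-level API (the on_error keyword does not exist today; this just illustrates a pluggable policy without pulling in retriable as a dependency):

# Hypothetical: a caller-supplied proc decides what to do when a file fails.
error_policy = ->(filename, error) do
  Rails.logger.warn("could not add #{filename}: #{error.class}: #{error.message}")
  :skip # imaginable alternatives: :abort, :retry
end

zipline(files, "download.zip", on_error: error_policy)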
if fetching a file over http fails after x/n bytes written it would be nice if we could try to resume using range requests to start the download at x bytes if supported
i'm guessing most CDNs support range requests
Storage backends like S3 do, but it is not guaranteed - also S3 will not respond to a HEAD, only to a GET - you will need to do a "start request", and so on. I mean - all of these are doable; what I am wondering about is how to make it sufficiently generic so that it would be "good enough for most folks" - or, alternatively, how to provide an extension point for this. If you can resume an upstream download you can rescue inside of the write_file block - but even ActiveStorage will not give you an API for this. Not now at least, unless you pre-chop ranges to fetch. There is a lot of ranged fetching work here https://github.com/WeTransfer/format_parser/blob/master/lib/remote_io.rb and https://github.com/WeTransfer/format_parser/blob/master/lib/care.rb and https://github.com/WeTransfer/format_parser/blob/master/lib/active_storage/blob_io.rb for instance, and it has (for one) to be HTTP-client-specific 🤔
Okay, attempting to resume gracefully would be a nice to have feature request i think. we can attempt it later, just handling errors gracefully would be a good start for now
Ok, I will need to do a bit of refactoring then. What would be the default - "give up" if a file fails and print to the Rails log? This would need a refactor for all the file types, I think (since you need to match on storage-specific exceptions).
i think the default would be to remove the file from the list of files to be written at the header at the end and move on to the next file, and yeah log. that's at least a base-line we can improve from
i think the default would be to remove the file from the list of files to be written at the header at the end and move on to the next file, and yeah log. that's at least a base-line we can improve from
Aight