fringd / zipline

A gem that lets you stream a zip file from rails
MIT License

Need better error handling when there is a failure to fetch a file #78

Open suhastech opened 3 years ago

suhastech commented 3 years ago

We stumbled upon a tricky bug where no matter what, the ZIP produced by zipline was corrupt. Tracking it down was quite difficult.

It seems that the download of one of the S3 files failed here:

https://github.com/fringd/zipline/blob/master/lib/zipline/zip_generator.rb#L96

file.download { |chunk| writer_for_file << chunk }
  S3 Storage (46.4ms) Downloaded file from key: 123
Traceback (most recent call last):
Aws::S3::Errors::NotFound ()
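
For reference, the failure mode is roughly this (a simplified sketch, not zipline's exact code): the entry's local header has already been streamed out when the download raises, so the client ends up with a truncated entry and, ultimately, a corrupt archive.

# Simplified sketch of the failing path, using the names from zip_generator.rb above
streamer.write_deflated_file(name) do |writer_for_file|
  file.download { |chunk| writer_for_file << chunk }  # raises Aws::S3::Errors::NotFound mid-entry
end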

Possible solutions:

fringd commented 2 years ago

yeah we should handle this better.

I'm not sure how, but if we could abort the whole download that's probably the best option...

fringd commented 1 year ago

i googled a bit for some possible solutions here... https://stackoverflow.com/questions/15305203/what-to-do-with-errors-when-streaming-the-body-of-an-http-request

these two options seem like the best leads:

  1. Write a malformed chunk (and close the connection), which will trigger a client error
  2. Add an HTTP trailer telling your client that something went wrong.

the trailer seems pretty robust, but i'm not sure how to do that. https://www.rubydoc.info/gems/rack/Rack/Chunked is the only mention of trailers in rails i could find.

a malformed chunk? not sure how to make rails do that either.

if anybody else has any clues, let me know
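
(For reference, a minimal sketch of the "abort" option, assuming a plain Rack streaming body - this is not zipline's API. If the wrapped body raises mid-stream, the app server drops the connection without ever sending the terminating zero-length chunk, so a chunked-transfer client reports an incomplete download instead of saving a seemingly valid zip.)

# Hypothetical wrapper around a streaming Rack body
class AbortingBody
  def initialize(inner_body)
    @inner_body = inner_body
  end

  # Rack calls #each to stream the response; re-raising makes the server
  # close the connection mid-chunk, which clients detect as a failed download.
  def each(&chunk_handler)
    @inner_body.each(&chunk_handler)
  rescue => e
    Rails.logger.error("Aborting zip stream: #{e.class}: #{e.message}")
    raise
  end
end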

Urist-McUristurister commented 1 year ago

Hey, just an idea, it may be stupid, but maybe not, so...

If zipline fails to fetch a file, or any other error occurs:

  • log this
  • if it was a failure to fetch a file, add the file url to some kind of internal @skipped list
  • if possible, do NOT add the failed file to the archive. If not, better to have a working archive with a few broken files than a whole broken archive. Especially if it's multiple GBs in size.
  • move on to the next item on the list
  • In the end, add a text file named, say, "failures.log", listing the @skipped files with the corresponding http errors

I don't know the internal mechanism of zipline, so, if skipping is not possible, then write the closing sequence for the broken file (if needed/possible), write the "failures.log", and finish up with a proper zip header (well, footer, technically) - to allow the client to unarchive at least what has been fetched so far.

Optionally, add a configuration/options/settings/initialiser to customise the name and content (using proc) of the failures log file (or disable it completely).

WDYT?...
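
(A minimal sketch of the failures.log idea, assuming access to the zip_tricks Streamer that zipline wraps and a hypothetical @skipped list of [name, error] pairs collected while streaming:)

# Hypothetical: append a final entry listing everything that could not be fetched,
# so the client at least gets a valid archive plus an explanation.
streamer.write_deflated_file("failures.log") do |sink|
  @skipped.each { |name, error| sink << "#{name}: #{error}\n" }
end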

fringd commented 1 year ago

this might be possible if the failure happened at the beginning. if it happened in the middle of the file, that data has already been streamed out to the user and it's not possible to walk it back, as far as i know. i think i'd rather abort than have a chance that the user thinks this file is valid.


fringd commented 1 year ago

that said, an errors.log would be better than nothing. i'll keep this in mind if we try the other options and can't get them working.

Urist-McUristurister commented 1 year ago

if it happened in the middle of the file that data has already been streamed out to the user and it's not possible to walk it back as far as i know

But there's no need to walk back.

Before switching to zipline, for many years we used a "classic" approach: download the files to the worker machine, zip them, and then send the archive to the client. From that experience, the vast majority of errors related to remote files happen when trying to fetch the file - a 404 or 5XX, for example, or any other type of error. A failure in the middle of transferring a remote file is something I can't say I have ever encountered. So if it's possible to just log the failure and skip the file, that would be best.

But even in the case where we've fetched, say, 20 MB of a 25 MB file and the connection to the remote host goes down or times out mid-transfer — is it possible to tell the client "ok, that's the end of this file, here's the next one" and just jump to the next file in the list? Because it is still much more preferable to have a log + a corrupted 20 MB file + 1999 full files than a broken 15 GB archive that you can't even examine or salvage.

fringd commented 1 year ago

yeah actually that makes some sense to me... this gem is mostly for those giant downloads, so maybe just getting partial success is generally better. my worry is the user will not know that the 1 file in there is actually corrupt.

Urist-McUristurister commented 1 year ago

my worry is the user will not know that the 1 file in there is actually corrupt.

Hence the error.log ;) Better one corrupt file than a 15 GB corrupt archive...

julik commented 9 months ago

The problem is that a ZIP file which has been "chopped off" at the end is likely not to unarchive well, because the list of files inside the ZIP comes at the end of the file. You can read the ZIP "straight ahead", but in the presence of data descriptors this is likely to fail in some apps, and it is very inefficient. It can also be faulty - for example, if entries are themselves ZIP files. What you need to make this work is a resumable download, similar to what we did at WeTransfer back in the day. You might be able to put something together using raw zip_tricks and interval_response, so that when you request the download again you are able to resume it. It is not trivial to do by any means, but doable.

An error.log would be a nice feature, although by the time your download is interrupted you can no longer add anything to the response.

But even in the case where we've fetched, say, 20mb out of 25mb file, and in the middle of transfer the connection to remote host went down or timed out — is it possible to stream to the client "ok, that's the end of this file, here's the next one" and just jump to the next file in the list?

If your concern is the "upstream" (where you are pulling the file from) timing out on you or similar, you can pull data from it in chunks and resume where it was interrupted. HTTP Range headers can be used for that.
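
(A sketch of that idea with plain Net::HTTP, which is an assumption for illustration - zipline itself goes through the storage client. Each chunk is an independent request, so an upstream hiccup only costs one chunk:)

require "net/http"

# Yields the remote object in fixed-size ranged chunks. Requires an upstream
# that honors Range requests, which S3 and most CDNs do.
def each_ranged_chunk(uri, chunk_size: 5 * 1024 * 1024)
  offset = 0
  loop do
    request = Net::HTTP::Get.new(uri)
    request["Range"] = "bytes=#{offset}-#{offset + chunk_size - 1}"
    response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
      http.request(request)
    end
    break unless response.is_a?(Net::HTTPPartialContent) # 206; a 416 means we ran past the end
    yield response.body
    offset += response.body.bytesize
    break if response.body.bytesize < chunk_size # short read: that was the last chunk
  end
end

# each_ranged_chunk(URI("https://example-bucket.s3.amazonaws.com/key")) { |chunk| writer_for_file << chunk }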

Urist-McUristurister commented 9 months ago

@julik I'm not talking about "chopping off" the zip file, I'm talking about the files being requested getting chopped off. Or, which is actually more frequent, non-existing (404) files that zipline is fetching from the URL. If I specify a list of 500+ URLs (S3, for example) and one of them in the middle of the list does not exist, then the whole 50 GB archive is botched. And the frustration of downloading a 30 GB file that turns out to be broken, then downloading it again just to find out that it's still broken, is immense.

I'm not talking about chopping the zip file. I'm talking about at least skipping the failed archivable file. If the file being fetched throws an error, any error, then either skip writing it altogether or, if it's a partially fetched file, write what has been fetched so far (say, 50 MB of 150), and move on to the next file. And at the very end, in the list of zipped files, say "ok, this file is 50 MB in size, next".

That way, when the user downloads the archive, only one file in the whole archive will be "broken" (zero-sized, partial, or not present at all), not the whole archive itself.

julik commented 9 months ago

That should actually work if you wrap your call to write_deflated_file (the block form) in a rescue. If you do that, zip_tricks will ~move on to the next file and not add the file you have written to the central directory.~ The likelihood is high that the unarchiving software opening the file will be able to unarchive correctly based on the central directory, which gets output at the end. Try adding a rescue around ZipGenerator#write_file and test with some large-ish outputs, interrupting them halfway?

Update - I was wrong, because even though ZipTricks will not finalize the file it will still add it to the central directory - albeit in a "broken" form. You would need to muck about with the internals a bit:

require "delegate" # SimpleDelegator comes from the stdlib "delegate" library

module SaferZipWrite
  # Removes a file that does not finish writing from the
  # @files array in the Streamer, so that the file will not
  # be present in the ZIP central directory. Some unarchivers
  # might then still be able to unarchive files which do complete writing.
  class StreamerWrapper < SimpleDelegator
    def write_deflated_file(...)
      files = __getobj__.instance_variable_get("@files")
      n_files_before_writing = files.size
      super
    rescue => e
      # do something with the exception
      files.pop if files.size > n_files_before_writing
    end

    def write_stored_file(...)
      files = __getobj__.instance_variable_get("@files")
      n_files_before_writing = files.size
      super
    rescue => e
      # do something with the exception
      files.pop if files.size > n_files_before_writing
    end
  end

  def write_file(streamer, file, name, options)
    super(StreamerWrapper.new(streamer), file, name, options)
  end
end

Zipline::ZipGenerator.prepend(SaferZipWrite)

Urist-McUristurister commented 9 months ago

Well... I actually did something similar: I patched the write_file method to first attempt the download of the target file, with up to 3 retries. If it still fails after 3 retries, it writes a text file with the same name and the error message, and then hands that file to streamer.write_stored_file (stored because we're serving mostly pictures, no BMPs, so it doesn't really make much sense to compress them even further. A "compression" option for the zipline method, maybe?). Sure, it's slower, since it has to download the file first, but it's more reliable. Still, not something that would fit everyone.

I'll try your suggestion, though, thanks! 👍 Maybe I don't have to download the whole file after all...

Still, would be nice to have this feature built-in ;)
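
(Roughly, the workaround described above could look like the sketch below. The module and its names are hypothetical, it mirrors the write_file signature used in the patch further up, and it assumes the file object returns its data from #download when called without a block, as ActiveStorage blobs do. Buffering the whole file before streaming is exactly what makes it slower.)

module PrefetchingWrite
  MAX_ATTEMPTS = 3

  def write_file(streamer, file, name, options)
    attempts = 0
    data = begin
      file.download # buffer the whole blob first, so a failure happens before any bytes are streamed
    rescue => e
      attempts += 1
      retry if attempts < MAX_ATTEMPTS
      # after the final attempt, the entry's content becomes the error message
      "Failed to fetch after #{MAX_ATTEMPTS} attempts: #{e.class}: #{e.message}"
    end
    streamer.write_stored_file(name) { |sink| sink << data }
  end
end

Zipline::ZipGenerator.prepend(PrefetchingWrite)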

Urist-McUristurister commented 9 months ago

even though ZipTricks will not finalize the file it will still add it to the central directory - albeit in a "broken" form

That's still much more preferable than a broken archive! 👍 😉

julik commented 9 months ago

"compression" option for zipline method, maybe?

that is for @fringd to ponder 😆 The feature would be nice, indeed - I'd wager it would have to be added in zip_tricks though, and I haven't yet sorted out the maintenance there. There could be a heuristic in zipline where it picks stored or deflated depending on the filename extension, for instance. Note that storing an Excel or Word file uncompressed can lead to interesting bugs, because if a program reads a ZIP "straight ahead" it will sometimes think that parts of those Office files are themselves entries of the "outer" ZIP 🤡 I remember the macOS Archive Utility used to do this, but I don't know in which version they fixed it.

fringd commented 9 months ago

PRs welcome of course. I'm assuming there isn't enough local storage for files generally though, so anything that waits for a whole file before continuing is a non-starter 😕

  1. Maybe we could just abandon the data and remove the failed file from the list of central directory entries. Even include an "ERRORLOG.txt" file or something in the archive saying that one or more files failed to fetch. i'm not 100% sure, but i think it's valid to have a bunch of garbage data in the zip file that is never pointed to by the central directory entries.
  2. Code that at least attempts to recover if the download fails would be welcome... trying to restart or resume and continue feeding data into the output stream somehow without abandoning the written data... or alternately abandon the already written data, again discarding the central directory entry, but this time adding a new one and starting over and hoping this works.
  3. i think the cost of deflating files that don't benefit is negligible, so it's probably not worth the trouble to do any fancy extension heuristics

julik commented 8 months ago

For those pondering a "heuristic" choice between the stored and deflated modes, here is something that could work nicely. It buffers roughly 64 KB of input while monitoring how well it compresses (you could also pick a different threshold or ratio), then picks the more efficient option and opens a file for you in the Streamer.

module HeuristicWrite
  class DeflateHeuristic
    BYTES_WRITTEN_THRESHOLD = 65*1024

    def initialize(filename, streamer)
      @filename = filename
      @streamer = streamer

      @buf = StringIO.new.binmode
      @deflater = ::Zlib::Deflate.new(Zlib::DEFAULT_COMPRESSION, -::Zlib::MAX_WBITS)
      @bytes_deflated = 0

      @winner = nil
    end

    def <<(bytes)
      if @winner
        @winner << bytes
      else
        @buf << bytes
        @deflater.deflate(bytes) { |chunk| @bytes_deflated += chunk.bytesize }
        decide if @buf.size > BYTES_WRITTEN_THRESHOLD
      end
      self
    end

    def close
      decide unless @winner
      @winner.close
    end

    private def decide
      # Finish and then close the deflater - it has likely buffered some data
      @bytes_deflated += @deflater.finish.bytesize until @deflater.finished?

      # If the deflated version is smaller than the stored one
      # - use deflate, otherwise stored
      ratio = @bytes_deflated / @buf.size.to_f
      if ratio < 0.75
        warn "using flate"
        @winner = @streamer.write_deflated_file(@filename)
      else
        warn "using stored"
        @winner = @streamer.write_stored_file(@filename)
      end
      @buf.rewind
      IO.copy_stream(@buf, @winner)
    end
  end

  def write_file(name)
    writable = DeflateHeuristic.new(name, self)
    yield(writable)
    writable.close
  end
end
ZipTricks::Streamer.include(HeuristicWrite)

io = StringIO.new
ZipTricks::Streamer.open(io) do |zip|
  zip.write_file("this will be stored.bin") do |io|
    90_000.times do
      io << Random.bytes(1024)
    end
  end

  zip.write_file("this will be deflated.bin") do |io|
    90_000.times do
      io << "many many delicious, compressible words"
    end
  end
end
warn io.size

julik commented 7 months ago

@Urist-McUristurister @suhastech @fringd Now that the zip_tricks situation has been resolved - maybe we should think about an API for this? I've added a feature to zip_kit which excludes the failed file from the ZIP central directory and removes it from the path set - so you can add the same file multiple times if it fails halfway.

The issue is that you need to specify a retry policy - or have some code for it in some way. For example, you could have a retry policy where, if one of the files does not get included correctly or fails to download from the remote, zipline continues to the next. Or one that does 5 attempts and then aborts. Or one that does 3 attempts with exponential backoff and then moves on to the next file if it still fails.

With "naked" zip_kit you could do that with something like https://github.com/kamui/retriable by doing

Retriable.retriable(on: TimeoutError) do
  zip.write_file("my_stuff.bin") do |sink|
    HTTP.get(the_url) { |response_body| IO.copy_stream(response_body, sink) }
  end
end

The configuration for Retriable could be passed into zipline as a keyword argument, but then retriable would become a dependency for zipline. It also does not allow specifying a policy such as "continue to the next file on that error", and the errors that can be raised by cloud services when fetching from remote are many.
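
For the sake of discussion, here is a purely hypothetical shape such a policy hook could take in zipline - none of these keyword arguments exist today:

# Hypothetical API sketch: let the caller decide, per error, whether to retry,
# skip the file, or abort, instead of zipline hard-coding a retry gem.
zipline(files, "archive.zip", on_fetch_error: ->(exception, filename, attempt) {
  if attempt < 3
    :retry
  else
    Rails.logger.warn("Skipping #{filename}: #{exception.class}")
    :skip # or :abort to stop the whole stream
  end
})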

fringd commented 7 months ago

if fetching a file over http fails after x of n bytes have been written, it would be nice if we could try to resume using range requests, starting the download at byte x if supported
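
(A sketch of that resume idea, assuming a hypothetical ranged-fetch helper and that we track how many bytes of the entry have already been streamed:)

bytes_written = 0
attempts = 0
begin
  # fetch_from is hypothetical: a ranged HTTP fetch starting at the given byte offset
  fetch_from(url, offset: bytes_written) do |chunk|
    writer_for_file << chunk
    bytes_written += chunk.bytesize
  end
rescue SocketError, Timeout::Error
  attempts += 1
  retry if attempts < 3 # resume from bytes_written instead of restarting the entry
  raise
end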

fringd commented 7 months ago

i'm guessing most CDNs support range requests

julik commented 7 months ago

Storage backends like S3 do, but it is not guaranteed - also S3 will not respond to a HEAD, only to a GET - you will need to do a "start request", and so on. I mean - all of these are doable, what I am wondering about is how to make it sufficiently generic so that it would be "good enough for most folks" - or how to provide an extension point for this, alternatively. If you can resume an upstream download you can rescue inside of the write_file block - but even ActiveStorage will not give you an API for this. Not now at least, unless you pre-chop ranges to fetch. There is a lot of ranged fetching work here https://github.com/WeTransfer/format_parser/blob/master/lib/remote_io.rb and https://github.com/WeTransfer/format_parser/blob/master/lib/care.rb and https://github.com/WeTransfer/format_parser/blob/master/lib/active_storage/blob_io.rb for instance, and it has (for one) to be HTTP-client-specific 🤔

fringd commented 7 months ago

Okay, attempting to resume gracefully would be a nice-to-have feature request, i think. we can attempt it later; just handling errors gracefully would be a good start for now

julik commented 7 months ago

Ok, I will need to do a bit of refactoring then. What would be the default - "give up" if a file fails and print to the Rails log? This would need a refactor for all the file types I think (since you need to match on storage-specific exceptions)

fringd commented 7 months ago

i think the default would be to remove the file from the list of entries written to the central directory at the end, move on to the next file, and yeah, log it. that's at least a baseline we can improve from

julik commented 7 months ago

i think the default would be to remove the file from the list of entries written to the central directory at the end, move on to the next file, and yeah, log it. that's at least a baseline we can improve from

Aight