WeTransfer / zip_tricks

[DEPRECATED] Compact ZIP file writing/reading for Ruby, for streaming applications

Question: Is it possible to use zip_tricks with byte ranges to support download resuming? #109

Closed JulieWinchester closed 1 year ago

JulieWinchester commented 3 years ago

Thanks for putting this gem out there, it's great!

I see from the readme that it's possible to estimate the total size of the ZIP, so the Content-Length header can be set correctly. Would it be possible to use zip_tricks in conjunction with byte range serving so that if a user downloading a streaming ZIP had their connection interrupted, they would be able to resume the download? Or is that not feasible due to how the ZIPs are being constructed on the fly? Thanks!

julik commented 3 years ago

Hey! It is possible but you do need more than just zip_tricks to accomplish this. We use a combination of zip_tricks and https://github.com/WeTransfer/interval_response and we cache the generated chunks of the ZIP file (not the bodies of the files we stitch into the output) upfront. It is not instant but enables the functionality that you are after. You can check out this presentation too: https://speakerdeck.com/julik/streaming-large-files-with-ruby
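
To make the idea concrete, here is a minimal sketch of how a requested byte range can be mapped onto a sequence of segments (cached ZIP headers and file bodies). It is not interval_response's actual implementation; `segments_for_range` and the sizes used are hypothetical and only illustrate the bookkeeping involved.

```ruby
# Hypothetical helper: given the byte sizes of consecutive segments and a
# requested inclusive byte range over the whole response, return
# [segment_index, range_within_segment] pairs covering that range.
def segments_for_range(segment_sizes, requested)
  result = []
  offset = 0 # absolute position where the current segment starts
  segment_sizes.each_with_index do |size, index|
    segment_first = offset
    segment_last = offset + size - 1
    offset += size
    next if size.zero?
    next if segment_last < requested.begin # segment ends before the range starts
    break if segment_first > requested.end # segment starts after the range ends

    # Clamp the requested range to this segment and make it segment-relative
    from = [requested.begin, segment_first].max - segment_first
    to = [requested.end, segment_last].min - segment_first
    result << [index, from..to]
  end
  result
end

# Made-up sizes: a 30-byte header, a 1000-byte file body, a 76-byte central directory
sizes = [30, 1000, 76]
p segments_for_range(sizes, 0..1105)   # the whole archive
p segments_for_range(sizes, 500..1105) # a client resuming at byte 500
```

A resumed download only needs the tail of the file body plus the central directory, which is why caching the small ZIP header segments is enough to serve any range.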

JulieWinchester commented 3 years ago

Thanks much for pointing me to interval_response and the link to the presentation, these have been extremely helpful. I hope you don't mind if I ask 2 further questions?

I've been able to put together a minimal proof of concept for streaming resumable dynamically generated ZIP files, like so:

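# Setup
require "open-uri"
require "stringio"
require "zip_tricks"
require "interval_response"

io = StringIO.new("")
# URI.open is the open-uri entry point; a bare Kernel#open no longer accepts URLs on Ruby 3+
file = URI.open("http://127.0.0.1:8984/rest/dev/00/02/00/01/000200016/files/60f258a2-1deb-4bf5-af61-675f2298bda6")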
file_crc32 = 1333578197
file_size = 2571251335

# Mock out zip
ZipTricks::Streamer.open(io) do |zip|
  # raw_file is written "as is" (STORED mode).
  # Write the local file header first..
  zip.add_stored_entry(filename: "large_file.txt", size: file_size, crc32: file_crc32)
  @post_local_header_position = io.tell

  # Adjust the ZIP offsets within the Streamer
  zip.simulate_write(file_size)
end

# Get zip headers
io.rewind
local_header = io.read(@post_local_header_position)
central_header = io.read
segments = [
  local_header,
  IntervalResponse::LazyFile.new(file),
  central_header,
]

# Construct interval and send response
interval_sequence = IntervalResponse::Sequence.new(*segments)
interval_response = IntervalResponse.new(interval_sequence, {})
interval_response.to_rack_response_triplet
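
The proof of concept above hard-codes `file_crc32` and `file_size`, since both are passed to `add_stored_entry` before any file data is sent. Where those values are not already stored alongside the file, one way to obtain them is a single chunked pass over the data, sketched here with the standard library only. The helper name `crc32_and_size` and the chunk size are assumptions for illustration.

```ruby
require "zlib"
require "stringio"

# Hypothetical helper: read an IO in chunks and return [crc32, bytesize]
# without holding the whole file in memory.
def crc32_and_size(io, chunk_size: 64 * 1024)
  crc = 0
  size = 0
  while (chunk = io.read(chunk_size))
    crc = Zlib.crc32(chunk, crc) # fold this chunk into the running checksum
    size += chunk.bytesize
  end
  [crc, size]
end

crc, size = crc32_and_size(StringIO.new("hello world"))
puts "crc32=#{crc} size=#{size}"
```

Computing these once and storing them next to the file avoids re-reading large files on every download.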

After finishing this, I found myself with 2 additional questions I'm not quite sure how to answer.

  1. Is there a better way to be "hand-crafting" the local and central zip headers? I know I could be directly working with ZipTricks::ZipWriter, but I suppose it felt like then I might be losing a lot of the heavy lifting zip_tricks is doing behind the scenes.
  2. This example uses open-uri to open a large remote HTTP file resource as a Tempfile. That seems relatively memory efficient to me, but it does use up disk temp space and also requires the user to wait while the tempfile is copied before anything gets started. The impression I got from your presentation is that you use Patron to solve this issue, grabbing chunks of remote HTTP resources part by part and sending them on, but I wasn't exactly sure how to do this in a way that interfaces with interval_response. Would the trick be writing a customized "LazyHttpResource"-ish class kind of based on IntervalResponse::Sequence to keep the remote URL and then grab and yield remote resource chunks as a loop when needed? (By the way, I know this example is pulling from localhost and it would be much easier to just go work directly with the files on the local filesystem, but we work with Duraspace Fedora which practically speaking forces us to use HTTP here, ha!)

Thanks again for your time, and all the help so far!

julik commented 3 years ago

The way we do it is with a customised writer for ZipTricks::Streamer. We write using that writer into an intermediary format which has a list of binary strings and URLs of included segments (we use STORED mode so they can be "spliced" into the output stream). This is very close to what you are doing:

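  class DrainableIO < StringIO
    # Returns everything written so far and empties the buffer.
    # StringIO#string returns the underlying buffer itself, so it has to be
    # copied before truncating, otherwise the returned segment ends up empty.
    def to_segment_and_clear
      string.dup.tap do
        truncate(0)
        rewind
      end
    end
  end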

RemoteInclusion = Struct.new(:url, :bytesize)

and then

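    segments = []
    drainable = DrainableIO.new
    zip = ZipTricks::Streamer.new(drainable)
    file_entries.each do |file_entry|
      # Write the local file header for this entry and capture it as a binary segment
      zip.add_stored_entry(filename: file_entry.path, crc32: file_entry.crc32, size: file_entry.bytesize, modification_time: file_entry.created_at)
      segments << drainable.to_segment_and_clear
      # The file body is not written, only referenced by URL and size
      segments << RemoteInclusion.new(file_entry.url, file_entry.bytesize)
      # Advance the Streamer's offsets as if the body had been written
      zip.simulate_write(file_entry.bytesize)
    end
    # Closing writes the central directory, which becomes the last segment
    zip.close
    segments << drainable.to_segment_and_clear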
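
The draining step can be tried in isolation, without zip_tricks. The sketch below uses a copy-on-drain variant of `DrainableIO` (`StringIO#string` hands back the underlying buffer, so it is duplicated before truncating) and placeholder strings instead of real ZIP headers. It also shows that the total response size is simply the sum of the segment sizes. The URL and byte counts are made up.

```ruby
require "stringio"

class DrainableIO < StringIO
  # Return what has been written so far and reset the buffer for the next write
  def to_segment_and_clear
    string.dup.tap do
      truncate(0)
      rewind
    end
  end
end

RemoteInclusion = Struct.new(:url, :bytesize)

drainable = DrainableIO.new
segments = []

drainable << "fake local header" # stands in for what the Streamer would write
segments << drainable.to_segment_and_clear
segments << RemoteInclusion.new("http://example.com/file.bin", 1_000)
drainable << "fake central directory"
segments << drainable.to_segment_and_clear

# Strings and RemoteInclusion both respond to #bytesize
total_size = segments.sum(&:bytesize)
puts total_size
```

Because every segment knows its size upfront, the full Content-Length is available before any remote file is fetched.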

What comes next is tricky - you might need to write your own response body class for interval_response. For example, you can use HTTP.rb to retrieve the upstream segments using ranges (using a subclass of the RackBodyWrapper from interval_response for example):

def each
  buf = String.new(capacity: @chunk_size)
  @interval_response.each do |segment, range_in_segment|
    case segment
    when String
      # Header bytes are already in memory: slice them out chunk by chunk
      # (with_each_chunk is a helper that is not shown here)
      with_each_chunk(range_in_segment) do |offset, read_n|
        yield segment.slice(offset, read_n)
      end
    when RemoteInclusion
      # File bodies are fetched from upstream, requesting only the bytes needed
      hh = {'Range' => "bytes=#{range_in_segment.begin}-#{range_in_segment.end}"}
      response = HTTP.get(segment.url, headers: hh)
      while (chunk = response.readpartial)
        yield(chunk)
      end
    end
  end
ensure
  buf.clear
end
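
The `with_each_chunk` helper is not shown in the snippet above. A minimal stand-in, assuming it only needs to split an inclusive byte range into `(offset, length)` pairs no larger than the chunk size, might look like the following. The keyword argument and its default are assumptions, and the real helper may differ.

```ruby
# Hypothetical stand-in for `with_each_chunk`: walks an inclusive byte range
# and yields [offset, read_n] pairs of at most chunk_size bytes each.
def with_each_chunk(range, chunk_size: 65_536)
  offset = range.begin
  remaining = range.end - range.begin + 1
  while remaining > 0
    read_n = [remaining, chunk_size].min
    yield offset, read_n
    offset += read_n
    remaining -= read_n
  end
end

# Slice bytes 2..8 of an in-memory segment in chunks of 3 bytes
segment = "0123456789"
chunks = []
with_each_chunk(2..8, chunk_size: 3) do |offset, read_n|
  chunks << segment.slice(offset, read_n)
end
p chunks
```

Yielding small slices keeps each write to the client bounded, even when a cached header segment is large.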

Then roughly what you have, but don't forget to pass the Rack env (which carries the request headers) to interval_response, otherwise it won't know which HTTP range the client wants ;-)

interval_sequence = IntervalResponse::Sequence.new(*segments)
interval_response = IntervalResponse.new(interval_sequence, env) # <= here
s, h, b = interval_response.to_rack_response_triplet
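
To illustrate why the env matters: Rack exposes the client's `Range` request header as `env["HTTP_RANGE"]`, so an empty hash always looks like a request for the full body. The sketch below parses a single byte range from such an env. It is only an illustration with a hypothetical helper name, and interval_response's own parsing may differ.

```ruby
# Hypothetical helper: parse a single "bytes=first-last" range out of a Rack env.
# Returns an inclusive Range, or nil when there is no usable Range header.
def requested_range(env, total_size)
  header = env["HTTP_RANGE"]
  return nil unless header

  match = header.match(/\Abytes=(\d*)-(\d*)\z/)
  return nil unless match

  first, last = match[1], match[2]
  return nil if first.empty? && last.empty?

  if first.empty?
    # Suffix form "bytes=-N": the last N bytes of the response
    n = last.to_i
    return nil if n.zero?
    ([total_size - n, 0].max)..(total_size - 1)
  else
    from = first.to_i
    to = last.empty? ? total_size - 1 : [last.to_i, total_size - 1].min
    return nil if from > to
    from..to
  end
end

p requested_range({}, 1106)                              # no header: serve everything
p requested_range({ "HTTP_RANGE" => "bytes=500-" }, 1106) # resume from byte 500
```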

Come to think of it, interval_response does not contain a way to inject your implementation of the RackBodyWrapper - if you need this I think I could add it, or add a way to "take" the interval response object out of the provided RackBodyWrapper and place it inside your own?..

If you want to use it with tempfiles and Patron like we do, then another bit you might need is the combination of https://github.com/WeTransfer/fast_send and Patron's get_file method. But if your application is not extremely high-traffic you might get away with just a streaming HTTP client like above.

JulieWinchester commented 1 year ago

I'm looping back to this 1.5 years later, but I wanted to say that @julik your response was immensely helpful and greatly appreciated, thanks very much! I'll go ahead and close this issue out.