Closed JulieWinchester closed 1 year ago
Hey! It is possible but you do need more than just zip_tricks to accomplish this. We use a combination of zip_tricks and https://github.com/WeTransfer/interval_response and we cache the generated chunks of the ZIP file (not the bodies of the files we stitch into the output) upfront. It is not instant but enables the functionality that you are after. You can check out this presentation too: https://speakerdeck.com/julik/streaming-large-files-with-ruby
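The idea behind that combination can be sketched in a few lines (illustrative names only, not the API of either gem): the ZIP output is treated as a sequence of segments with known sizes, so any absolute byte offset requested by the client can be mapped to "which segment, and how far into it".

```ruby
# Hypothetical sketch of the segment-lookup idea - names are made up,
# this is not the interval_response API.
Segment = Struct.new(:name, :bytesize)

# Map an absolute byte offset in the whole response to
# [segment_index, offset_within_segment].
def locate(segments, absolute_offset)
  segments.each_with_index do |segment, index|
    return [index, absolute_offset] if absolute_offset < segment.bytesize
    absolute_offset -= segment.bytesize
  end
  nil # offset is past the end of the whole response
end

segments = [
  Segment.new("local_header", 89),
  Segment.new("file_body", 2_571_251_335),
  Segment.new("central_directory", 111),
]

locate(segments, 0)    # => [0, 0]  (inside the local header)
locate(segments, 100)  # => [1, 11] (11 bytes into the file body)
```

Because the header segments are cached upfront and the file bodies are STORED (not compressed), every byte of the output is reproducible, which is what makes range requests possible at all.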
Thanks much for pointing me to interval_response and the link to the presentation, these have been extremely helpful. I hope you don't mind if I ask 2 further questions?
I've been able to put together a minimal proof of concept for streaming resumable dynamically generated ZIP files, like so:
# Setup
require 'stringio'
require 'open-uri'
require 'zip_tricks'
require 'interval_response'

io = StringIO.new("")
file = URI.open("http://127.0.0.1:8984/rest/dev/00/02/00/01/000200016/files/60f258a2-1deb-4bf5-af61-675f2298bda6")
file_crc32 = 1333578197
file_size = 2571251335

# Mock out the ZIP, skipping the actual file body
ZipTricks::Streamer.open(io) do |zip|
  # The file is written "as is" (STORED mode).
  # Write the local file header first...
  zip.add_stored_entry(filename: "large_file.txt", size: file_size, crc32: file_crc32)
  @post_local_header_position = io.tell
  # ...then adjust the ZIP offsets within the Streamer as if the body had been written
  zip.simulate_write(file_size)
end

# Get the ZIP headers
io.rewind
local_header = io.read(@post_local_header_position)
central_header = io.read

segments = [
  local_header,
  IntervalResponse::LazyFile.new(file),
  central_header,
]

# Construct the interval sequence and send the response
interval_sequence = IntervalResponse::Sequence.new(*segments)
interval_response = IntervalResponse.new(interval_sequence, {})
interval_response.to_rack_response_triplet
After finishing this, I found myself with 2 additional questions I'm not quite sure how to answer.

1. Is there a way to generate the local and central headers more directly, perhaps with ZipTricks::ZipWriter? But I suppose it felt like then I might be losing a lot of the heavy lifting zip_tricks is doing behind the scenes.
2. Instead of opening the remote file upfront, could a segment of the IntervalResponse::Sequence keep the remote URL and then grab and yield remote resource chunks in a loop when needed? (By the way, I know this example is pulling from localhost and it would be much easier to just work directly with the files on the local filesystem, but we work with Duraspace Fedora, which practically speaking forces us to use HTTP here, ha!)

Thanks again for your time, and all the help so far!
The way we do it is with a customised writer for ZipTricks::Streamer. We write, using that writer, into an intermediary format which holds a list of binary strings and URLs of the included segments (we use STORED mode so the file bodies can be "spliced" into the output stream). This is very close to what you are doing:
class DrainableIO < StringIO
  def to_segment_and_clear
    # StringIO#string returns the internal buffer itself, so it must be
    # duplicated before the buffer gets truncated
    segment = string.dup
    truncate(0)
    rewind
    segment
  end
end
RemoteInclusion = Struct.new(:url, :bytesize)
and then
segments = []
drainable = DrainableIO.new
zip = ZipTricks::Streamer.new(drainable)
file_entries.each do |file_entry|
  zip.add_stored_entry(filename: file_entry.path, crc32: file_entry.crc32, size: file_entry.bytesize, modification_time: file_entry.created_at)
  segments << drainable.to_segment_and_clear
  segments << RemoteInclusion.new(file_entry.url, file_entry.bytesize)
  zip.simulate_write(file_entry.bytesize)
end
zip.close
segments << drainable.to_segment_and_clear
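A convenient property of this segment list (shown here with made-up stand-in strings, since the real ones are binary ZIP structures): every element, String or RemoteInclusion alike, answers #bytesize, so the exact Content-Length of the finished ZIP is known before a single remote byte is fetched.

```ruby
RemoteInclusion = Struct.new(:url, :bytesize) # same struct as above

segments = [
  "local header bytes", # stands in for real binary ZIP structures
  RemoteInclusion.new("https://example.com/file.bin", 1_000_000),
  "central directory + EOCD bytes",
]

# Both Strings and RemoteInclusions respond to #bytesize
total_size = segments.sum(&:bytesize)  # => 1_000_048
```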
What comes next is tricky - you might need to write your own response body class for interval_response. For example, you can use HTTP.rb to retrieve the upstream segments using ranges (via a subclass of the RackBodyWrapper from interval_response, for example):
def each
  buf = String.new(capacity: @chunk_size)
  @interval_response.each do |segment, range_in_segment|
    case segment
    when String
      with_each_chunk(range_in_segment) do |offset, read_n|
        yield segment.slice(offset, read_n)
      end
    when RemoteInclusion
      hh = { 'Range' => "bytes=#{range_in_segment.begin}-#{range_in_segment.end}" }
      response = HTTP.get(segment.url, headers: hh)
      while chunk = response.readpartial
        yield(chunk)
      end
    end
  end
ensure
  buf.clear
end
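One detail worth calling out in the RemoteInclusion branch: both Ruby's integer Range and the HTTP Range header treat the end offset as inclusive, so the interpolation maps one-to-one. A small illustration with made-up offsets:

```ruby
# A Ruby Range of byte offsets within a segment, as yielded above
range_in_segment = 512..1023

# HTTP Range headers are also inclusive on both ends (RFC 9110),
# so the endpoints can be interpolated directly
header_value = "bytes=#{range_in_segment.begin}-#{range_in_segment.end}"

header_value           # => "bytes=512-1023"
range_in_segment.size  # => 512 (bytes requested from the remote)
```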
Then it is roughly what you have, but don't forget to pass the Rack env to the interval response - otherwise it won't know which HTTP range the client wants ;-)
interval_sequence = IntervalResponse::Sequence.new(*segments)
interval_response = IntervalResponse.new(interval_sequence, env) # <= here
s, h, b = interval_response.to_rack_response_triplet
Come to think of it, interval_response does not contain a way to inject your own implementation of the RackBodyWrapper - if you need this I think I could add it, or add a way to "take" the interval response object out of the provided RackBodyWrapper and place it inside your own?..
If you want to use it with tempfiles and Patron like we do, you might also need the combination of https://github.com/WeTransfer/fast_send and Patron's get_file method. But if your application is not extremely high-traffic you might get away with just a streaming HTTP client like the above.
I'm looping back to this 1.5 years later, but I wanted to say that @julik your response was immensely helpful and greatly appreciated, thanks very much! I'll go ahead and close this issue out.
Thanks for putting this gem out there, it's great!
I see from the readme that it's possible to estimate the total size of a ZIP, and therefore the Content-Length header can be set correctly. Would it be possible to use zip_tricks in conjunction with byte range serving, so that if a user downloading a streaming ZIP had their connection interrupted, they would be able to resume the download? Or is that not feasible due to how the ZIPs are being constructed on the fly? Thanks!
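The intuition for why the size is knowable upfront: in STORED mode each entry contributes a fixed-size local header plus its filename and body, plus a fixed-size central directory record, and the archive ends with a 22-byte end-of-central-directory record. A back-of-the-envelope sketch (deliberately simplified - it ignores Zip64 records, extra fields, and data descriptors, so use the gem's own size estimation in practice):

```ruby
# Simplified STORED-mode ZIP size arithmetic, per the ZIP format:
#   local file header  = 30 bytes + filename length, followed by the body
#   central dir record = 46 bytes + filename length
#   end-of-central-dir = 22 bytes
# entries is a hash of { "filename" => body bytesize }
def estimated_zip_size(entries)
  local   = entries.sum { |name, size| 30 + name.bytesize + size }
  central = entries.sum { |name, _| 46 + name.bytesize }
  local + central + 22
end

estimated_zip_size({ "a.txt" => 10 })  # => 118
```

Because every contribution is deterministic, two requests for the same file set produce byte-identical output, which is exactly what resuming via byte ranges requires.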