fringd / zipline

A gem that lets you stream a zip file from Rails
MIT License

Memory usage #61

Closed mib32 closed 3 years ago

mib32 commented 4 years ago

Hi Ram! Thank you very much for this gem (literally); I can't say how amazed I am by the combination of simplicity and functionality it gives.

However, it does consume quite a lot of memory, which, of course, has always been a big issue with the combination of Ruby and streaming. Sometimes I really wonder where the memory goes: we are not creating massive objects, just handing out small chunks of data to the client...

I wonder if you know some tricks that could help save memory.

Thanks once again.

fringd commented 4 years ago

Yeah, it shouldn't use that much. If something you are doing is reading the whole input file in before sending it to zipline, that could be the issue. Can you describe how you are using the gem? If this is an issue with zipline itself, that's a big deal, since a major reason for this gem is to make zipping work in low-storage/low-memory environments.
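
To make that concrete, here is a hypothetical contrast (my example, not from the thread; it assumes the list of source URLs is in a variable called urls):

require 'open-uri'

# Anti-pattern: this materializes every file as a full Ruby String
# before any zipping starts, so five ~50 MB files mean ~250 MB of
# live strings at once.
blobs = urls.map { |u| URI.open(u).read }

# Streaming-friendly: pass the URLs themselves (as mib32 does below)
# and let zipline pull each file through in small chunks as it writes.
zipline(urls.map { |u| [u, File.basename(u)] }, 'archive.zip')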

mib32 commented 4 years ago

@fringd So I use it like this:

files_list = [
  ["https://url.to/file1", 'file_about_50_mb'],
  ["https://url.to/file2", 'file_about_50_mb'],
  ["https://url.to/file3", 'file_about_50_mb'],
  # ... two more files, so the archive is about 220 MB
]
zipline(files_list, 'archive.zip')

On production, with Puma, if I make a request to this, my app's memory quickly goes from 160M to 300M, and then it always stays at that level. And if there are two or three parallel requests, it reaches 360M and 420M. I also noticed that with parallel requests my nginx goes from 10M to 110M for a little while, but quickly releases it back; interesting.

I tried setting MALLOC_ARENA_MAX to 2, but it doesn't seem to have much effect. My Rails is 5.2.4, Puma 4.3.3, in cluster mode with 1 worker and 0-14 threads.

fringd commented 4 years ago

Is it possible puma is not streaming the response? Does the first byte download right away for you when you click the download link?
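
One quick way to check (a sketch, not from the thread; the endpoint URL is hypothetical) is to time the first body chunk and look at the response headers:

require 'net/http'

uri = URI('https://your-app.example/download') # hypothetical endpoint
started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
  http.request(Net::HTTP::Get.new(uri)) do |response|
    # A streamed response is chunked and carries no Content-Length.
    puts "Content-Length:    #{response['Content-Length'].inspect}"
    puts "Transfer-Encoding: #{response['Transfer-Encoding'].inspect}"
    response.read_body do |_chunk|
      elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
      puts "first chunk after #{elapsed.round(2)}s"
      break # abandon the rest of the body
    end
  end
end

If the first chunk only arrives after a long pause, something between Puma and the client is buffering the whole response.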

mib32 commented 4 years ago

Yes, it does stream: it starts immediately and has no Content-Length.

fringd commented 4 years ago

Hmm, did something change in the Net::HTTP API?

I have a guess about what this might be; could you help me test out a fix?

fringd commented 4 years ago

Wait, no, it all looks right to me... Is there any chance this is an issue with other middleware you have configured? I feel like if this were a zipline problem, it would be a bigger issue affecting more people.

Can you dig around and see if there's anything else that might be wrong? If not, can you try to come up with a minimal reproduction of the error?

mecampbellsoup commented 4 years ago

Usually in these cases you are instantiating a large number of (large?) objects in memory before you start streaming.
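
One way to keep that from happening (a sketch; file_records and its accessors are hypothetical, and it assumes zipline only needs to call each on its argument) is to build the pair list lazily:

# Nothing large is materialized up front; each [url, name] pair is
# produced only when the zip writer reaches it.
pairs = file_records.lazy.map { |record| [record.url, record.filename] }
zipline(pairs, 'archive.zip')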

julik commented 3 years ago

@mib32 I don't know what your particular situation is like, but I would suspect it is one of two things:

1. Your choice of HTTP client. Different HTTP clients have different memory footprints when doing this kind of work, and the last time I experimented, next to none had constant-memory downloading from the origin. So what might be happening is that your HTTP client is maintaining internal buffers or slicing strings too much, which leads to memory bloat. When I was solving this, the solution we ended up with was downloading directly to a file on the filesystem (with multiple Range requests) and then using io.sendfile to deliver the file into the socket; this allowed us to bypass Ruby allocations for the actual bytes completely. This is also why we need the offset adjustment in zip_tricks. A rough sketch follows the list.

2. Something in your Rack stack is buffering, even though you say that the response begins immediately. I haven't seen this happen in the wild, but it might be the case.
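
A rough sketch of that download-to-disk approach (the helper name is hypothetical; the Range requests and the sendfile delivery of the real implementation are not shown):

require 'net/http'
require 'tempfile'

# Stream an origin response straight to disk; only one body chunk
# exists as a Ruby String at any moment.
def download_to_tempfile(uri)
  tmp = Tempfile.new('zip-part')
  tmp.binmode
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(Net::HTTP::Get.new(uri)) do |response|
      response.read_body { |chunk| tmp.write(chunk) }
    end
  end
  tmp.rewind
  tmp
end

The resulting file can then be handed to the zip writer, or pushed into the socket with IO#sendfile (from the sendfile gem) so the bytes never touch the Ruby heap again.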

A sampling memory profiler for Ruby that measures allocations would likely help you. This might be useful: https://speakerdeck.com/julik/streaming-large-files-with-ruby?slide=16
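
For instance, stackprof's object mode samples allocations (a sketch; it assumes the stackprof gem and some way to exercise the download path outside a live request):

require 'stackprof'

StackProf.run(mode: :object, out: 'tmp/stackprof-object.dump') do
  # exercise the suspect code path here, e.g. the zipline download
end
# Inspect the allocation hotspots with:
#   stackprof tmp/stackprof-object.dump --text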

mib32 commented 3 years ago

@julik Thanks, Julik! I will check that out; sounds reasonable.