ex-aws / ex_aws

A flexible, easy-to-use set of AWS API clients for Elixir
https://hex.pm/packages/ex_aws
MIT License

Support Stream for S3.get_object #187

Closed heri16 closed 8 years ago

heri16 commented 8 years ago

In case downloads are interrupted, it would be nice to have a streaming interface for S3.get_object using HTTPoison's process_response_chunk or similar.

benwilson512 commented 8 years ago

I'm not sure a stream for get_object itself makes sense. It's executed in a single HTTP request and everything is kept in memory. There isn't any state from which to resume.

I am planning on adding a S3.download type function that would handle concurrency, resuming if there's pre-existing file data, etc.

That said, if there was some kind of range specified, I suppose I could return a stream that basically just pushed along that range. The issue is that I'm not entirely sure that's more useful than the concurrent-downloads feature we're already talking about. I would want to build that first and then see whether it could reasonably be applied here.

heri16 commented 8 years ago

Streams supported by HTTPoison are not kept in memory, because each chunk is garbage collected once handled. I don't see why a single HTTP request is a problem at all if the user doesn't want and doesn't need concurrency. There is no need for a resume feature: interrupted streams in HTTPoison are just like interrupted downloads in a web browser, a single HTTP request piping into a file. The user should decide what to do with the partially downloaded file. (E.g. for MP3 audio, the user could stream chunks downstream to a web browser or a transcoder.) There is no need to support resuming directly.


benwilson512 commented 8 years ago

Ah, so the issue here is that ExAws does not tie itself to any particular underlying HTTP library. I'm not entirely sure what you mean by HTTPoison streams not being kept in memory. If you discard each chunk as it comes in, sure, but if you keep it around, then it's kept around.

More to the point, the notion of streaming and downloading in chunks that I've been referring to works via the Range header with bytes=x-y, where each request grabs bytes x to y. It isn't clear to me that AWS S3 supports any other method; the documentation does not indicate any support for a chunked download. Regardless, the individual HTTP chunks that may or may not be surfaced by a particular HTTP library within a given request are beyond the scope of what I'm trying to address here.
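For reference, the Range-based approach splits an object into fixed-size byte windows, each fetched with its own request. A minimal sketch of computing those header values (the `RangeChunks` module is purely illustrative, not part of ExAws):

```elixir
# Hypothetical helper: split a total object size into HTTP Range header
# values of at most `chunk_size` bytes each. Names are illustrative.
defmodule RangeChunks do
  def headers(total_size, chunk_size) do
    0
    |> Stream.iterate(&(&1 + chunk_size))
    |> Stream.take_while(&(&1 < total_size))
    |> Enum.map(fn start ->
      # Range headers are inclusive, so the last byte is stop - 1.
      stop = min(start + chunk_size, total_size) - 1
      "bytes=#{start}-#{stop}"
    end)
  end
end

# A 250-byte object in 100-byte chunks yields three ranges:
RangeChunks.headers(250, 100)
#=> ["bytes=0-99", "bytes=100-199", "bytes=200-249"]
```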

benwilson512 commented 8 years ago

With respect to the roadmap: I've been actively developing a GenStage integration, and any kind of streaming API will be built on that basis.

heri16 commented 8 years ago

Not sure how I can explain it better. Streaming support is useful when there is no need for concurrency. Each chunk comes in and is forwarded to disk or the network. What I mean by a chunk here is that when we request the whole file, the body is returned to us in chunks. (As such, byte ranges are not relevant to this issue.) Let me reiterate that only one request is sent to AWS.

Please see the HTTPoison documentation. Most other HTTP libraries in Elixir, including hackney, support a chunk-based interface.

I think both of us might be mixing up the other issue on concurrent piecemeal downloads with this one. They are two different things.


heri16 commented 8 years ago

For example, instead of waiting for the whole body to come in over the network, we can intersperse processing between each megabyte as it arrives. This dramatically reduces latency, because we are not waiting for the whole document body before processing it.

SweetXML also supports stream-based decoding this way, and its documentation mentions that both latency and memory use are reduced as a result.


ericmj commented 8 years ago

It is tricky to do this, firstly because, as @benwilson512 said, it's hard to create a generic adapter interface for HTTP clients. There is also the issue that most HTTP clients do not support back-pressure for streaming, afaik. Both gen_stage and streams require some sort of back-pressure if you don't want to potentially buffer a lot of data in memory.

I think this feature would be appreciated, but because of the difficulties involved I think we need a proposal for an implementation at this point.

benwilson512 commented 8 years ago

Please see the httpoison documentation. Most other http libraries in elixir, including hackney, supports chunk-based interface.

HTTPoison is literally just a wrapper around hackney, so that isn't saying much.

I think there's some ambiguity about what we mean by streaming here. Frequently, when people talk about a Stream in Elixir, they mean a lazy Enumerable. This is how ExAws.stream! works: it returns a lazy Enumerable that can be further composed with the Stream module or consumed immediately via Enum.

HTTPoison streaming, on the other hand, is different: it sends the requesting process a message as each part of the HTTP response comes in.

ExAws.stream! will only support the former kind of streaming.

The issues with hooking directly into the latter kind of streaming are:

1) Back pressure: if the processing can't keep up with the download rate, you get unbounded memory growth.

2) Throughput: if the processing is far faster than what a given connection to S3 can provide, you're stuck waiting on chunks and can't leverage concurrency.

3) Generality: right now, the bar for integrating an HTTP library with ExAws is handling the basic GET, PUT, POST and friends. Either I move to a position that says "unless you use hackney, not all features are available", or I have to handle integrating a variety of process architectures, message formats, and so forth. This is a non-trivial effort, and even if completed we still have the issues above.

instead of waiting for the whole body to come in over the network, we can intersperse processing between each megabyte as it comes over the network.

We don't have to choose between using HTTPoison (or similar) streaming and having the whole body. There is a third option: send multiple requests, each for a different megabyte chunk, via the Range header. Then we still get megabyte-by-megabyte processing, but by integrating with the GenStage behaviours we get back pressure, as much throughput as your local network can handle, and easy integration with any other GenStage processes. This will also give us, for free, the ability to return a regular Elixir stream via https://hexdocs.pm/gen_stage/Experimental.GenStage.html#stream/1
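The third option can be sketched as a lazy stream over byte ranges, each of which would become one ranged request. Here `fetch_range` merely echoes the range; in the real thing it would issue the HTTP request (all names below are illustrative, not the ExAws API):

```elixir
# Sketch: lazily generate megabyte ranges for a 3 MB object. Nothing
# is computed until the stream is consumed, so processing can be
# interleaved chunk by chunk.
chunk = 1_048_576
total = 3 * chunk

# Stand-in for a ranged S3 request; in reality this would perform
# the HTTP call for the given byte window.
fetch_range = fn {start, stop} -> "bytes=#{start}-#{stop}" end

ranges =
  Stream.unfold(0, fn
    start when start >= total -> nil
    start -> {{start, min(start + chunk, total) - 1}, start + chunk}
  end)

Enum.map(ranges, fetch_range)
#=> ["bytes=0-1048575", "bytes=1048576-2097151", "bytes=2097152-3145727"]
```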

benwilson512 commented 8 years ago

Fixed in #196

heri16 commented 8 years ago

Looked at the pull request. It still does not address the purpose of doing a get_object without concurrent HTTP connections. Each request to AWS is charged. What we require is a way to scan the incoming stream over a single HTTP request and stop when a word is found. The word could be in the first 10kb or the last 10kb. Doing multiple requests is wasteful.

ericmj commented 8 years ago

@heri16 As I said earlier we would need a way to communicate back-pressure to the http client or the http client could overflow the mailbox.

This feature has been built with GenStage, which handles back-pressure by explicitly sending demand requests to producers. This is incompatible with how current HTTP clients work.

benwilson512 commented 8 years ago

In case this isn't clear, we will not be providing an option to have HTTP chunks streamed to a process inbox.

I believe such a capability could in fact be added without any change on our part by wiring up the right kind of HTTP adapter, but we have no intent on supporting, building, or otherwise providing that.

heri16 commented 8 years ago

@ericmj if you are referring to the :stream_to on this page (https://hexdocs.pm/httpoison/HTTPoison.html), then yes. But if you look carefully at this other page (https://github.com/benoitc/hackney), it is obvious that the concept of multipart is not the same as a stream. See stream_body and stream_next; they are both demand-driven.

heri16 commented 8 years ago

@benwilson512 I am not saying that chunks should be sent to a process inbox. That could overflow the mailbox and is bad. See the demand-driven approach of receiving on the socket only when asked to, here: https://github.com/benoitc/hackney . In general, the Erlang guys are better than the Elixir guys at understanding how BEAM sockets work.

Note 1: When {async, once} is used the socket will receive only once. To receive the other messages use the function hackney:stream_next/1.
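The {async, once} contract can be illustrated with plain processes, no hackney involved: the producer sends at most one chunk per explicit demand, which is what keeps the mailbox bounded. A pure-Elixir simulation (module names and message shapes are made up for the sketch):

```elixir
# Simulated demand-driven producer: one chunk per :next request,
# mirroring the shape of hackney's {:async, :once} plus stream_next/1.
defmodule DemandProducer do
  def start(chunks, consumer) do
    spawn(fn -> loop(chunks, consumer) end)
  end

  # No chunks left: tell the consumer we're done.
  defp loop([], consumer), do: send(consumer, :done)

  # Wait for explicit demand before sending the next chunk.
  defp loop([chunk | rest], consumer) do
    receive do
      :next ->
        send(consumer, {:chunk, chunk})
        loop(rest, consumer)
    end
  end
end

defmodule DemandConsumer do
  # Pull all chunks, issuing one demand at a time, so at most one
  # chunk is ever in flight toward our mailbox.
  def collect(producer, acc \\ []) do
    send(producer, :next)

    receive do
      {:chunk, chunk} -> collect(producer, [chunk | acc])
      :done -> Enum.reverse(acc)
    end
  end
end

producer = DemandProducer.start(["a", "b", "c"], self())
DemandConsumer.collect(producer)
#=> ["a", "b", "c"]
```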

ericmj commented 8 years ago

That's great, that means you can use hackney to build this. It doesn't sound like @benwilson512 is currently interested in building or maintaining this feature.

In general, it's better to build the feature yourself instead of repeatedly asking the maintainer to build it for you.

heri16 commented 8 years ago

Not surprised that Elixir's HTTPoison has gotten it wrong, while the underlying Erlang hackney has gotten it right by supporting demand-based fetching. If it is indeed crucial for ex_aws to maintain generic compatibility with all Elixir HTTP libraries (all of which stream into the process mailbox), then maybe the right way forward is to raise this with the Elixir community and wait for its HTTP clients to adopt a demand-based approach (using gen_stage, or by exposing the fetch-next function of the underlying Erlang library). @josevalim

heri16 commented 8 years ago

Alright, I will close this for now and open an issue on the httpoison page for GenStage (or demand-based fetching of chunks). When most Elixir HTTP libraries support acting as a GenStage producer, we could reopen this. Thanks!

ericmj commented 8 years ago

The issue isn't support in HTTPoison; this library only supports hackney out of the box.

benwilson512 commented 8 years ago

I think there may be value to this approach. Given that it's something of a secondary feature, I think limits on which HTTP libraries are supported are acceptable for now.

I can't say I'm a big fan of the Elixir developers vs Erlang developers comparison. For one thing I consider myself an Elixir developer, and while we have had some disagreements about various implementation concerns I am nonetheless still working very hard here to provide a feature you asked for.

For another thing, it prompted you to misquote me. The stated goal is not to support "all Elixir HTTP libraries". HTTPoison isn't even the default library anymore; hackney is. The goal is to provide a simple, generic HTTP interface so that anyone can use whatever library they're already using, Elixir or otherwise.

GenStage is still experimental, I would not expect HTTPoison to use it for this. I'm not entirely convinced it's the right tool for what you want from hackney anyway.

Lastly, as I've become more familiar with the API in and around GenStage, I'm skeptical that it makes sense to return one as the result of an ExAws.stream! call. I will be looking into using Stream.resource to wrap the hackney demand-driven streaming. I think any integrations ExAws does via GenStage will need an API different from ExAws.stream!
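A rough sketch of that Stream.resource shape, with an in-memory chunk list standing in for the demand-driven socket (this is not the eventual ExAws implementation; in the real version, the start function would open the hackney request with {:async, :once}, the next function would call :hackney.stream_next/1 and await one chunk, and the cleanup function would close the connection):

```elixir
# Stream.resource/3 gives a lazy Enumerable whose next_fun runs only
# when a consumer demands another element, matching demand-driven
# streaming. An in-memory list stands in for the socket here.
stream =
  Stream.resource(
    # start_fun: "open" the source.
    fn -> ["chunk1", "chunk2", "chunk3"] end,
    # next_fun: emit exactly one chunk per demand, halt when empty.
    fn
      [] -> {:halt, []}
      [chunk | rest] -> {[chunk], rest}
    end,
    # after_fun: "close" the source.
    fn _state -> :ok end
  )

Enum.to_list(stream)
#=> ["chunk1", "chunk2", "chunk3"]
```

Because the enumerable is lazy, Stream.take/2, Enum.find/2, and friends stop demanding chunks as soon as they are satisfied, which is exactly the "stop when a word is found" use case from earlier in the thread.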

josevalim commented 8 years ago

In general, the Erlang guys are better than the Elixir guys at understanding how BEAM sockets works.

So go use Erlang. I am having a hard time understanding why you are so interested in ExAws while believing we generally don't understand how the BEAM works. There is probably an AWS library in Erlang; since they do a better job than us, go use that. The languages are interoperable for a reason.

José Valim, Founder and Director of R&D, Plataformatec (http://www.plataformatec.com.br/)

heri16 commented 8 years ago

That is not the response I was expecting from the respectable @josevalim . Please understand the context here: it took us almost 20 posts just to establish what is meant by this enhancement, which is odd and pretty sad to see. Time and again I have tried to explain that this issue is not lazy enumeration, nor is it concurrent download of chunks, which is something I already opened another issue for some time ago (#173). My intention in making that comparison was not to offend, but to highlight the fact that there is a lot of confusion about what a streaming implementation of S3.get_object over one socket means. Sorry if I offended anyone.

bruce commented 8 years ago

@heri16 I think you really need to review your tone during this conversation. I don't see how "the Erlang guys are better at understanding how BEAM sockets work" is at all conducive to a good relationship with a project's maintainers, a programming language community, or a language designer. Even in this latest comment, you want to make sure to point out that it's "sad" that you "time and again" had to explain something (rather than that you see value in collaborating on a solution). Perhaps your time would have been better spent opening a PR and simply contributing to the project than trying to educate those who are, from your tone, clearly not at your level.

heri16 commented 8 years ago

My full intent and purpose is to ensure that this Elixir library is on par with, if not greater than, the competition, which is the spirit of open source. See #179, where I was trying to raise an important issue that we all should be aware of. I was not expecting such hostility over this, because so far the Elixir community has been great, open and friendly.

heri16 commented 8 years ago

@bruce thanks for the input. I am still learning and new to all of this, and am in no way at any level at all.

benwilson512 commented 8 years ago

I think the ultimate source of this issue is that when you brought up streaming for S3.get_object, what you had in mind from the very beginning was surfacing hackney-style HTTP streaming. The problem is, a Stream is already a well-defined thing in ExAws: it means a lazily initialized enumerable returned when you pass an S3.get_object operation to ExAws.stream!.

I have been conducting the rest of this discussion with the understanding that whether to use hackney HTTP streaming or multiple concurrent requests is an implementation detail of that enumerable.

It seems what we really have are 2 separate feature requests:

1) S3.get_object |> ExAws.stream! #=> enumerable

2) S3.get_object |> ExAws.request!(http_opts: [stream_to: self, async: :once]) #=> {:ok, hackney_ref}

I will be breaking those up into two different issues and we can further the discussion there.