arduino-libraries / ArduinoHttpClient

Arduino HTTP Client library

Add support for chunked encoding and larger bodies #9

Closed: zonque closed this 7 years ago

zonque commented 8 years ago

This adds support for chunked transfer encodings and larger body writes. It also makes HttpClient::responseBody() more robust, similar to what #7 does.

Replaces / conflicts with #7. Fixes #3.

zonque commented 7 years ago

Is this fork actively maintained? As most modern web server implementations send content in chunks, I would very much like to see these patches merged. Which of the many forks is considered the canonical upstream?

sandeepmistry commented 7 years ago

Hi @zonque,

Thanks for submitting this pull request!

I think it would be better to integrate the chunked encoding support in HttpClient::read() and HttpClient::available(). HttpClient::responseBody() is a higher-level convenience API. Then the behaviour in #7 can be left alone.

What do you think?

zonque commented 7 years ago

I think it would be better to integrate the chunked encoding support in HttpClient::read() and HttpClient::available(). HttpClient::responseBody() is a higher-level convenience API. Then the behaviour in #7 can be left alone.

Hmm. I played around with that idea and I'm not sure whether I agree. The chunk reader must know about higher-level details such as body vs. header state, and calling up into those layers from read() and available() seems like a layering violation. Also, from a user's perspective, I'd rather expect HttpClient::read() to return the actual bytes on the wire, including the chunk marks, and only the higher-level API calls to abstract them away and return the pure payload.

sandeepmistry commented 7 years ago

Fair point.

@cmaglie @facchinm @damellis @tigoe what do you think about this?

The two approaches we are discussing for supporting chunked HTTP responses are as follows:

1) The library handles the chunked response internally in HttpClient::read() and HttpClient::available(), so users of the library don't have to deal with any chunk handling in their code.

2) HttpClient::read() and HttpClient::available() return whatever is sent on the wire, including the chunk marks. HttpClient::responseBody() is the only API that handles chunked responses.
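To make the trade-off concrete, here is a rough user-side sketch of what each option would mean for a chunked response (illustrative only; the surrounding request setup is omitted and `client` is assumed to be an HttpClient that has already sent its request):

```cpp
// Option 1: read()/available() de-chunk transparently, so this loop yields
// pure payload bytes and never sees the hex chunk-size lines.
client.skipResponseHeaders();
while (client.available()) {
  char c = client.read();
  Serial.print(c);
}

// Option 2: read() keeps returning the raw wire bytes (chunk sizes, CRLFs
// and all); only the higher-level call decodes them.
String body = client.responseBody();
```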

tigoe commented 7 years ago

I'd prefer that chunks were handled internally, without the user needing to be involved. But if we wanted to compromise, maybe there could be an optional parameter for read() to get chunks? Or maybe a function for reading in chunks?

zonque commented 7 years ago

I'm confused. That would mean even for non-chunked encodings, ::read() needs to understand whether the client is currently reading the header or the body. How would a user use ::read() to access the headers then?

Most of my uncertainty here comes from an unfinished attempt to implement the logic in ::read() and ::available(), and it feels really wrong, at least when it's based on the current implementation. ::skipResponseHeaders() and friends also call into ::read(), while ::read() itself must know whether the headers have already been processed etc.

tigoe commented 7 years ago

Conceptually what I'm imagining is that you can read() and see available() for the content without chunk marks, as I think that will be the most common use case. But I can see a need for certain advanced users to see the chunks. I'm agnostic as to how that's done.

zonque commented 7 years ago

Conceptually what I'm imagining is that you can read() and see available() for the content without chunk marks, as I think that will be the most common use case. But I can see a need for certain advanced users to see the chunks.

And for the headers, presumably. And FWIW, chunk decoding only applies to the body, not the headers.

I'm agnostic as to how that's done.

IMO, providing both the high-level and the low-level interface through the same method, or an overloaded version thereof, is a hack. I really think users can't have it both ways: either all the complexity is abstracted away through ::read(), in which case all they'll get is the processed HTTP body data, or ::read() returns the raw data, and higher-level functions such as ::responseBody() take care of the rest. The latter is in line with how the API looks right now.

But that's just my opinion. We can as well agree to disagree on this :)

tigoe commented 7 years ago

No need to agree to disagree. I'm fine with putting chunk management in different functions than read() and available(). What would you propose for an API that allows the user to handle the response chunk by chunk?

zonque commented 7 years ago

No need to agree to disagree.

:)

I'm fine with putting chunk management in different functions than read() and available(). What would you propose for an API that allows the user to handle the response chunk by chunk?

If that should be needed, HttpClient::responseBody() could gain a boolean parameter with a default value, i.e. HttpClient::responseBody(bool oneChunkOnly = false) or something along those lines?

tigoe commented 7 years ago

Sounds good; presumably then you'd need to call for the chunks repeatedly, right? Can you make up a snippet that shows how it might be used in a typical use case? I confess I've never had need to get the individual chunks in any HTTP app, so I'm having trouble imagining a good use case. Though I have a good example of node.js reading chunks which we could borrow ideas from if that helps.

zonque commented 7 years ago

Sounds good; presumably then you'd need to call for the chunks repeatedly, right? Can you make up a snippet that shows how it might be used in a typical use case? I confess I've never had need to get the individual chunks in any HTTP app, so I'm having trouble imagining a good use case. Though I have a good example of node.js reading chunks which we could borrow ideas from if that helps.

That's not what I needed it for, but you could take streaming as one example. I actually never thought about providing an interface for that use case, but yes, we definitely should.

Maybe it even makes sense to add HttpClient::readChunk(), and then make ::responseBody() call into that repeatedly until the body is exhausted. I'll resurrect the hardware in the next few days and play around with the options.
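A rough sketch of how those pieces could fit together (purely hypothetical; readChunk() does not exist in the library, and its name and signature are just placeholders):

```cpp
// int HttpClient::readChunk(uint8_t* buf, size_t bufSize);
//   -> would copy one chunk's worth of payload into buf and return the number
//      of bytes copied, or -1 once the terminating zero-length chunk is seen.

// responseBody() could then simply drain the chunks; the existing
// Content-Length code path would stay untouched and is omitted here.
String HttpClient::responseBody() {
  String body;
  uint8_t buf[64];
  int n;

  skipResponseHeaders();
  while ((n = readChunk(buf, sizeof(buf))) > 0) {
    for (int i = 0; i < n; i++) {
      body += (char)buf[i];
    }
  }
  return body;  // de-chunked payload only
}
```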

sandeepmistry commented 7 years ago

I'm just catching up on the activity :)

For chunk-by-chunk reads, isn't that close to what the Stream API would do? available() would return the chunk size, and read() would read into it?

Another idea is a readRaw() API, for those wanting to read the raw network bytes. I'm not sure how useful this will be. This concept could be used for GZIP responses if we ever support them.

zonque commented 7 years ago

For chunk-by-chunk reads, isn't that close to what the Stream API would do? available() would return the chunk size, and read() would read into it?

But again, not all transfers are chunked; there are the good old `Content-Length`-based ones as well.

Also, currently, ::read() would return the raw bytes, and this library has active users. Changing the behavior under the hood will likely break existing use-cases.

Another idea is a readRaw() API, for those wanting to read the raw network bytes. I'm not sure how useful this will be.

Same thing here. Yes, if you started fresh, that would be an option. I just don't see how ::read() would possibly remain backwards-compatible and support chunked reads, without turning the whole code-base into a stateful mess.

This concept could be used for GZIP responses if we ever support them.

As things currently stand, transparent gzip support (which again only applies to the body, not the headers) needs to live in the higher-level APIs. But there, it is totally possible.

tigoe commented 7 years ago

Maybe it even makes sense to add HttpClient::readChunk(), and then make ::responseBody() call into that repeatedly until the body is exhausted. I'll resurrect the hardware in the next few days and play around with the options.

Seems reasonable to me. Would you still use available() to know when the body's exhausted, or use another method?

zonque commented 7 years ago

Seems reasonable to me. Would you still use available() to know when the body's exhausted, or use another method?

::available() would tell you if it's worth calling into ::readChunk() or ::read() again. ::readChunk() would return -1 once the game is over, I presume. But let's see.
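In a sketch, that loop could look roughly like this (again, readChunk() is only a proposal at this point, not part of the current API):

```cpp
// Hypothetical usage -- `client` is an HttpClient whose request has already
// been sent; readChunk() is the proposed, not-yet-existing method.
uint8_t buf[64];

client.skipResponseHeaders();
while (client.available()) {
  int n = client.readChunk(buf, sizeof(buf));
  if (n < 0) {
    break;               // terminating zero-length chunk: body is exhausted
  }
  Serial.write(buf, n);  // stream the payload as it arrives
}
```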

damellis commented 7 years ago

Just to throw in a random opinion, I'd like to see read() transparently merge the chunks and return the unified content. Yes, this is different from the current behavior, but I'd argue that the current behavior is a bug.

What's the use case for wanting to read the raw stream (before the chunks are recombined)?

zonque commented 7 years ago

Just to throw in a random opinion, I'd like to see read() transparently merge the chunks and return the unified content. Yes, this is different from the current behavior, but I'd argue that the current behavior is a bug.

But there are existing users. I'm not the maintainer of this library, but if I were, I would strongly object to breaking backwards compatibility.

Also, again, ::read() is used to read the headers as well, which are never encoded in chunks and never gzip'ed. Such techniques only apply to the body, and hence I think it is completely okay to have a function that is specific to reading the body.

What's the use case for wanting to read the raw stream (before the chunks are recombined)?

You could use the library to just build the request, and then parse the response yourself.

sandeepmistry commented 7 years ago

But there are existing users. I'm not the maintainer of this library, but if I were, I would strongly object to breaking backwards compatibility.

A similar statement could be said about the changes proposed in this PR. HttpClient::responseBody() currently returns a String response that includes the chunk markers.

I agree with David's view that "the current behavior is a bug".

Having a different set of APIs for non-chunked and chunked responses would push more responsibility onto user sketches. The user who makes the HTTP request should not need to care about how the body is encoded (normal, chunked, gzipped) - in the end they want access to the unencoded body. In an ideal world, if the server changed the response type from non-chunked to chunked, an existing sketch should continue to work transparently.

If we do decide against changing read() and available() to transparently handle chunked responses, another option would be to create a ChunkedHttpClient class that specifically deals with only chunked responses.

zonque commented 7 years ago

A similar statement could be said about the changes proposed in this PR. HttpClient::responseBody() currently returns a String response that includes the chunk markers.

Okay, that's true. Strictly speaking, that's also a break.

I agree with David's view that "the current behavior is a bug".

Having a different set of APIs for non-chunked and chunked responses would push more responsibility onto user sketches. The user who makes the HTTP request should not need to care about how the body is encoded (normal, chunked, gzipped) - in the end they want access to the unencoded body. In an ideal world, if the server changed the response type from non-chunked to chunked, an existing sketch should continue to work transparently.

Then what would the API for reading a specific header look like?

If we do decide against changing read() and available() to transparently handle chunked responses, another option would be to create a ChunkedHttpClient class that specifically deals with only chunked responses.

No, I don't think that's a good idea. Whether or not a web server uses chunks to transmit the body should really not matter to the user. A web server may even decide to switch the behavior from one request to the next, and that's perfectly fine according to the RFC.

So - if you are willing to break the effect of existing library calls (I won't object), the question is really how users would access the raw bits and header information if it's no longer possible to do so through ::read().

sandeepmistry commented 7 years ago

So - if you are willing to break the effect of existing library calls (I won't object), the question is really how users would access the raw bits and header information if it's no longer possible to do so through ::read().

This is OK with me; we can bump the library's major version to indicate a breaking change, as well as update the change log. FWIW, the library is also marked as "experimental" in library.properties.

I'm thinking everything will act the same way as it does now, except that if the library detects the response body is chunked and all the headers have been read in, it will enable some extra logic in available() and read() to transparently de-chunk the response body. Let me know if this is clear; otherwise I would be happy to provide more details for different example scenarios.

The above idea does introduce extra state logic in the library, which increases complexity. However, it leads to a better Arduino UX.

tigoe commented 7 years ago

I'm thinking everything will act the same way as it does now, except that if the library detects the response body is chunked and all the headers have been read in, it will enable some extra logic in available() and read() to transparently de-chunk the response body. Let me know if this is clear; otherwise I would be happy to provide more details for different example scenarios.

What would that extra logic look like? To the end user, or in the library itself?

sandeepmistry commented 7 years ago

What would that extra logic look like? To the end user, or in the library itself?

It would be in the library itself.

For example, if the first chunk size is 4 and all the headers have been processed, available() will return 4 (or less if wifiClient.available() has fewer bytes buffered). Once all the bytes in the current chunk have been read in, available() will try to read in the next chunk size, including the \r\n that follows the chunk size on the wire.
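To make that concrete, the bookkeeping could look something like this (a rough sketch only, not actual library code; the member variables and the readChunkSizeLine() helper are invented names):

```cpp
int HttpClient::available() {
  if (!iIsChunked || !iHeadersComplete) {
    return iClient->available();          // non-chunked path stays as-is
  }

  if (iChunkBytesRemaining == 0) {
    // Previous chunk fully consumed: parse the next hex chunk-size line,
    // including its trailing \r\n, straight off the underlying client.
    iChunkBytesRemaining = readChunkSizeLine();
    if (iChunkBytesRemaining == 0) {
      return 0;                           // zero-length chunk ends the body
    }
  }

  // Report at most what is left in the current chunk, and no more than the
  // underlying client actually has buffered right now.
  int buffered = iClient->available();
  return (buffered < iChunkBytesRemaining) ? buffered : iChunkBytesRemaining;
}
```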

tigoe commented 7 years ago

Sounds like a good approach to me.

sandeepmistry commented 7 years ago

Great!

@zonque are you ok with the proposal above? If so, can we close this PR and merge #7 for now?

We have a few other projects on the go at the moment, so we won't be able to work on this in the short term. However, any pull requests for the changes above are welcome.

sandeepmistry commented 7 years ago

I'm closing this for now, as @agdl has merged #7.