Cronet request lifecycle not behaving correctly with Apollo.

duzinkie commented 6 months ago

Version

3.8.4

Summary

I've adapted https://github.com/apollographql/apollo-kotlin-tutorial to use https://github.com/google/cronet-transport-for-okhttp (https://github.com/duzinkie/apollo-kotlin-tutorial/tree/duzinkie/cronet-lifecycle-repro) and have observed the app is not functioning correctly when Cronet is used as an http engine.

Let me elaborate: The app built from https://github.com/duzinkie/apollo-kotlin-tutorial/tree/duzinkie/cronet-lifecycle-repro is working fine, until one tries to adjust the logging level of HttpLoggingInterceptor (https://github.com/duzinkie/apollo-kotlin-tutorial/commit/9b3e58ab1dbdb9e2a3151a43e3cdb44531e5b079#diff-518bd35be782210b23f518eaf618c538f99e9f33a81f6b518ccb053b4315eff1R94) - when that happens 2 distinct failure modes occur, depending on whether http batching is enabled.

if batching is enabled: https://github.com/duzinkie/apollo-kotlin-tutorial/commit/9b3e58ab1d - cronet engine does not call the method attached to it in the request finished listener in https://github.com/duzinkie/apollo-kotlin-tutorial/commit/9b3e58ab1dbdb9e2a3151a43e3cdb44531e5b079#diff-518bd35be782210b23f518eaf618c538f99e9f33a81f6b518ccb053b4315eff1R48
if batching is not enabled - all the requests are considered cancelled (although from the UI point of view - everything seems to be functioning just fine (app is responsive, renders fetched data etc)

side notes:

I've recently submitted https://github.com/apollographql/apollo-kotlin/issues/5847 but since it's about a different issue, I've decided to create a new ticket - I can move this to the other one though if that's your preference.
I'm able to reproduce the same failure modes in the app I'm working on integrating with cronet (with small distinction that in our app, when batching is enabling we don't see the app stalling completely after a few requests - nonetheless cronet lifecycle events are not being fired (except for maybe 1-2% of requests, rather unpredictably).

Steps to reproduce the behavior

Clone https://github.com/duzinkie/apollo-kotlin-tutorial/tree/duzinkie/cronet-lifecycle-repro
Build the app, interact with it, observe it's working fine, and logs contain GQL calls as well as cronet request finished: 0 lines from the cronet http engine (indicating requests are finishing just fine)
Adjust Apollo.kt file in the cloned project: a. change the logging level of HttpLoggingInterceptor to BASIC b. disable httpBatching
Observe the app behavior and logs in both 3a and 3b cases:
- in 3a - cronet request finished messages are absent and the app hangs after a few interactions
- in 3b - all requests are marked as cancalled (as indicated by cronet request finished: 2 log messages

Logs

I think reproduction steps is enough for this issue, lmk if you'd like to see my logs anyway.

duzinkie commented 6 months ago

Posting some of my analysis for the "http batching turned on scenario"

Peeking at https://github.com/google/cronet-transport-for-okhttp/blob/master/java/com/google/net/cronet/okhttptransport/OkHttpBridgeRequestCallback.java one can observe that the compatibility layer between Cronet and Okhttp as implemented there tries to be as lazy as possible - in particular - it will not actually call "request.read" unless users of those library actually attempt to read the response body (this is consistent with the "everything behaves" fine observed when request bodies are being logged).
As per cronet docs - https://developer.android.com/develop/connectivity/cronet/reference/org/chromium/net/UrlRequest.Callback.html#public-abstract-void-onreadcompleted-urlrequest-request,-urlresponseinfo-info,-bytebuffer-bytebuffer

public abstract void onReadCompleted (UrlRequest request, UrlResponseInfo info, ByteBuffer byteBuffer) - this is expected client behavior.

Invoked whenever part of the response body has been read. Only part of the buffer may be populated, even if the entire response body has not yet been consumed. With the exception of onCanceled(), no other UrlRequest.Callback method will be invoked for the request, including onSucceeded() and onFailed(), until UrlRequest.read() is called to attempt to continue reading the response body.

The JSON parsing logic of BatchingHttpInterceptor.kt#183 and it's implementation in JsonReaders.kt#25 seems to stop early and not read the JsonReader.Token.END_DOCUMENT which in turn never gets to signal to the underlying HTTP engine to continue reading the request, thus never getting to realize that there's actually nothing more to read, and firing all the lifecycle events associated with that. That also explains why the app would behave as if it received all the data.
a simple hackfix I've attempted and verified to work, changing the batching interceptor code around parsing JSON from:

 val list = AnyAdapter.fromJson(BufferedSourceJsonReader(responseBody), CustomScalarAdapters.Empty)

to

val responseReader = BufferedSourceJsonReader(responseBody)
val list = AnyAdapter.fromJson(responseReader, CustomScalarAdapters.Empty)
responseReader.hasNext()

appears to make the app behave correctly.

ofc I realize this is not a proper fix - but I wanted to post my findings at the end of the day. Perhaps a more robust one should verify, that after reading the batched operation responses array, end document token appears - happy to hear your thoughts/suggestions on this. Also I have not looked into the scenario where HttpBatching is not enabled, and what causes all requests to be interpreted as cancelled by Cronet (I suspect json parsing also to be at play here, but I'm simply out of time for today).

martinbonnin commented 6 months ago

Thanks for the detailed issue and reproducer ❤️.

The JSON parsing logic of BatchingHttpInterceptor.kt#183 and it's implementation in JsonReaders.kt#25 seems to stop early and not read the JsonReader.Token.END_DOCUMENT which in turn never gets to signal to the underlying HTTP engine to continue reading the request, thus never getting to realize that there's actually nothing more to read, and firing all the lifecycle events associated with that. That also explains why the app would behave as if it received all the data.

Indeed, with JSON we can know that an object/list is terminated without reading the end of stream. This would allow to read several objects concatenated in the same stream. Not that it's the case here but the API allows for that.

What should happen is that the BatchingHttpInterceptor reads the JSON response and emits the GraphQL responses upstream. When done, the consumer cancels the Flow, which in turns cancels the request. All in all, I would say that getting cancellation before reading the end of stream is probably expected at this point.

This also means that if a server sends garbage after a JSON response, ApolloClient is lenient to that and will just ignore the garbage. We could probably change that to be more strict but I'm not 100% sure about the implications.

Do you have issues besides lifecycle events (request finished) not being fired? How bad would it be to just ignore those missing events (and the cancelled ones as well?)

duzinkie commented 6 months ago

Thanks for the detailed issue and reproducer ❤️.

np. fun fact - my team told me we're the og authors of batching implemented in apollo-kotlin :)

Indeed, with JSON we can know that an object/list is terminated without reading the end of stream. This would allow to read several objects concatenated in the same stream. Not that it's the case here but the API allows for that.

I didn't know that, my knowledge of gql is rather limited - what else could be there in the response? Wouldn't that indicate that it's no longer a JSON document (but something like JSONL instead?) - and so in turn, wouldn't that be discernible by content-type and/or some other parts of the response?

also the way I understand the current batching interceptor impl, would anyone really have a chance to access the rest of the response?

What should happen is that the BatchingHttpInterceptor reads the JSON response and emits the GraphQL responses upstream. When done, the consumer cancels the Flow, which in turns cancels the request. This also means that if a server sends garbage after a JSON response, ApolloClient is lenient to that and will just ignore the garbage. We could probably change that to be more strict but I'm not 100% sure about the implications.

Is that consistent with the apollo client behavior when there's no batching turned on? (e.g: would extra content beyond the first JSON "entity" also be ignored there?)

All in all, I would say that getting cancellation before reading the end of stream is probably expected at this point.

Just to be clear - that's not the current behavior. I've described the cancellations being fired for all the requests, but only when batching is turned off. I haven't looked at the root cause of that yet.

Do you have issues besides lifecycle events (request finished) not being fired? How bad would it be to just ignore those missing events (and the cancelled ones as well?)

In the case of the example app, it just stalls after a few requests, so that can't be ignored. Haven't observed that in our app, but I'm guessing regardless, cronet http engine itself must have some limits on requests considered in-flight.
Ignoring cancellations would be more digestible but still presents a problem of not being able to distinguish between actual cancellations (triggered, by let's say - a user navigating away from the ui that a request originated from) and those "normal batching cancellations" - we look at data like this as one of the inputs in assessing our apps health across releases. My preference would be to not signal abnormal exit if there's not really anything wrong happening (or at least have clients be able to opt into how they'd like to handle that)
We still have the other issue reported in https://github.com/apollographql/apollo-kotlin/issues/5847

martinbonnin commented 6 months ago

fun fact - my team told me we're the og authors of batching implemented in apollo-kotlin :)

It's a small world :)

what else could be there in the response? Wouldn't that indicate that it's no longer a JSON document (but something like JSONL instead?) - and so in turn, wouldn't that be discernible by content-type and/or some other parts of the response?

My point was entirely theoretical based on low level JsonReader API.

At a high level, no GraphQL server that I know of, batching or not batching, sends data after the payload. So it should be safe to consume that.

May I suggest a fix like this?

      val list = responseBody.jsonReader().use {jsonReader ->
        // TODO: this is most likely going to transform BigNumbers into strings, not sure how much of an issue that is
        AnyAdapter.fromJson(jsonReader, CustomScalarAdapters.Empty).also {
          if (jsonReader.peek() != JsonReader.Token.END_DOCUMENT) {
            println("Apollo: extra tokens after payload")
          }
        }
      }

If it works for you, feel free to open a PR and I'll review it but I probably won't have time to dive into this myself as it's KotlinConf busy busy times right now.

duzinkie commented 6 months ago

May I suggest a fix like this? If it works for you, feel free to open a PR

Ty, yeah, I think something in that vein is ok. I'll take a closer look at the cause of cancellations when batching is off, and also revisit the old issue I've submitted, but should be able to squeeze in a PR, even if it's just for the batching interceptor.

busy busy times right now.

np, your responses have been very helpful. I'm traveling next week as well, though not for KotlinConf :). This is is not affecting us in any severe way, and a path forward looks relatively clear now.

duzinkie commented 6 months ago

short update on the "requests cancelled when http batching is not enabled"

it's sort of a similar case. excepts from stacktrace suggest:

google/net/cronet/okhttptransport/OkHttpBridgeRequestCallback.java#L284
that's due to automatic close at the end of use block in [apollo3/api/internal/ResponseParser.kt#L24] (https://github.com/apollographql/apollo-kotlin/blob/release-3.x/libraries/apollo-api/src/commonMain/kotlin/com/apollographql/apollo3/api/internal/ResponseParser.kt#L24)
and that in turn is called by HttpNetworkTransport.kt#L91
- mulitpleResponses could possibly behave the same, although the parsing logic there is significantly different

I haven't experimented with a "hack fix" as replacing the entire transport/parser seems more involved than just an interceptor, but my bet is this is the cause.

Anyways, in terms of PRs imo that should be a different one. And perhaps it could be similar - it doesn't sound unreasonable for an object called ResponseParser to assert that it has read the entire response before closing the buffer (though ofc, how to handle cases where it had not is something to be discussed, especially from compatibility perspective). idk, perhaps the control of the reader should be pulled higher up the call stack.

In any case, this is a secondary issue.

martinbonnin commented 6 months ago

it doesn't sound unreasonable for an object called ResponseParser to assert that it has read the entire response before closing the buffer (though ofc, how to handle cases where it had not is something to be discussed, especially from compatibility perspective)

My initial thoughts on this are that both should check for the EOF token after reading batched list/response. And if not EOF then just ignore the trailing token. We might need a function that consumes a jsonReader till the end. The way it is done right now, it requires a bit of parsing.

duzinkie commented 6 months ago

Hello, I've published 4 PRs for the main/3-x branches (the fixes are the same though). Idk why, but github misses one in the list of references, so here are all the links:

https://github.com/apollographql/apollo-kotlin/pull/5892 https://github.com/apollographql/apollo-kotlin/pull/5893 https://github.com/apollographql/apollo-kotlin/pull/5894 https://github.com/apollographql/apollo-kotlin/pull/5895

I'm in the process of getting approval for accepting CLA. in the meantime, happy to hear your thougths.

duzinkie commented 6 months ago

Now, that I've created the PRs (and after some internal discussions with my team), I'm considering if it's the right choice to just fail silently on extra payload in content, would like to discuss that part once more.

From the compatibility standpoint, ofc that would introduce exceptions on paths where those weren't thrown before. On the other hand, any other "json not valid" would throw an exception so this could be seen as just fixing an inconsistency in the already existing strict approach to such errors. Additionally, like you said, in practical applications, it will always be end-of-document, so to an extent the debate is about which form of no-op is best :)

martinbonnin commented 6 months ago

I'm considering if it's the right choice to just fail silently on extra payload in content, would like to discuss that part once more.

Indeed, that's a fair point. I was going to quote the robustness principle but this is more and more questioned. See https://www.ietf.org/archive/id/draft-iab-protocol-maintenance-05.html for an example. Being strict does have its advantages.

The thing is I don't think this behaviour is specified anywhere (and even less for batching) so if we want to be strict there, I'd rather make sure we're not breaking someone elses workflow. Might be worth raising the concern in the GraphQL over HTTP spec ?

duzinkie commented 6 months ago

I'll try to clarify that in the spec repo - https://github.com/graphql/graphql-over-http/issues/292

as far as I can read/understand the spec:

When a server receives a well-formed GraphQL-over-HTTP request, it must return a well‐formed GraphQL response

https://graphql.github.io/graphql-over-http/draft/#sec-Response

A response to a GraphQL request must be a map.

https://spec.graphql.org/draft/#sec-Response

it boils down to whether including anything that's semantically different from "end of document" breaks the "well-formedness" of the response.

martinbonnin commented 6 months ago

When a server receives a well-formed GraphQL-over-HTTP request, it must return a well‐formed GraphQL response

Looks like a pretty good case that trailing data is an error and should be rejected by the client. Let's give it a couple of days to see what everyone thinks about this but I'm more and more team "let's make this an error"

martinbonnin commented 6 months ago

@duzinkie Let's make it an error? I just added a comment in your main PR here.

Benjie has a good point that if there is a lot of whitespace (and no trailing token), the response will now be delayed until all the whitespace is consumed but this is the case everywhere in the json response, not only at the end so backends should be aware of whitespace in all cases. What's more, it should reasonnably never happen in practice.

martinbonnin commented 5 months ago

@duzinkie this is all merged in and will be available in the next release.

BoD commented 4 months ago

Fix is in v4.0.0-rc1 and v3.8.5.

github-actions[bot] commented 4 months ago

Do you have any feedback for the maintainers? Please tell us by taking a one-minute survey. Your responses will help us understand Apollo Kotlin usage and allow us to serve you better.

apollographql / apollo-kotlin