apollographql / apollo-kotlin

:rocket:  A strongly-typed, caching GraphQL client for the JVM, Android, and Kotlin multiplatform.
https://www.apollographql.com/docs/kotlin
MIT License
3.77k stars 656 forks source link

Potential incompatibility with google/cronet-transport-for-okhttp when GQL batching is enabled #5847

Open duzinkie opened 7 months ago

duzinkie commented 7 months ago

Version

3.7.3

Summary

I'm researching moving our GQL clients to use cronet HTTP engine (to leverage http3/quic) and since we rely heavily on OkHttp, decided to use https://github.com/google/cronet-transport-for-okhttp and it's interceptor to minimize the changes to the application I'm working on.

(apologies if the description below sounds too vague/simplistic, I do not know the domain very well and might not have the right vocabulary)

While working on this transition, I have noticed, that, when using GQL batching, along with the interceptor from the package I just mentioned the app would very often (though not always) "hang" when sending requests, which I was able to trace up to some point, but am stuck. I've managed to see that:

I am able to circumvent the issue by providing an okhttp interceptor that reads the entire request and sets the content length to the length of what it just read before handing over to cronet (it's implementation is simple so I'll skip it)

This behavior occurs far less frequent when the app is being debugged and breakpoints are present at various stages, suggesting it might be some sort of deadlock/race condition, but I'm unable to pinpoint that exactly. CronetInterceptor does call [.allowDirectExecutor();](https://developer.android.com/develop/connectivity/cronet/reference/org/chromium/net/UrlRequest.Builder.html#allowDirectExecutor()) which is documented with this warning

Warning: This option makes it easy to accidentally block the network thread. It should not be used if your callbacks perform disk I/O, acquire locks, or call into other code you don't carefully control and audit.

but as mentioned before, I don't know the domain well enough to say if this is significant.

Lastly, I couldn't find any similar issues, or anything regarding usage of apollo and cronet (although I'm aware of the option of implementing HttpEngine directly), but still figured you might be interested to know that there's a potential incompatibility here. I've reported a twin issue in the https://github.com/google/cronet-transport-for-okhttp project (https://github.com/google/cronet-transport-for-okhttp/issues/36).

Of course I understand it can also be the case that it's neither apollo nor cronet-transport but our own integration with those that causes the problem, or that this doesn't align with your own roadmap, feel free to resolve as you see fit.

Let me know if you'd like me to share any more information, and thank you for any assistance you can provide.

Steps to reproduce the behavior

unfortunately my project is not open source. I could attempt reproducing in a smaller project if anyone investigating this issue cannot find anything obvious.

Logs

I don't have any logs and/or stacktraces atm.

BoD commented 7 months ago

Thanks for reporting this!

That's interesting. If there is an attempt somewhere down the line to read the body multiple times, having a quick look back at BatchingHttpInterceptor, I don't immediately see why this would cause the hang, but not 100% sure. I am not familiar with Cronet unfortunately so I couldn't tell why this happens either. The issue being not systematic doesn't help of course :)

Maybe it would make more sense to implement HttpEngine directly without going through the OkHttp adapter? That would at least eliminate one possible source of the issue - however, I don't know how easy that would be and don't want to send you to a wild goose chase.

I was also wondering if Batching is really useful in your case? The main reason to using it is to avoid the overhead of making an HTTP call per query, but maybe when using HTTP2 or HTTP3/Quic the advantage of doing so is slim? And I'm also curious if other than the issue with batching, using Cronet works well otherwise?

In any case, if you want to pursue this, indeed a small reproducer project would definitely help.

duzinkie commented 7 months ago

Maybe it would make more sense to implement HttpEngine directly without going through the OkHttp adapter? I was also wondering if Batching is really useful in your case?

Indeed, we're considering both. Using okhttp for now appears to be the "path of least resistance" in our use case and I'd imagine other apps could also the same choice and potentially end up in the same spot so at least this issue can act as "heads"-up for them.

In any case, if you want to pursue this, indeed a small reproducer project would definitely help.

I'll give that a try, but it might take a while as I'm traveling soon.

duzinkie commented 6 months ago

I'll give that a try, but it might take a while as I'm traveling soon.

I've not been able to reproduce this particular issue in an example project (suggesting maybe it's cause by something else in our networking stack). However, I was able to observe and reproduce some issue with cronet request lifecycles when using the same libraries - I've described those in https://github.com/apollographql/apollo-kotlin/issues/5885

martinbonnin commented 5 months ago

Closing in favor of #5885

github-actions[bot] commented 5 months ago

Do you have any feedback for the maintainers? Please tell us by taking a one-minute survey. Your responses will help us understand Apollo Kotlin usage and allow us to serve you better.

martinbonnin commented 5 months ago

Ooops, might have closed this a bit too fast. Maybe it was a separate issue? @duzinkie is there anything we can help with here?

duzinkie commented 5 months ago

Yes, afaik this is still affecting us, however I was unable to reproduce this in a smaller project so far so it's hard to move forward atm, but I've not spent much time on that reproduction. Iirc enabling batching wasn't enough to trigger the issue, it's possible that one also has to then make it send multiple requests at once (but like I said, can't confirm)

(we do have a workaround that flushes the entire request into a buffer, thus making google cronet integ layer behave well)