getindata / flink-http-connector

HTTP Connector for Apache Flink. Provides sources and sinks for the DataStream, Table and SQL APIs.
Apache License 2.0

HTTP connect timed out error #66

Closed louiscb closed 8 months ago

louiscb commented 11 months ago

We are using v0.9 of this library on AWS managed Apache Flink. About 50% of our requests end with the java.net.http.HttpConnectTimeoutException: HTTP connect timed out shown below. The curious thing is that we aren't actually seeing those requests reach the service behind the endpoint; the timeout appears to occur within Flink, and the requests don't seem to be sent out at all.

Is it possible that an external factor is causing the timeout error to occur in this library, apart from the actual HTTP request to the endpoint timing out? We also see an error log message such as: Http Sink failed to write and will retry 16 requests. But from reading the documentation and the code, it seems that the library doesn't actually retry timed-out requests?

Request fatally failed because of an exception

java.util.concurrent.CompletionException: java.net.http.HttpConnectTimeoutException: HTTP connect timed out
    at java.base/java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367)
    at java.base/java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376)
    at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1074)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
    at java.net.http/jdk.internal.net.http.Http1Exchange.lambda$cancelImpl$9(Http1Exchange.java:482)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.http.HttpConnectTimeoutException: HTTP connect timed out
    at java.net.http/jdk.internal.net.http.ResponseTimerEvent.handle(ResponseTimerEvent.java:68)
    at java.net.http/jdk.internal.net.http.HttpClientImpl.purgeTimeoutsAndReturnNextDeadline(HttpClientImpl.java:1248)
    at java.net.http/jdk.internal.net.http.HttpClientImpl$SelectorManager.run(HttpClientImpl.java:877)
Caused by: java.net.ConnectException: HTTP connect timed out
    at java.net.http/jdk.internal.net.http.ResponseTimerEvent.handle(ResponseTimerEvent.java:69)
    ... 2 more
kristoffSC commented 11 months ago

Hi @louiscb are you using the newest version? It will help me with debugging.

Regarding the retry, yep, it is not supported currently. The log line is misleading. We had retries at some point, but since there was no business use case for our client to handle them, we removed them with a plan to add them back in the future. The log line was left...

kristoffSC commented 11 months ago

Hi again @louiscb, I did some quick debugging and I think the problem might not be in the connector. I think what you see here is caused by the fact that your HTTP server might be overwhelmed. The timeout you see comes from Java's HTTP client. It means that the actual HTTP request was executed, hence sent.
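For reference, the stack trace above shows the timeout arriving wrapped in a CompletionException. The following is a minimal Java sketch of telling the two JDK timeout types apart by walking the cause chain. The class name TimeoutClassifier and its string labels are hypothetical and are not part of the connector.

```java
import java.net.http.HttpConnectTimeoutException;
import java.net.http.HttpTimeoutException;
import java.util.concurrent.CompletionException;

public class TimeoutClassifier {

    /** Hypothetical helper: walks the cause chain and names the kind of timeout found. */
    public static String classify(Throwable error) {
        Throwable current = error;
        while (current != null) {
            // Check the subclass first: HttpConnectTimeoutException extends HttpTimeoutException.
            if (current instanceof HttpConnectTimeoutException) {
                return "CONNECT_TIMEOUT";
            }
            if (current instanceof HttpTimeoutException) {
                return "REQUEST_TIMEOUT";
            }
            current = current.getCause();
        }
        return "OTHER";
    }

    public static void main(String[] args) {
        // Same shape as the exception in the stack trace above.
        Throwable wrapped = new CompletionException(
                new HttpConnectTimeoutException("HTTP connect timed out"));
        System.out.println(classify(wrapped)); // prints CONNECT_TIMEOUT
    }
}
```

Logging the classification next to the failed request can help when deciding whether to look at connection setup or at server-side processing first.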

The curious thing is that we aren't actually seeing those requests make it to the service with the endpoint,

You mean you don't see them in the HTTP server logs, right? I think you would see them in a TCP dump on the Flink node or the HTTP server node.

Do you know how many concurrent requests (burst size) your HTTP server can handle?

Before HTTP connector version 0.10.0, every event was sent as an individual REST call. In version 0.10.0 we changed that: each request now carries an array of events, so events are batched per request. This can decrease the number of concurrent calls to the HTTP endpoint.

The batch size is set to 500, but in practice it can be lower than this. It depends on the size of the individual messages and the time at which events arrive at the AsyncSinkWriter. Flink batches events in AsyncSink under the thresholds below:

Whichever is hit first triggers the AsyncSink write (HTTP call).

I ran a test with the old single-submission mode and blocked all threads on my HTTP endpoint; there were 4 of them, which means my endpoint could handle only 4 concurrent requests. On the connector side I had ~19 HTTP calls. 4 of them were pending on the blocked threads in the HTTP mock server and the rest were... in network limbo. They were sent, but the HTTP server did not pick them up because it had no resources for them. After ~30 seconds (the default HTTP sink timeout value) all 19 requests failed with a timeout.
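The experiment described above can be approximated with the JDK alone. The sketch below is not the connector's test code: the class name SaturatedServerDemo, the /events path, and the 2-second timeout (instead of ~30 seconds) are assumptions chosen to keep the run short. It starts a local server with a fixed pool of worker threads that never respond, fires more requests than there are workers, and counts how many fail with a timeout. Note that it sets a request timeout, so it produces HttpTimeoutException rather than the connect timeout from the original report.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class SaturatedServerDemo {

    /** Returns {timedOutRequests, handlersEntered}. */
    public static int[] run(int serverThreads, int requests, Duration timeout) throws Exception {
        CountDownLatch release = new CountDownLatch(1);
        AtomicInteger handlersEntered = new AtomicInteger();
        ExecutorService workers = Executors.newFixedThreadPool(serverThreads);

        // Port 0 lets the OS pick a free port on the loopback interface.
        HttpServer server = HttpServer.create(
                new InetSocketAddress(InetAddress.getByName("127.0.0.1"), 0), 0);
        server.createContext("/events", exchange -> {
            handlersEntered.incrementAndGet();
            try {
                release.await(); // simulate a worker thread that never frees up
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                exchange.close();
            }
        });
        server.setExecutor(workers);
        server.start();

        try {
            URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/events");
            HttpClient client = HttpClient.newBuilder()
                    .version(HttpClient.Version.HTTP_1_1)
                    .build();

            List<CompletableFuture<HttpResponse<String>>> pending = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                HttpRequest request = HttpRequest.newBuilder(uri)
                        .timeout(timeout) // per-request timeout
                        .POST(HttpRequest.BodyPublishers.ofString("{\"id\":" + i + "}"))
                        .build();
                pending.add(client.sendAsync(request, HttpResponse.BodyHandlers.ofString()));
            }

            int timedOut = 0;
            for (CompletableFuture<HttpResponse<String>> future : pending) {
                try {
                    future.join();
                } catch (CompletionException e) {
                    if (e.getCause() instanceof HttpTimeoutException) {
                        timedOut++;
                    }
                }
            }
            return new int[] {timedOut, handlersEntered.get()};
        } finally {
            release.countDown();   // unblock the workers
            server.stop(0);
            workers.shutdownNow(); // drop requests still queued behind the blocked workers
        }
    }

    public static void main(String[] args) throws Exception {
        int[] result = run(4, 19, Duration.ofSeconds(2));
        System.out.println("timed out: " + result[0] + ", handlers entered: " + result[1]);
    }
}
```

With 4 workers and 19 requests, at most 4 handlers are ever entered while the rest wait in the server's queue, which matches the "network limbo" behaviour described above.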

What you can do:

In the future, what we could have is the Rate Limiting strategy that was added to Async Sink in Flink 1.16.
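To illustrate the general idea of capping concurrent calls, here is a plain-Java sketch built on a Semaphore. It is not Flink's rate limiting API and not something the connector provides; the class InFlightLimiter and its methods are hypothetical names used only for this example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public class InFlightLimiter {

    private final Semaphore permits;

    public InFlightLimiter(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    /**
     * Blocks the caller until a permit is free, then starts the async call.
     * The permit is returned when the call completes, successfully or not.
     */
    public <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> call)
            throws InterruptedException {
        permits.acquire();
        try {
            return call.get().whenComplete((result, error) -> permits.release());
        } catch (RuntimeException e) {
            permits.release(); // the call never started, so give the permit back
            throw e;
        }
    }

    public int availablePermits() {
        return permits.availablePermits();
    }

    public static void main(String[] args) throws Exception {
        InFlightLimiter limiter = new InFlightLimiter(4);
        CompletableFuture<String> response = new CompletableFuture<>();
        limiter.submit(() -> response);
        System.out.println(limiter.availablePermits()); // prints 3
        response.complete("ok");
        System.out.println(limiter.availablePermits()); // prints 4
    }
}
```

Setting the limit at or below the number of requests the endpoint can serve concurrently keeps excess requests waiting on the client side instead of timing out on the wire.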

louiscb commented 8 months ago

Thanks @kristoffSC , it was indeed another issue external to this library 👍