michaelzelaia opened 2 months ago
Hello, I also see many warnings from my KCL application similar to the first one posted in this issue. Could this affect the consumption or processing of data in any way?
It's not clear whether these exceptions result in data loss (I did not confirm this). I found that when they happened constantly, the application would eventually raise OOM exceptions and crash, so data loss is plausible, and consumption will certainly slow down.
In our case it was definitely autoscaling drowning the EC2 instance the client was running on (over 250 shards, each coming with its own handful of threads), so I would recommend keeping an eye on CPU usage unless you're running serverless. Once the client was up to speed, these issues ceased. We ultimately set the stream to a fixed number of shards to avoid this kind of problem in the future.
Hello!
We are using KCL version 2.5.8 in a Java client that runs in a Docker container. The image is Oracle Linux 7 slim and the JRE is Adoptium Temurin 21.0.2. We had been fetching data from a low-traffic-volume stream for quite some time without any issues. We recently connected a new worker (our client process is multi-threaded and we use the single-stream interface, so each of our client threads is a worker that fetches data from a single stream) to a new, higher-volume stream with a significant backlog of data. This stream is configured to auto-scale, and as of now it has 168 shards, based on what I've seen in the DynamoDB lease table.
At first it looked OK, but after a while we started getting a lot of Netty errors like the following:
After trying a custom httpClientBuilder for the Kinesis async client (NettyNioAsyncHttpClient.Builder) with custom maxConcurrency, maxPendingConnectionAcquires, connectionTimeout, and connectionAcquisitionTimeout values to no avail, we realised the machine was CPU-bound, and that was the main issue: parsing sometimes took far too long because the CPU had to deal with too many threads. We therefore created multiple EC2 machines to share the workload. Processing messages now takes on the order of hundreds of milliseconds, versus consistently taking several seconds before. While the mini-fleet has been able to plough through the backlog quickly and get up to speed, I can still see that a few of these errors occurred while processing the backlog, and I have a few questions in this regard:
For reference, this is how we build the Scheduler and the required clients:
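(Simplified sketch rather than our exact code; the region, the stream and application names, the Netty tuning values, and MyShardRecordProcessorFactory are all placeholders.)

```java
import java.time.Duration;
import java.util.UUID;

import software.amazon.awssdk.http.Protocol;
import software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.cloudwatch.CloudWatchAsyncClient;
import software.amazon.awssdk.services.dynamodb.DynamoDbAsyncClient;
import software.amazon.awssdk.services.kinesis.KinesisAsyncClient;
import software.amazon.kinesis.common.ConfigsBuilder;
import software.amazon.kinesis.coordinator.Scheduler;

public final class ConsumerSetup {
    public static void main(String[] args) {
        Region region = Region.EU_WEST_1; // placeholder region

        // Kinesis client with the custom Netty tuning mentioned above
        // (the values here are illustrative, not the ones we actually used).
        KinesisAsyncClient kinesisClient = KinesisAsyncClient.builder()
                .region(region)
                .httpClientBuilder(NettyNioAsyncHttpClient.builder()
                        .protocol(Protocol.HTTP2)
                        .maxConcurrency(500)
                        .maxPendingConnectionAcquires(10_000)
                        .connectionTimeout(Duration.ofSeconds(10))
                        .connectionAcquisitionTimeout(Duration.ofSeconds(60)))
                .build();

        DynamoDbAsyncClient dynamoClient = DynamoDbAsyncClient.builder().region(region).build();
        CloudWatchAsyncClient cloudWatchClient = CloudWatchAsyncClient.builder().region(region).build();

        ConfigsBuilder configsBuilder = new ConfigsBuilder(
                "my-stream",                         // placeholder stream name
                "my-application",                    // placeholder application / lease table name
                kinesisClient,
                dynamoClient,
                cloudWatchClient,
                UUID.randomUUID().toString(),        // worker identifier
                new MyShardRecordProcessorFactory()  // hypothetical factory, sketched below
        );

        Scheduler scheduler = new Scheduler(
                configsBuilder.checkpointConfig(),
                configsBuilder.coordinatorConfig(),
                configsBuilder.leaseManagementConfig(),
                configsBuilder.lifecycleConfig(),
                configsBuilder.metricsConfig(),
                configsBuilder.processorConfig(),
                configsBuilder.retrievalConfig()
        );

        // Scheduler implements Runnable; each worker runs on its own thread.
        new Thread(scheduler, "kcl-scheduler").start();
    }
}
```

In our setup each client thread builds one Scheduler like this against its own stream.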
This is our RecordProcessor:
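(Again a simplified sketch rather than our exact class; the payload handling is a placeholder, and the factory at the bottom is the hypothetical one referenced in the setup sketch above.)

```java
import java.nio.charset.StandardCharsets;

import software.amazon.kinesis.exceptions.InvalidStateException;
import software.amazon.kinesis.exceptions.ShutdownException;
import software.amazon.kinesis.lifecycle.events.InitializationInput;
import software.amazon.kinesis.lifecycle.events.LeaseLostInput;
import software.amazon.kinesis.lifecycle.events.ProcessRecordsInput;
import software.amazon.kinesis.lifecycle.events.ShardEndedInput;
import software.amazon.kinesis.lifecycle.events.ShutdownRequestedInput;
import software.amazon.kinesis.processor.ShardRecordProcessor;
import software.amazon.kinesis.processor.ShardRecordProcessorFactory;
import software.amazon.kinesis.retrieval.KinesisClientRecord;

public class MyShardRecordProcessor implements ShardRecordProcessor {

    @Override
    public void initialize(InitializationInput initializationInput) {
        // Called once when this worker takes the lease for a shard.
    }

    @Override
    public void processRecords(ProcessRecordsInput processRecordsInput) {
        for (KinesisClientRecord record : processRecordsInput.records()) {
            // Placeholder for our parsing; this is the CPU-bound part.
            String payload = StandardCharsets.UTF_8.decode(record.data()).toString();
            handle(payload);
        }
        try {
            // Checkpoint after each batch so a restart does not reprocess it.
            processRecordsInput.checkpointer().checkpoint();
        } catch (InvalidStateException | ShutdownException e) {
            // Lease lost or worker shutting down; skip this checkpoint.
        }
    }

    @Override
    public void leaseLost(LeaseLostInput leaseLostInput) {
        // Another worker took the lease; checkpointing is not allowed here.
    }

    @Override
    public void shardEnded(ShardEndedInput shardEndedInput) {
        try {
            // Required checkpoint so child shards can be processed after a reshard.
            shardEndedInput.checkpointer().checkpoint();
        } catch (InvalidStateException | ShutdownException e) {
            // Nothing more we can do here.
        }
    }

    @Override
    public void shutdownRequested(ShutdownRequestedInput shutdownRequestedInput) {
        try {
            // Best-effort checkpoint on graceful shutdown.
            shutdownRequestedInput.checkpointer().checkpoint();
        } catch (InvalidStateException | ShutdownException e) {
            // Nothing more we can do here.
        }
    }

    private void handle(String payload) {
        // Application-specific processing goes here.
    }
}

// Matching factory handed to the ConfigsBuilder above.
class MyShardRecordProcessorFactory implements ShardRecordProcessorFactory {
    @Override
    public ShardRecordProcessor shardRecordProcessor() {
        return new MyShardRecordProcessor();
    }
}
```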
I can provide other bits if needed.
Thank you very much in advance for your help.