Open dandesousa opened 7 years ago
Low write rates make it more likely that the sockets the KPL is holding will be reset, which it only finds out when it goes to make a request on the socket. Are you able to try the newer 0.12.x version of the KPL? It uses the C++ AWS SDK, which uses curl. curl may handle connection resets better than the 0.10.x socket management.
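To illustrate the mechanism described above, here is a minimal, self-contained sketch. It does not use the KPL or Kinesis at all: a local echo server (an assumption standing in for whatever closes idle connections on the service side) drops a connection after a short idle period, and the client only notices on its next request. The names `serve_with_idle_timeout`, `request` and `demo` are made up for this example.

```python
import socket
import threading
import time


def serve_with_idle_timeout(listener, idle_timeout):
    """Accept one client and echo its data, closing the connection once it sits idle."""
    conn, _ = listener.accept()
    conn.settimeout(idle_timeout)
    try:
        while True:
            data = conn.recv(1024)
            if not data:
                break
            conn.sendall(data)
    except socket.timeout:
        pass  # idle for too long: fall through and close the connection
    finally:
        conn.close()


def request(sock, payload):
    """Send one payload and return the reply; b"" means the peer had already closed."""
    try:
        sock.sendall(payload)
        return sock.recv(1024)
    except OSError:
        # A reset or broken pipe also means the connection is no longer usable.
        return b""


def demo(idle_timeout=0.2):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free local port
    listener.listen(1)
    server = threading.Thread(
        target=serve_with_idle_timeout, args=(listener, idle_timeout)
    )
    server.start()
    client = socket.create_connection(listener.getsockname())
    try:
        first = request(client, b"record-1")   # connection is fresh
        time.sleep(idle_timeout * 3)           # stay quiet past the idle timeout
        second = request(client, b"record-2")  # only now does the client find out
    finally:
        client.close()
        server.join()
        listener.close()
    return first, second


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is only that the client has no way to learn about the close until it touches the socket again, so a producer that writes rarely is more likely to hit a dead connection on each write. A client that reconnects and retries on such a failure would hide it from the caller.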
We've noticed a lot of noise in production lately. It looks like a special case of an issue that others may have run into as well (https://github.com/awslabs/amazon-kinesis-producer/issues/17).
There appears to be some odd behavior in the KPL when dealing with steady, low-traffic streams, with the KPL constantly emitting the following error:
There are no other errors in the log, just the above over and over.
To make this behavior easier to see, I ran the KPL with aggregation disabled, so that one record sent to the KPL = one record in CloudWatch metrics. The stream has only one shard, and all records go to that shard.
I start by pushing a single record (~1 KB in size) every 10 seconds, producing approximately 6 records per minute. When doing so, the KPL handles it fine: we see the records in the metrics and there are no errors.

Then, I send 20 records each minute. When doing so, the KPL steadily throws socket end-of-file errors from the Kinesis endpoint roughly every 10-15 seconds. This continues as long as the traffic remains steady in that range.

Then, if I increase the traffic to 60 rpm, the errors disappear entirely.

Finally I can taper down the requests and the behavior presents itself again (full timeline back to 20 rpm):
This behavior is present whether or not we aggregate records, and it is reproducible across multiple different streams, regardless of the number of shards.
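For anyone trying to reproduce this, the traffic pattern described above can be sketched roughly as follows. This is not the script I used; `send` is a hypothetical callable standing in for whatever hands a record to the KPL, the even spacing of records within each minute is an assumption, and the phase lengths are up to you.

```python
import time

PAYLOAD = b"x" * 1024         # roughly 1 KB per record
PHASES_RPM = [6, 20, 60, 20]  # records per minute: quiet, noisy, quiet, noisy again


def interval_seconds(records_per_minute):
    """Gap between records when they are spread evenly over a minute."""
    return 60.0 / records_per_minute


def run_phase(send, records_per_minute, duration_seconds, sleep=time.sleep):
    """Call send(PAYLOAD) at a steady rate for duration_seconds; return the record count."""
    gap = interval_seconds(records_per_minute)
    count = int(duration_seconds // gap)
    for _ in range(count):
        send(PAYLOAD)
        sleep(gap)
    return count


if __name__ == "__main__":
    # Dry run: collect records in a list and skip the real sleeps.
    for rpm in PHASES_RPM:
        sent = []
        run_phase(sent.append, rpm, 60, sleep=lambda seconds: None)
        print(rpm, "rpm:", len(sent), "records,", interval_seconds(rpm), "s apart")
```

With even spacing, the gap between records is 10 s at 6 rpm, 3 s at 20 rpm and 1 s at 60 rpm.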
No obvious data seems to be getting lost here. But it disrupts our ability to detect true errors from the KPL and generates a ton of noise.
I’ve iterated on every relevant-looking KPL setting (timeouts, record aggregation, etc.) to try to make this go away, to no avail.
KPL version 0.10.2 was used to do the test.
Any idea what is going on here?
Kinesis settings below: