Open · tayl opened this issue 1 year ago
Hi!
> It has an abundance of CPU and memory available.
Obvious, friendly call-out: just because a host has resources available does not mean those resources are accessible to the process (e.g., the JVM). Since the other consumers are reportedly fine, I'll assume this isn't a factor, but it might be worth a peek.
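If it helps, a quick way to see what the process itself actually has to work with (assuming a JVM-based consumer - adjust for whatever runtime you're actually on; this is a generic sketch, not anything from your codebase) is to log something like the following at startup:

```java
// Prints the resources the JVM can see, which may differ from what the host/task has.
public class JvmResourceCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("Processors visible to JVM: " + rt.availableProcessors());
        System.out.println("Max heap (bytes):          " + rt.maxMemory());
        System.out.println("Allocated heap (bytes):    " + rt.totalMemory());
        System.out.println("Free within allocated:     " + rt.freeMemory());
    }
}
```

If those numbers line up with what ECS reports, you can probably cross this off.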
Off-the-cuff suggestions before diving deep:

- Double-check utilization on the affected task itself, not just the host.
- Restart the affected consumer and see whether the iterator age recovers or settles back to the same value.
- Since ECS abstracts the hardware away, see whether a restart lands the task on a different machine and whether that matters.
- Try giving the affected task more CPU, and consider bumping the shard count.
Hi stair, thanks for getting back to me.
These consumers are running in dedicated ECS tasks with no other processes running, so at least according to the ECS-reported metrics they are under-utilized: ~10% CPU utilization and ~5% memory, with the min and max of each within a few points of the average.
We've restarted many times. While the consumers are down, data piles up in the Kinesis shards. When the consumers come back, every consumer except the affected one burns through its backlog quickly. The one stuck at 86m ms accumulates roughly 500k ms of additional iterator age during the downtime, then burns through that 500k ms and settles right back around 86m.
Again, these are ECS tasks, so the hardware is abstracted away, but I assume it's a different machine each time.
Is there a benefit to increasing the shard count if the shards are under-utilized and not showing any failures or throttling?
Thanks
To rule it out, we temporarily doubled the CPU allocated to the ECS tasks and saw no change. Again, the iterator age of the affected shard climbed above 86m during the downtime, then quickly settled back to 86m once the consumer was running again.
Sorry for the back-to-back-to-back posts, just throwing in more info. Another consumer/shard combo has jumped up to that 86.1m iterator age number and is stuck there. We've looked and see nothing in our application that should produce that number, or 1 day, or 24 hours. Additionally, our retention period is 3 days, not 1, so anything near a day's worth of ms of iterator age is unusual. The fact that two consumers that are decoupled (other than processing the same Kinesis stream) are suddenly doing this makes me think it's Kinesis-related?
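For completeness, this is roughly how we pull the per-shard number - sketched here with the AWS SDK for Java v2, with the stream name and shard id as placeholders (shard-level IteratorAgeMilliseconds assumes enhanced shard-level monitoring is enabled):

```java
import java.time.Duration;
import java.time.Instant;
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.GetMetricStatisticsRequest;
import software.amazon.awssdk.services.cloudwatch.model.GetMetricStatisticsResponse;
import software.amazon.awssdk.services.cloudwatch.model.Statistic;

public class ShardIteratorAge {
    public static void main(String[] args) {
        // Placeholders - substitute the real stream name and affected shard id.
        String streamName = "my-stream";
        String shardId = "shardId-000000000007";

        try (CloudWatchClient cw = CloudWatchClient.create()) {
            GetMetricStatisticsResponse resp = cw.getMetricStatistics(GetMetricStatisticsRequest.builder()
                    .namespace("AWS/Kinesis")
                    .metricName("IteratorAgeMilliseconds") // shard-level metric
                    .dimensions(
                            Dimension.builder().name("StreamName").value(streamName).build(),
                            Dimension.builder().name("ShardId").value(shardId).build())
                    .startTime(Instant.now().minus(Duration.ofHours(3)))
                    .endTime(Instant.now())
                    .period(300) // 5-minute buckets
                    .statistics(Statistic.MAXIMUM)
                    .build());

            resp.datapoints().forEach(dp ->
                    System.out.println(dp.timestamp() + "  max iterator age (ms): " + dp.maximum()));
        }
    }
}
```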
Hello, we have an instance of a Kinesis consumer in one of our customers' environments that is "stuck", but not in a typical way (typical to me, anyway). The consumer is requesting and processing records as it usually does, but the records coming back fell further and further behind until they were exactly one day behind. Our iterator age for this shard hovers right around 86,100,000 ms. If that were our data retention period it would make sense to me, but it isn't - this stream is set to 3 days. Additionally, the consumer is not burdened in any way; it has an abundance of CPU and memory available.
I think the key to solving this is that the consumer is requesting more records than it's getting back. If it were up to date, I'd understand this, as it would get only what is new and nothing more. However, that's not the case - it's as if the shard believes now() - 24 hours is real-time. Additionally, CloudWatch is showing no read-throughput-exceeded errors.
I've confirmed using the data viewer in the Kinesis console that there is data between now and 24 hours ago.
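For what it's worth, this is roughly the kind of manual check I mean - a sketch with the AWS SDK for Java v2 and placeholder stream/shard names, reading from ~24 hours ago on the affected shard and comparing how much comes back against the MillisBehindLatest the shard reports:

```java
import java.time.Duration;
import java.time.Instant;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.GetRecordsRequest;
import software.amazon.awssdk.services.kinesis.model.GetRecordsResponse;
import software.amazon.awssdk.services.kinesis.model.GetShardIteratorRequest;
import software.amazon.awssdk.services.kinesis.model.ShardIteratorType;

public class ShardTimeCheck {
    public static void main(String[] args) {
        // Placeholders - substitute the real stream name and the affected shard id.
        String streamName = "my-stream";
        String shardId = "shardId-000000000007";

        try (KinesisClient kinesis = KinesisClient.create()) {
            // Start reading from ~24 hours ago on the affected shard.
            String iterator = kinesis.getShardIterator(GetShardIteratorRequest.builder()
                    .streamName(streamName)
                    .shardId(shardId)
                    .shardIteratorType(ShardIteratorType.AT_TIMESTAMP)
                    .timestamp(Instant.now().minus(Duration.ofHours(24)))
                    .build())
                    .shardIterator();

            // A few GetRecords calls: compare how many records come back vs. the limit,
            // and how far behind the tip the shard says we are.
            for (int i = 0; i < 5 && iterator != null; i++) {
                GetRecordsResponse resp = kinesis.getRecords(GetRecordsRequest.builder()
                        .shardIterator(iterator)
                        .limit(1000)
                        .build());
                System.out.printf("records=%d  millisBehindLatest=%d%n",
                        resp.records().size(), resp.millisBehindLatest());
                iterator = resp.nextShardIterator();
            }
        }
    }
}
```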
That's a long-winded way of saying this consumer seems to have time traveled to 24 hours ago, and from that frame of reference it's processing data in real-time. The other 31 consumers running against this 32-shard stream are doing fine; it's just this one that is confused.
What could cause this? Any more info I can provide to help diagnose?