Open nuel77 opened 1 month ago
The above graph shows the uploader task in production. The service has many other components, but as you can see, latency improved after restarting the process; after a few days of running, it had become jittery and very high. We are fairly sure the amount of data we transmit doesn't need 15 s to upload on a 25 Gbps server. Hope this gives some additional data to debug with.
Also, if you look at the right end of the graph, you can still see some random spikes after the restart.
NOTE: The running average latency is around 100-200 ms; the graph shows the maximum recorded latencies.
Thank you for providing us with a detailed description of your observations. Since several factors may be involved, we might be shooting in the dark, but let us share the information we have to make sure we're aligned.
> we are concerned if cloning the aws_sdk_dynamodb::client for each upload task is causing any deadlock situation in the library code
Hmm, it's hard to imagine that simply cloning a DynamoDB Client could cause a deadlock, since cloning just increments the reference count of an Arc. 🤔
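For reference, here is a minimal sketch (not your actual code) of the clone-per-task pattern under discussion; the `list_tables` call is only a placeholder request:

```rust
use aws_sdk_dynamodb::Client;

// Spawn one task per upload, each holding its own clone of the client.
async fn spawn_uploads(client: Client) {
    let mut handles = Vec::new();
    for _ in 0..10 {
        // Cheap: the client is internally reference-counted, so `.clone()`
        // bumps an Arc refcount and shares the same underlying connection pool.
        let client = client.clone();
        handles.push(tokio::spawn(async move {
            // Placeholder; the real uploader would issue its write here.
            let _ = client.list_tables().send().await;
        }));
    }
    for handle in handles {
        let _ = handle.await;
    }
}
```

No locks are taken during the clone itself, which is why a deadlock from cloning alone would be surprising.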
Have you turned on tracing_subscriber in your application code to see if there are any logs around the time a latency spike occurred, potentially showing log output that you usually don't see when things are working?
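Roughly, the setup looks like this (a sketch; the target names in the fallback filter are assumptions to adjust for your crates, and `with_env_filter` requires the crate's `env-filter` feature):

```rust
use tracing_subscriber::EnvFilter;

fn main() {
    // Honor RUST_LOG if set; otherwise fall back to a filter that surfaces
    // the SDK's internal events (assumed target name `aws_sdk_dynamodb`).
    tracing_subscriber::fmt()
        .with_env_filter(
            EnvFilter::try_from_default_env()
                .unwrap_or_else(|_| EnvFilter::new("info,aws_sdk_dynamodb=debug")),
        )
        .init();

    // ... start the tokio runtime / application here ...
}
```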
We will check this and get back to you.
Additionally, tokio-console can be really helpful for debugging issues with blocking tasks because it can show you the idle vs. active time for each task.
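A sketch of the tokio-console setup, in case it's useful (this uses the `console-subscriber` crate, and the binary must be built with `RUSTFLAGS="--cfg tokio_unstable"`):

```rust
fn main() {
    // Registers a tracing subscriber that exports per-task telemetry;
    // by default the console server listens on 127.0.0.1:6669.
    console_subscriber::init();

    // ... start the tokio runtime / application here ...
}
```

You can then run the `tokio-console` CLI in another terminal to inspect busy vs. idle time for each task.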
Thanks for the suggestions @Velfi, we are adding these to our tests and will get back with our findings in a few days.
Describe the bug
We are experiencing random latency spikes (which go up to 10 s) in DynamoDB operations while using the AWS Rust SDK in our project. These spikes occur randomly and seem to increase in frequency the longer the process runs. A weird thing we observed is that the latency doesn't seem to correlate with load: for example, updating 100 items takes 50 ms, while updating 10 items might sometimes take 100+ ms. Despite our efforts to debug the issue, we have been unable to identify the cause, although we are concerned that cloning the aws_sdk_dynamodb::client for each upload task is causing a deadlock situation in the library code. It would be great to get some pointers here for us to work on.

Expected Behavior
Latency should be fairly consistent, increasing only with load.
Current Behavior
Random spikes in latency even for smaller loads.
Reproduction Steps
Latency graph from CloudWatch:
code snippet for uploader task: https://gist.github.com/nuel77/a48fee5172abdf47efeba14bdaaff3b7
Also, our client looks like this:
Possible Solution
No response
Additional Information/Context
No response
Version
Environment details (OS name and version, etc.)
Ubuntu
Logs
No response