QnJ1c2kNCg opened this issue 2 years ago
Related: https://github.com/tokio-rs/tokio/issues/4730
I'll take a look at this when I'm able (maybe next week, no promises 😄). In the meantime, you may have some success if you use tokio-console to see which task is blocking everything. (I'm leaving the needs-triage label on this so it doesn't get swept under the rug.)
I've been rather busy but I'm going to take a look at this issue this week.
I've had a bit of trouble reliably reproducing this from your example code. Sometimes it runs fine, and other times it gets stopped up after the first iteration of the sync loop, just as you're experiencing.
Based on tokio-console, it really does look to be that "one bad task can hold up the executor" issue that I posted in my earlier comment. I'm not sure what we can do about this other than wait for the tokio team to address the issue.
For anyone else that runs into an issue like this, try adding console_subscriber::init(); and running tokio-console to see if you're getting similar results.
I think it's important to track this issue, but I've removed the bug tag since there's really nothing we can do about this short of contributing a fix to tokio or changing the default executor in the SDK.
Firstly, thanks a ton for looking into this!
> Sometimes it runs fine and other times it gets stopped up after the first iteration of the sync loop

Same here, the repro isn't 100% consistent.
This is very interesting, thanks for the tokio issue link.
Any intuition as to why creating a new client, instead of cloning, seems to help? I was unable to reproduce while using two distinct clients. Using two clients is a decent enough mitigation for now.
I'll keep an eye on both these issues, thanks again.
> Any intuition as to why creating a new client, instead of cloning, seems to help? I was unable to reproduce while using two distinct clients. Using two clients is a decent enough mitigation for now.
The client is just using an Arc under the hood, so the clone ends up referencing the same hyper client and connection pool. Creating a new client results in that client having its own hyper client. How that plays into this particular scenario, I don't know enough to answer.
Any chance someone can run this under valgrind with --tool=helgrind? It should print out the deadlock if it's a traditional "incorrect mutex usage" deadlock.
I've been seeing this in ec2::DescribeVolumes calls...
Also seeing this in s3::PutObject
@jdisanti I can sort of reliably reproduce this (happens 1 out of 10 runs). Any pointer to debug this further?
(also hangs on ec2::DescribeInstances)
Setting operation timeouts (https://github.com/gyuho/aws-manager/commit/e4672f0f246b49569547793989bf9730460cc2f0) seems to mask this issue... I haven't encountered it for two days so far 👍🏼
I didn't see this above in the thread: does this occur on both multi-threaded and single-threaded runtimes? @Velfi do you have a full workspace I can go investigate?
Given that it seems to happen one run in 10, and that setting a low timeout makes it go away, a suspiciously round number like that suggests the problem is a broken connection in the pool that gets hit on round-robin/random selection.
> I didn't see this above in the thread: does this occur on both multi-threaded and single-threaded runtimes? @Velfi do you have a full workspace I can go investigate?

@rcoh I do not. I just took some test data and was working with that. I can't recall how single vs. multi-threading affected this.
Same issue when using s3::put_object in an async loop. In my case, it blocks every time, and the timeout solution did not work. Spawning a tokio task in every loop iteration solved my problem:
```rust
async fn executor(...) {
    ...
    while !shutdown_rx.has_changed().unwrap() {
        // https://github.com/awslabs/aws-sdk-rust/issues/611
        task::spawn(receive_image(
            Arc::clone(&name),
            Arc::clone(&configs),
            Arc::clone(&aws_client),
            Arc::clone(&encode_rx),
        ))
        .await
        .unwrap();
    }
    ...
}
```
Describe the bug
I've encountered a problem that I cannot explain, where a deadlock happens. See reproduction steps for more details.
An interesting note is that if I create a new DynamoDB client, instead of cloning, I do not see the issue.

Expected Behavior
I would have expected to see something like:
Current Behavior
I'm seeing a deadlock:
Reproduction Steps
Possible Solution
No response
Additional Information/Context
No response
Version