awslabs / aws-sdk-rust

AWS SDK for the Rust Programming Language
https://awslabs.github.io/aws-sdk-rust/
Apache License 2.0

Unexplained blocking behavior with DynamoDB client #611

Open QnJ1c2kNCg opened 2 years ago

QnJ1c2kNCg commented 2 years ago

Describe the bug

I've encountered a problem that I cannot explain: a deadlock occurs. See the reproduction steps for more details.

An interesting note is that if I create a new DynamoDB client instead of cloning the existing one, I do not see the issue.

Expected Behavior

I would have expected to see something like:

Hello, world!
Sync background scan size: 7528
Async background scan size: 7528
Sync background scan size: 7528
Sync background scan size: 7528
Sync background scan size: 7528
Sync background scan size: 7528
Sync background scan size: 7528
Sync background scan size: 7528
...

Current Behavior

I'm seeing a deadlock:

Hello, world!
Sync background scan size: 7528
Async background scan size: 7528

^C

Reproduction Steps

async fn async_background_scan(client: aws_sdk_dynamodb::Client) {
    let scan = client.scan().table_name("Buckets").send().await.unwrap();
    println!("Async background scan size: {}", scan.count());
    // Busy-loop that never yields back to the runtime.
    loop {}
}

fn blocking_background_scan(client: aws_sdk_dynamodb::Client) {
    loop {
        // Drive the async scan to completion from a blocking thread.
        let scan = futures::executor::block_on(client.scan().table_name("Buckets").send()).unwrap();
        println!("Sync background scan size: {}", scan.count());
    }
}

#[tokio::main]
async fn main() {
    let config = aws_config::load_from_env().await;
    let ddb_client = aws_sdk_dynamodb::Client::new(&config);

    // Both tasks get a clone of the same client, so they share the same
    // underlying connection pool.
    tokio::spawn(async_background_scan(ddb_client.clone()));
    tokio::task::spawn_blocking(move || blocking_background_scan(ddb_client));

    println!("Hello, world!");
    loop {}
}

Possible Solution

No response

Additional Information/Context

No response

Version

├── aws-config v0.47.0
│   ├── aws-http v0.47.0
│   │   ├── aws-smithy-http v0.47.0
│   │   │   ├── aws-smithy-types v0.47.0
│   │   ├── aws-smithy-types v0.47.0 (*)
│   │   ├── aws-types v0.47.0
│   │   │   ├── aws-smithy-async v0.47.0
│   │   │   ├── aws-smithy-client v0.47.0
│   │   │   │   ├── aws-smithy-async v0.47.0 (*)
│   │   │   │   ├── aws-smithy-http v0.47.0 (*)
│   │   │   │   ├── aws-smithy-http-tower v0.47.0
│   │   │   │   │   ├── aws-smithy-http v0.47.0 (*)
│   │   │   │   ├── aws-smithy-types v0.47.0 (*)
│   │   │   ├── aws-smithy-http v0.47.0 (*)
│   │   │   ├── aws-smithy-types v0.47.0 (*)
│   ├── aws-sdk-sso v0.17.0
│   │   ├── aws-endpoint v0.47.0
│   │   │   ├── aws-smithy-http v0.47.0 (*)
│   │   │   ├── aws-types v0.47.0 (*)
│   │   ├── aws-http v0.47.0 (*)
│   │   ├── aws-sig-auth v0.47.0
│   │   │   ├── aws-sigv4 v0.47.0
│   │   │   │   ├── aws-smithy-http v0.47.0 (*)
│   │   │   ├── aws-smithy-http v0.47.0 (*)
│   │   │   ├── aws-types v0.47.0 (*)
│   │   ├── aws-smithy-async v0.47.0 (*)
│   │   ├── aws-smithy-client v0.47.0 (*)
│   │   ├── aws-smithy-http v0.47.0 (*)
│   │   ├── aws-smithy-http-tower v0.47.0 (*)
│   │   ├── aws-smithy-json v0.47.0
│   │   │   └── aws-smithy-types v0.47.0 (*)
│   │   ├── aws-smithy-types v0.47.0 (*)
│   │   ├── aws-types v0.47.0 (*)
│   ├── aws-sdk-sts v0.17.0
│   │   ├── aws-endpoint v0.47.0 (*)
│   │   ├── aws-http v0.47.0 (*)
│   │   ├── aws-sig-auth v0.47.0 (*)
│   │   ├── aws-smithy-async v0.47.0 (*)
│   │   ├── aws-smithy-client v0.47.0 (*)
│   │   ├── aws-smithy-http v0.47.0 (*)
│   │   ├── aws-smithy-http-tower v0.47.0 (*)
│   │   ├── aws-smithy-query v0.47.0
│   │   │   ├── aws-smithy-types v0.47.0 (*)
│   │   ├── aws-smithy-types v0.47.0 (*)
│   │   ├── aws-smithy-xml v0.47.0
│   │   ├── aws-types v0.47.0 (*)
│   ├── aws-smithy-async v0.47.0 (*)
│   ├── aws-smithy-client v0.47.0 (*)
│   ├── aws-smithy-http v0.47.0 (*)
│   ├── aws-smithy-http-tower v0.47.0 (*)
│   ├── aws-smithy-json v0.47.0 (*)
│   ├── aws-smithy-types v0.47.0 (*)
│   ├── aws-types v0.47.0 (*)
├── aws-sdk-dynamodb v0.17.0
│   ├── aws-endpoint v0.47.0 (*)
│   ├── aws-http v0.47.0 (*)
│   ├── aws-sig-auth v0.47.0 (*)
│   ├── aws-smithy-async v0.47.0 (*)
│   ├── aws-smithy-client v0.47.0 (*)
│   ├── aws-smithy-http v0.47.0 (*)
│   ├── aws-smithy-http-tower v0.47.0 (*)
│   ├── aws-smithy-json v0.47.0 (*)
│   ├── aws-smithy-types v0.47.0 (*)
│   ├── aws-types v0.47.0 (*)


Environment details (OS name and version, etc.)

x86_64 GNU/Linux

Logs

No response
Velfi commented 2 years ago

Related: https://github.com/tokio-rs/tokio/issues/4730

I'll take a look at this when I'm able (maybe next week, no promises 😄). In the meantime, you may have some success using tokio-console to see which task is blocking everything.

(I'm leaving the needs-triage label on this so it doesn't get swept under the rug.)

Velfi commented 2 years ago

I've been rather busy but I'm going to take a look at this issue this week.

Velfi commented 2 years ago

I've had a bit of trouble reliably reproducing this from your example code. Sometimes it runs fine and other times it gets stopped up after the first iteration of the sync loop just as you're experiencing.

[Screenshot: tokio-console output, captured 2022-09-07]

Based on tokio-console, it really does look like the "one bad task can hold up the executor" issue I linked in my earlier comment.

I'm not sure what we can do about this other than wait for the tokio team to address the issue.

For anyone else who runs into an issue like this, try adding console_subscriber::init(); and running tokio-console to see if you're getting similar results.
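Roughly, that setup looks like the sketch below (assuming the console-subscriber crate is added as a dependency and the binary is built with the tokio_unstable cfg that tokio-console requires; the table name is just a placeholder):

#[tokio::main]
async fn main() {
    // Registers the tokio-console instrumentation layer; the tokio-console
    // CLI can then attach to this process and show per-task state.
    console_subscriber::init();

    let config = aws_config::load_from_env().await;
    let client = aws_sdk_dynamodb::Client::new(&config);

    // ... run the workload you want to observe ...
    let scan = client.scan().table_name("Buckets").send().await.unwrap();
    println!("scan count: {}", scan.count());
}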

I think it's important to track this issue, but I've removed the bug tag since there's really nothing we can do about it short of contributing a fix to tokio or changing the default executor in the SDK.

QnJ1c2kNCg commented 2 years ago

Firstly, thanks a ton for looking into this!

Sometimes it runs fine and other times it gets stopped up after the first iteration of the sync loop

Same here, the repro isn't 100% consistent.

This is very interesting, thanks for the tokio issue link.

Any intuition as to why creating a new client instead of cloning seems to help? I was unable to reproduce the issue while using two distinct clients. Using two clients is a decent enough mitigation for now.

I'll keep an eye on both these issues, thanks again.

jdisanti commented 2 years ago

Any intuition as to why creating a new client instead of cloneing seems to help? I was unable to reproduce while using two distinct clients. Using two clients is a decent enough mitigation for now.

The client is just using an Arc under the hood, so the clone ends up referencing the same hyper client and connection pool. Creating a new client will result in that client having its own hyper client. As for how that plays into this particular scenario, I don't know enough to say.
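Concretely, the two-distinct-clients mitigation looks something like the sketch below (reusing the functions from the repro above; a rough illustration, not something I've tested against this exact scenario):

// Same main as the repro, but with two separately constructed clients
// instead of one clone; each Client::new call builds its own HTTP client.
#[tokio::main]
async fn main() {
    let config = aws_config::load_from_env().await;
    let client_for_async = aws_sdk_dynamodb::Client::new(&config);
    let client_for_blocking = aws_sdk_dynamodb::Client::new(&config);

    tokio::spawn(async_background_scan(client_for_async));
    tokio::task::spawn_blocking(move || blocking_background_scan(client_for_blocking));

    println!("Hello, world!");
    loop {}
}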

JonathanHenson commented 1 year ago

Any chance someone can run this under valgrind with --tool=helgrind? It should print out the deadlock if it's a traditional "incorrect mutex usage" deadlock.

gyuho commented 1 year ago

I've been seeing this in ec2::DescribeVolumes calls...

gyuho commented 1 year ago

Also seeing this in s3::PutObject

gyuho commented 1 year ago

@jdisanti I can sort of reliably reproduce this (happens 1 out of 10 runs). Any pointer to debug this further?

(also hangs on ec2::DescribeInstances)

gyuho commented 1 year ago

Setting operation timeouts (https://github.com/gyuho/aws-manager/commit/e4672f0f246b49569547793989bf9730460cc2f0) seems to mask this issue... I haven't encountered it for two days so far 👍🏼
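For reference, a rough sketch of what that looks like (builder method names have shifted between SDK releases, so this follows more recent aws-config versions rather than 0.47 exactly):

use std::time::Duration;

#[tokio::main]
async fn main() {
    // Cap both the whole operation (including retries) and each individual
    // attempt, so a wedged connection surfaces as a timeout error instead
    // of hanging forever.
    let timeouts = aws_config::timeout::TimeoutConfig::builder()
        .operation_timeout(Duration::from_secs(30))
        .operation_attempt_timeout(Duration::from_secs(10))
        .build();

    let config = aws_config::from_env()
        .timeout_config(timeouts)
        .load()
        .await;

    let client = aws_sdk_dynamodb::Client::new(&config);
    let _ = client.scan().table_name("Buckets").send().await;
}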

rcoh commented 1 year ago

I didn't see this above in the thread—does this occur on both multi threaded and single threaded runtimes? @Velfi do you have a full workspace I can go investigate?

stevedw commented 1 year ago

Given that it happens about one time in ten and a low timeout makes it go away, a suspiciously round number like that suggests the problem is a broken connection in the pool that gets hit on round-robin/random selection.

Velfi commented 1 year ago

I didn't see this above in the thread—does this occur on both multi threaded and single threaded runtimes? @Velfi do you have a full workspace I can go investigate?

@rcoh I do not. I just took some test data and was working with that. I can't recall how single vs. multi-threading affected this.

zlfn commented 3 months ago

Same issue when using s3::put_object in an async loop. In my case it blocks every time, and the timeout solution did not work. Spawning a tokio::task on every iteration solved my problem.

async fn executor(...) {

    ...

    while !shutdown_rx.has_changed().unwrap() {
        // https://github.com/awslabs/aws-sdk-rust/issues/611
        // Spawning the work as its own task (and awaiting the handle) is the
        // workaround that avoided the hang here.
        task::spawn(
            receive_image(Arc::clone(&name), Arc::clone(&configs), Arc::clone(&aws_client), Arc::clone(&encode_rx))
        ).await.unwrap();
    }

    ...

}
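A more self-contained sketch of the same pattern (the S3 client usage and ByteStream path follow recent SDK releases and are illustrative, not my actual code):

use std::sync::Arc;

// Run a single PutObject in its own tokio task so the SDK future is driven
// by a runtime worker instead of inline in the caller's future.
async fn upload_in_own_task(
    client: Arc<aws_sdk_s3::Client>,
    bucket: String,
    key: String,
    body: Vec<u8>,
) -> Result<(), aws_sdk_s3::Error> {
    let handle = tokio::task::spawn(async move {
        client
            .put_object()
            .bucket(bucket)
            .key(key)
            .body(aws_sdk_s3::primitives::ByteStream::from(body))
            .send()
            .await
    });
    // expect() surfaces a task panic (JoinError); `?` converts the SDK error
    // into the service-level error type.
    handle.await.expect("upload task panicked")?;
    Ok(())
}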