awslabs / aws-sdk-rust

AWS SDK for the Rust Programming Language
https://awslabs.github.io/aws-sdk-rust/
Apache License 2.0

DNS resolution error when fetching more than 200 objects concurrently from Amazon S3 #1136

Closed: klefevre closed this issue 1 month ago

klefevre commented 2 months ago

Describe the bug

Hello,

I'm facing an issue when I attempt to fetch a large number of objects concurrently (more than 200 in my case) from Amazon S3. The error I'm getting is: dns error: failed to lookup address information: nodename nor servname provided, or not known.

Expected Behavior

I should be able to fetch thousands of objects concurrently, up to the limits of my system.

Current Behavior

I get the io error "dns error: failed to lookup address information: nodename nor servname provided, or not known" when I try to fetch more than 200 objects concurrently.

Reproduction Steps

To replicate the problem, I've created a minimal project that you can find here:

https://github.com/klefevre/aws-sdk-rust-concurrency-issue

Note that reproducing the problem requires a valid S3 bucket containing a number of objects.

Possible Solution

No response

Additional Information/Context

No response

Version

aws-sdk-rust-concurrent-issue v0.1.0 (/tmp/aws-sdk-rust-concurrent-issue)
├── aws-config v1.2.1
│   ├── aws-credential-types v1.2.0
│   │   ├── aws-smithy-async v1.2.1
│   │   ├── aws-smithy-runtime-api v1.4.0
│   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   ├── aws-smithy-types v1.1.8
│   │   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-runtime v1.2.0
│   │   ├── aws-credential-types v1.2.0 (*)
│   │   ├── aws-sigv4 v1.2.1
│   │   │   ├── aws-credential-types v1.2.0 (*)
│   │   │   ├── aws-smithy-eventstream v0.60.4
│   │   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   │   ├── aws-smithy-http v0.60.8
│   │   │   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.4.0 (*)
│   │   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   │   ├── aws-smithy-runtime-api v1.4.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-async v1.2.1 (*)
│   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   ├── aws-smithy-http v0.60.8 (*)
│   │   ├── aws-smithy-runtime-api v1.4.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-types v1.2.0
│   │   │   ├── aws-credential-types v1.2.0 (*)
│   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   ├── aws-smithy-runtime-api v1.4.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-sdk-sso v1.21.0
│   │   ├── aws-credential-types v1.2.0 (*)
│   │   ├── aws-runtime v1.2.0 (*)
│   │   ├── aws-smithy-async v1.2.1 (*)
│   │   ├── aws-smithy-http v0.60.8 (*)
│   │   ├── aws-smithy-json v0.60.7
│   │   │   └── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-runtime v1.3.1
│   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   ├── aws-smithy-http v0.60.8 (*)
│   │   │   ├── aws-smithy-runtime-api v1.4.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-runtime-api v1.4.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-types v1.2.0 (*)
│   ├── aws-sdk-ssooidc v1.21.0
│   │   ├── aws-credential-types v1.2.0 (*)
│   │   ├── aws-runtime v1.2.0 (*)
│   │   ├── aws-smithy-async v1.2.1 (*)
│   │   ├── aws-smithy-http v0.60.8 (*)
│   │   ├── aws-smithy-json v0.60.7 (*)
│   │   ├── aws-smithy-runtime v1.3.1 (*)
│   │   ├── aws-smithy-runtime-api v1.4.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-types v1.2.0 (*)
│   ├── aws-sdk-sts v1.21.0
│   │   ├── aws-credential-types v1.2.0 (*)
│   │   ├── aws-runtime v1.2.0 (*)
│   │   ├── aws-smithy-async v1.2.1 (*)
│   │   ├── aws-smithy-http v0.60.8 (*)
│   │   ├── aws-smithy-json v0.60.7 (*)
│   │   ├── aws-smithy-query v0.60.7
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-runtime v1.3.1 (*)
│   │   ├── aws-smithy-runtime-api v1.4.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-xml v0.60.8
│   │   ├── aws-types v1.2.0 (*)
│   ├── aws-smithy-async v1.2.1 (*)
│   ├── aws-smithy-http v0.60.8 (*)
│   ├── aws-smithy-json v0.60.7 (*)
│   ├── aws-smithy-runtime v1.3.1 (*)
│   ├── aws-smithy-runtime-api v1.4.0 (*)
│   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-types v1.2.0 (*)
├── aws-sdk-s3 v1.24.0
│   ├── aws-credential-types v1.2.0 (*)
│   ├── aws-runtime v1.2.0 (*)
│   ├── aws-sigv4 v1.2.1 (*)
│   ├── aws-smithy-async v1.2.1 (*)
│   ├── aws-smithy-checksums v0.60.7
│   │   ├── aws-smithy-http v0.60.8 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-smithy-eventstream v0.60.4 (*)
│   ├── aws-smithy-http v0.60.8 (*)
│   ├── aws-smithy-json v0.60.7 (*)
│   ├── aws-smithy-runtime v1.3.1 (*)
│   ├── aws-smithy-runtime-api v1.4.0 (*)
│   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-smithy-xml v0.60.8 (*)
│   ├── aws-types v1.2.0 (*)

Environment details (OS name and version, etc.)

OS: macOS 14.1.2 23B92 arm64, Kernel: 23.1.0

Logs

No response

ysaito1001 commented 2 months ago

Hi @klefevre, thank you for reporting this. We've been able to reproduce the issue on our end. Will look into this further.

ysaito1001 commented 2 months ago

FYI, while we've reproduced it, it is not consistently reproducible. For instance, when I run the main function in aws-sdk-rust-concurrency-issue four times with const CONCURRENCY_LIMIT: usize = 200 and const RANGE: Range<u32> = 1..2000, I get:

1st run (success)
➜  aws-sdk-rust-concurrency-issue git:(main) ✗ cargo r        
    Finished dev [unoptimized + debuginfo] target(s) in 3.40s
     Running `target/debug/aws-sdk-rust-concurrent-issue`

2nd run (connection timeout)
➜  aws-sdk-rust-concurrency-issue git:(main) ✗ cargo r        
    Finished dev [unoptimized + debuginfo] target(s) in 0.10s
     Running `target/debug/aws-sdk-rust-concurrent-issue`
Error: Failed to collect objects

Caused by:
    0: Failed to fetch object 1
    1: dispatch failure
    2: timeout
    3: error trying to connect: HTTP connect timeout occurred after 3.1s
    4: HTTP connect timeout occurred after 3.1s
    5: timed out

3rd run (success)
➜  aws-sdk-rust-concurrency-issue git:(main) ✗ cargo r
    Finished dev [unoptimized + debuginfo] target(s) in 0.10s
     Running `target/debug/aws-sdk-rust-concurrent-issue`

4th run (dns error)
➜  aws-sdk-rust-concurrency-issue git:(main) ✗ cargo r
    Finished dev [unoptimized + debuginfo] target(s) in 0.09s
     Running `target/debug/aws-sdk-rust-concurrent-issue`
Error: Failed to collect objects

Caused by:
    0: Failed to fetch object 281
    1: dispatch failure
    2: io error
    3: error trying to connect: dns error: failed to lookup address information: nodename nor servname provided, or not known

This seems to indicate that we're pushing the underlying resources to a limit where they may or may not behave reliably. Could you explain what makes you use 200 (or more) concurrent GetObject?
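
For readers following along, here is a minimal sketch of what a concurrency limit such as CONCURRENCY_LIMIT means in practice: at most `limit` jobs are in flight at any moment, however many items are queued. This is not the code from the reproduction project. It uses OS threads from the standard library rather than an async runtime so that it runs without extra crates, and `run_bounded` is a hypothetical helper name.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Runs `job` for every item with at most `limit` jobs in flight at once.
/// Results are returned in input order.
fn run_bounded<T, R, F>(items: Vec<T>, limit: usize, job: F) -> Vec<R>
where
    T: Send + 'static,
    R: Send + 'static,
    F: Fn(T) -> R + Send + Sync + 'static,
{
    assert!(limit > 0, "limit must be at least 1");
    let total = items.len();
    // Shared queue of (index, item) pairs that the workers drain.
    let queue = Arc::new(Mutex::new(items.into_iter().enumerate()));
    let job = Arc::new(job);
    let (tx, rx) = mpsc::channel();

    // Never start more workers than there are items.
    let workers: Vec<_> = (0..limit.min(total))
        .map(|_| {
            let queue = Arc::clone(&queue);
            let job = Arc::clone(&job);
            let tx = tx.clone();
            thread::spawn(move || loop {
                // Hold the lock only long enough to take the next item.
                let next = queue.lock().unwrap().next();
                match next {
                    Some((index, item)) => tx.send((index, (*job)(item))).unwrap(),
                    None => break,
                }
            })
        })
        .collect();
    // The receiver loop ends once every worker has dropped its sender.
    drop(tx);

    let mut slots: Vec<Option<R>> = (0..total).map(|_| None).collect();
    for (index, result) in rx {
        slots[index] = Some(result);
    }
    for worker in workers {
        worker.join().unwrap();
    }
    slots.into_iter().map(|r| r.unwrap()).collect()
}
```

In a real S3 client each job would hold at least one socket while it runs, so `limit` is roughly the number of file descriptors the fetches can occupy at the same time.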

brainstorm commented 1 month ago

> Could you explain what makes you use 200 (or more) concurrent GetObject

Chiming in as well since this helps my current use case. We're trying to migrate millions of objects as fast as we can from a foreign S3 bucket (on a third party's AWS account), some of those objects being bigger than 800GB, so we'll be leveraging this blogpost to max out the number of concurrent connections too, not only the number of concurrent S3 Batch Operations objects.

OTOH, it seems that to implement the approach from the aforementioned AWS blogpost in Rust we'll first be limited by this other issue: https://github.com/awslabs/aws-sdk-rust/issues/968. We'd need these parameters to be configurable:

max_concurrency: 940, max_retries: 100, max_pool_connections: 940 and multipart_chunksize: 16777216.

klefevre commented 1 month ago

I found a resolution to the problem. I was actually expecting the SDK to return only two kinds of errors in this scenario: network issues with S3 itself or a failure to open a file descriptor due to a limit reached on my hardware.

While writing this response, I checked the maximum number of file descriptors I could open with the command ulimit -n, and it turns out to be 256 by default (at least on macOS). After removing this limit with ulimit -n unlimited, I no longer encounter errors. 🙌
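
The same check can be made from Rust before picking a concurrency level. The following is a minimal sketch, assuming a Unix-like system with `sh` on the PATH. The helper names and the `reserved` headroom (descriptors left free for stdio, DNS lookups, and other files) are hypothetical, and how many descriptors a single request really needs is not something this thread establishes.

```rust
use std::io;
use std::process::Command;

/// Returns the soft limit on open file descriptors as reported by
/// `ulimit -n`, or `None` when the shell reports "unlimited".
fn soft_fd_limit() -> io::Result<Option<u64>> {
    // A child shell inherits this process's resource limits.
    let output = Command::new("sh").arg("-c").arg("ulimit -n").output()?;
    let text = String::from_utf8_lossy(&output.stdout);
    parse_ulimit(text.trim())
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "unexpected ulimit output"))
}

/// Parses `ulimit -n` output: a number, or the word "unlimited".
/// The outer `None` means the text was neither.
fn parse_ulimit(text: &str) -> Option<Option<u64>> {
    if text == "unlimited" {
        Some(None)
    } else {
        text.parse::<u64>().ok().map(Some)
    }
}

/// Caps the requested concurrency so that `reserved` descriptors stay free.
/// Always allows at least one job.
fn cap_concurrency(requested: usize, fd_limit: Option<u64>, reserved: u64) -> usize {
    match fd_limit {
        None => requested,
        Some(limit) => {
            let budget = limit.saturating_sub(reserved).max(1);
            // The result never exceeds `requested`, so it fits in usize.
            (requested as u64).min(budget) as usize
        }
    }
}
```

With the macOS default of 256 and a headroom of 16, `cap_concurrency(2000, Some(256), 16)` yields 240. Raising the limit with ulimit -n, as described above, remains the way to go beyond that.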

I'm closing this issue because the SDK behaves correctly. However, it would be nice if the SDK surfaced the root cause of this DNS error to make troubleshooting easier in the future.
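
Until something like that exists, an application can attach its own hint. The sketch below walks an error's `source()` chain, which is how "Caused by" listings like the ones earlier in this thread are typically produced, and suggests checking ulimit -n when a layer looks like the failures reported here. `Layer` is a hypothetical stand-in for the SDK's error types, and matching on message text is a heuristic assumed for illustration, not an SDK feature.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical wrapper error standing in for layers such as
/// "dispatch failure" or "io error".
#[derive(Debug)]
struct Layer {
    message: &'static str,
    source: Option<Box<dyn Error + 'static>>,
}

impl fmt::Display for Layer {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.message)
    }
}

impl Error for Layer {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        self.source.as_deref()
    }
}

/// Collects the message of every error in the chain, outermost first.
fn chain_messages(err: &(dyn Error + 'static)) -> Vec<String> {
    let mut messages = Vec::new();
    let mut current: Option<&(dyn Error + 'static)> = Some(err);
    while let Some(e) = current {
        messages.push(e.to_string());
        current = e.source();
    }
    messages
}

/// Heuristic: returns a troubleshooting hint when any layer mentions a DNS
/// failure or descriptor exhaustion. The matched strings are assumptions
/// based on the messages quoted in this thread.
fn fd_hint(err: &(dyn Error + 'static)) -> Option<&'static str> {
    let suspicious = chain_messages(err)
        .iter()
        .any(|m| m.contains("dns error") || m.contains("Too many open files"));
    if suspicious {
        Some("possible file descriptor exhaustion; check `ulimit -n`")
    } else {
        None
    }
}
```

Note that in the chains shown above the innermost entry is the DNS failure itself, so walking to the root alone would not reveal the descriptor limit. That is why the sketch adds a hint rather than only printing the last source.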
