Closed klefevre closed 1 month ago
Hi @klefevre, thank you for reporting this. We've been able to reproduce the issue on our end. Will look into this further.
FYI, while we've reproduced it, it does not seem to be consistently reproducible. For instance, when I run the main function in aws-sdk-rust-concurrency-issue 4 times, with const CONCURRENCY_LIMIT: usize = 200 and const RANGE: Range<u32> = 1..2000, I get:
1st run (success)
➜ aws-sdk-rust-concurrency-issue git:(main) ✗ cargo r
Finished dev [unoptimized + debuginfo] target(s) in 3.40s
Running `target/debug/aws-sdk-rust-concurrent-issue`
2nd run (connection timeout)
➜ aws-sdk-rust-concurrency-issue git:(main) ✗ cargo r
Finished dev [unoptimized + debuginfo] target(s) in 0.10s
Running `target/debug/aws-sdk-rust-concurrent-issue`
Error: Failed to collect objects
Caused by:
0: Failed to fetch object 1
1: dispatch failure
2: timeout
3: error trying to connect: HTTP connect timeout occurred after 3.1s
4: HTTP connect timeout occurred after 3.1s
5: timed out
3rd run (success)
➜ aws-sdk-rust-concurrency-issue git:(main) ✗ cargo r
Finished dev [unoptimized + debuginfo] target(s) in 0.10s
Running `target/debug/aws-sdk-rust-concurrent-issue`
4th run (dns error)
➜ aws-sdk-rust-concurrency-issue git:(main) ✗ cargo r
Finished dev [unoptimized + debuginfo] target(s) in 0.09s
Running `target/debug/aws-sdk-rust-concurrent-issue`
Error: Failed to collect objects
Caused by:
0: Failed to fetch object 281
1: dispatch failure
2: io error
3: error trying to connect: dns error: failed to lookup address information: nodename nor servname provided, or not known
This seems to indicate that we're pushing the underlying resources to a limit where they may or may not behave reliably. Could you explain what makes you use 200 (or more) concurrent GetObject requests?
> Could you explain what makes you use 200 (or more) concurrent GetObject
Chiming in as well, since this is relevant to my current use case. We're trying to migrate millions of objects as fast as we can from a foreign S3 bucket (in a third party's AWS account), some of them larger than 800 GB. We'll be following this blog post to maximize the number of concurrent connections too, not only the number of concurrent S3 Batch Operations objects.
OTOH, it seems that to implement the approach from the aforementioned AWS blog post in Rust, we'll first be limited by this other issue: https://github.com/awslabs/aws-sdk-rust/issues/968. We'd need the following parameters to be configurable:
max_concurrency: 940, max_retries: 100, max_pool_connections: 940 and multipart_chunksize: 16777216.
I found a resolution to the problem. I was actually expecting the SDK to return only two kinds of errors in this scenario: network issues with S3 itself or a failure to open a file descriptor due to a limit reached on my hardware.
While writing this response, I checked the maximum number of file descriptors I could open with the command ulimit -n, and it turns out to be 256 by default (at least on macOS). After removing this limit with ulimit -n unlimited, I no longer encounter errors. 🙌
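The ulimit -n check can also be done from Rust before starting the transfers. What follows is a minimal sketch and not part of the SDK or the repro project: it asks a POSIX shell for the soft limit and derives a concurrency cap from it. The names parse_ulimit, concurrency_cap and soft_fd_limit, the headroom of 64 descriptors, and the estimate of one descriptor per in-flight request are all assumptions made for illustration.

```rust
use std::process::Command;

/// Parse the output of `ulimit -n`.
/// Returns `None` when the output is not a number (for example "unlimited").
fn parse_ulimit(output: &str) -> Option<u64> {
    output.trim().parse::<u64>().ok()
}

/// Derive a concurrency cap from the soft file descriptor limit.
/// `reserved` is headroom for stdio, DNS lookups, open files and so on.
/// Assumption: roughly one descriptor per in-flight request.
/// `None` means no numeric limit could be determined, so `requested` is kept.
fn concurrency_cap(soft_limit: Option<u64>, reserved: u64, requested: usize) -> usize {
    match soft_limit {
        None => requested,
        Some(limit) => {
            // Always allow at least one request, even with a tiny limit.
            let budget = limit.saturating_sub(reserved).max(1);
            // The result never exceeds `requested`, so it fits in usize.
            (requested as u64).min(budget) as usize
        }
    }
}

/// Read the soft limit by running `ulimit -n` in a POSIX shell (Unix only).
fn soft_fd_limit() -> Option<u64> {
    let out = Command::new("sh").arg("-c").arg("ulimit -n").output().ok()?;
    parse_ulimit(&String::from_utf8_lossy(&out.stdout))
}

fn main() {
    let limit = soft_fd_limit();
    let cap = concurrency_cap(limit, 64, 200);
    println!("soft fd limit: {:?}, concurrency cap: {}", limit, cap);
}
```

With the macOS default of 256 and the assumed headroom of 64, a requested concurrency of 200 would be reduced to 192. Whether that headroom is enough depends on what else the process keeps open.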
I'm closing this issue because the SDK behaves correctly. However, it would be nice to include the root cause in this DNS error, for clarity and easier troubleshooting in the future.
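Until the error itself says more, the calling code can walk the error chain and attach its own hint. The sketch below uses only std::error::Error::source from the standard library. The Layer type, the layer and error_chain helpers, and the hint_for heuristic are hypothetical and exist only for this demo; the heuristic is drawn from this thread alone (a "dns error" that went away after raising ulimit -n) and may not hold in other situations.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical wrapper error, used only to build a chain for the demo.
#[derive(Debug)]
struct Layer {
    msg: String,
    source: Option<Box<dyn Error + 'static>>,
}

impl fmt::Display for Layer {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.msg)
    }
}

impl Error for Layer {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        self.source.as_deref()
    }
}

/// Build one layer of the demo chain, optionally wrapping an inner error.
fn layer(msg: &str, source: Option<Layer>) -> Layer {
    Layer {
        msg: msg.to_string(),
        source: source.map(|s| Box::new(s) as Box<dyn Error + 'static>),
    }
}

/// Collect the message of every error in the chain, outermost first.
fn error_chain(err: &(dyn Error + 'static)) -> Vec<String> {
    let mut messages = vec![err.to_string()];
    let mut current = err.source();
    while let Some(cause) = current {
        messages.push(cause.to_string());
        current = cause.source();
    }
    messages
}

/// Heuristic from this thread: a DNS lookup failure under high concurrency
/// may be a symptom of file descriptor exhaustion.
fn hint_for(chain: &[String]) -> Option<&'static str> {
    if chain.iter().any(|m| m.contains("dns error")) {
        Some("check the file descriptor limit (ulimit -n)")
    } else {
        None
    }
}

fn main() {
    // Mirror the shape of the chain reported in the 4th run.
    let dns = layer("dns error: failed to lookup address information", None);
    let io = layer("io error", Some(dns));
    let dispatch = layer("dispatch failure", Some(io));
    let top = layer("Failed to fetch object 281", Some(dispatch));

    let chain = error_chain(&top);
    for (i, msg) in chain.iter().enumerate() {
        println!("{}: {}", i, msg);
    }
    if let Some(hint) = hint_for(&chain) {
        println!("hint: {}", hint);
    }
}
```

The same error_chain function should work on any error type that implements std::error::Error with a 'static bound, since it relies only on source().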
Describe the bug
Hello,
I'm facing an issue when I attempt to fetch a large number of objects concurrently (more than 200 in my case) from Amazon S3. The error I'm getting is:
dns error: failed to lookup address information: nodename nor servname provided, or not known.
Expected Behavior
I expect to be able to fetch thousands of objects concurrently, up to the limits of my system.
Current Behavior
I get the io error:
dns error: failed to lookup address information: nodename nor servname provided, or not known.
when I try to fetch more than 200 objects concurrently.
Reproduction Steps
To replicate the problem, I've created a minimal project that you can find here:
https://github.com/klefevre/aws-sdk-rust-concurrency-issue
Note that reproducing the problem requires a valid S3 bucket containing a number of files.
Possible Solution
No response
Additional Information/Context
No response
Version
Environment details (OS name and version, etc.)
OS: macOS 14.1.2 23B92 arm64, Kernel: 23.1.0
Logs
No response