apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] Writing to AWS S3 fails when using `aws sso` (13.0.0) #38013

Open jayceslesar opened 12 months ago

jayceslesar commented 12 months ago

Describe the bug, including details regarding any error messages, version, and platform.

My org recently switched to using aws sso to configure all of our access, and after making the swap we noticed that we are no longer able to write to S3 with these credentials. When explicitly setting AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID from our previous IAM user accounts, the problem does not exist. For reference, after authenticating with aws sso login and attempting to write some data, I get the following error:

  File "pyarrow/_dataset.pyx", line 3844, in pyarrow._dataset._filesystemdataset_write
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: When testing for existence of bucket 'my-s3-bucket': AWS Error ACCESS_DENIED during HeadBucket operation: No response body.
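
A minimal sketch of the kind of write that fails (bucket and path are placeholders; the exact call in my code differs):

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"x": [1, 2, 3]})
# The S3 filesystem is inferred from the s3:// URI and uses the default
# AWS credential chain; on 13.0.0 this raises the HeadBucket ACCESS_DENIED
# error above when the credentials come from `aws sso login`.
ds.write_dataset(table, "s3://my-s3-bucket/some/prefix", format="parquet")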

Note also that eval $(aws configure export-credentials --format env) does not seem to work either: those AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID values come from the sso login, and once exported they raise the same error.

I have confirmed that this is not an issue in 12.0.1, only 13.0.0

Component(s)

Parquet

rdbisme commented 11 months ago

Wrapping the call in fsspec should work, but it's potentially slower and more memory-hungry.

jayceslesar commented 11 months ago

Wrapping the call in fsspec should work, but it's potentially slower and more memory-hungry.

@rdbisme can you share what you mean with an example?

rdbisme commented 11 months ago

import fsspec

s3_path = "s3://my-bucket/path"
with fsspec.open(s3_path, "wb") as file_stream:
    df.to_parquet(file_stream)  # df is your pandas DataFrame

should work (didn't test, but I think it does). @jayceslesar let me know!

jayceslesar commented 11 months ago

Thanks @rdbisme! It also looks like explicitly passing an initialized s3fs.S3FileSystem instance works as well.
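
A rough sketch of that approach (untested here; bucket and prefix are placeholders, and s3fs resolves the SSO credentials through botocore's default chain):

import pyarrow as pa
import pyarrow.dataset as ds
import s3fs

fs = s3fs.S3FileSystem()  # picks up credentials from `aws sso login`
table = pa.table({"x": [1, 2, 3]})
# Passing the fsspec filesystem explicitly means pyarrow does not construct its own S3FileSystem
ds.write_dataset(table, "my-s3-bucket/some/prefix", filesystem=fs, format="parquet")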

drauschenbach commented 6 months ago

C++ hack:

$ eval $(aws configure export-credentials --format env)

#include <arrow/filesystem/s3fs.h>
#include <cstdlib>
#include <string>

// Note: std::getenv returns nullptr if a variable is unset; check before use
std::string access_key = std::getenv("AWS_ACCESS_KEY_ID");
std::string secret_key = std::getenv("AWS_SECRET_ACCESS_KEY");
std::string session_token = std::getenv("AWS_SESSION_TOKEN");

arrow::fs::S3Options options;
options.ConfigureAccessKey(access_key, secret_key, session_token);
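
For anyone hitting this from Python, a rough equivalent of the same workaround (untested sketch, after running the export-credentials step above) is to pass the exported values to pyarrow's S3FileSystem explicitly:

import os
from pyarrow import fs

s3 = fs.S3FileSystem(
    access_key=os.environ["AWS_ACCESS_KEY_ID"],
    secret_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    session_token=os.environ["AWS_SESSION_TOKEN"],
)
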
pitrou commented 5 months ago

As a starting point, someone should post in that issue 1) what they are doing precisely 2) how it can/should be made to work. Currently the discussion seems to assume Arrow developers are S3 experts, which they are not.