apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Cannot read data if endpoint is s3 on a "secure" Minio server #40754

Open · thinkORo opened this issue 5 months ago

thinkORo commented 5 months ago

Describe the usage question you have. Please include as many useful details as possible.

I would like to read a CSV file from my TLS-secured object storage (minio).

Here is my code and configuration:

from pyarrow import fs, csv, parquet
minio = fs.S3FileSystem(
    endpoint_override="https://localhost:port",
    access_key="user1234",
    secret_key="password1234",
)

dataCSV = minio.open_input_file("bucket/filename.csv")

I get the following error message:

OSError: When reading information for key 'filename.csv' in bucket 'bucket': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 60, SSL peer certificate or SSH remote key was not OK

If I use the same credentials to read the data with DuckDB, it works perfectly.

What am I doing wrong?

Component(s)

Python

thinkORo commented 5 months ago

I tried every parameter I found on the internet to disable certificate validation or to provide a valid certificate to S3FileSystem directly. No success.

Does anyone have any idea, please?

kou commented 5 months ago

I tried every parameter I found on the internet to disable certificate validation or to provide a valid certificate to S3FileSystem directly.

Could you share what you did?

thinkORo commented 5 months ago

Sure. Here is a list of parameters that unfortunately did not work:

import os
os.environ['TLS_SKIP_VERIFY']="TRUE"
os.environ['TLS_VERIFY']="FALSE"
os.environ['VERIFY_CLIENT']="FALSE"
os.environ['VERIFY']="FALSE"
os.environ['CURLOPT_SSL_VERIFYHOST']="FALSE"
os.environ['CURLOPT_SSL_VERIFYPEER']="FALSE"

os.environ['REQUESTS_CA_BUNDLE']='path_to_my.pem'
os.environ['AWS_CA_BUNDLE']='path_to_my.pem'
os.environ['CURL_CA_BUNDLE']='path_to_my.pem'
os.environ['ARROW_SSL_CERT_FILE']='path_to_my.crt'
os.environ['SSL_CERT_FILE']='path_to_my.crt'

From my perspective, the problem is that the S3FileSystem implementation in pyarrow.fs has a different "signature" than s3fs. In s3fs.S3FileSystem I can pass client_kwargs to provide a specific certificate. In pyarrow's implementation I couldn't find an equivalent option.
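For comparison, the s3fs route mentioned above can be sketched as follows. This is a hedged example, not from the thread: the `client_kwargs` dict is forwarded to the underlying botocore client, whose `verify` argument accepts a CA-bundle path; the endpoint placeholder and `path_to_my.pem` are assumptions carried over from the snippets above.

```python
# Sketch: configuring a custom CA bundle with s3fs, which pyarrow's
# fs.S3FileSystem currently has no direct equivalent for.
try:
    import s3fs

    fs_s3 = s3fs.S3FileSystem(
        key="user1234",
        secret="password1234",
        client_kwargs={
            "endpoint_url": "https://localhost:port",  # placeholder endpoint
            "verify": "path_to_my.pem",  # botocore CA-bundle path (assumed)
        },
    )
except ImportError:
    # s3fs is optional here; the sketch only illustrates the API difference.
    fs_s3 = None
```

Construction is lazy in s3fs, so no connection is attempted until the filesystem is actually used.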

But: My initial problem was related to pyiceberg (which is based on pyarrow). And there I got a hint to check the ssl verification parameter by

import ssl
paths = ssl.get_default_verify_paths()
print(paths)

I'm running Python in a virtual environment. There, the "openssl_cafile" reported by ssl points to a cert.pem inside the virtual environment (./envs/name_of_my_venv/ssl/cert.pem), which I had to adjust by appending the content of my CA's PEM.

And with that adjustment I got it to work. More or less :-) But at least without the initially mentioned certificate problem.
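The workaround described above can be sketched like this. It is only a sketch: `path_to_my.pem` is a hypothetical path to the MinIO server's CA certificate, and note that pyarrow's bundled AWS SDK does not necessarily consult the same bundle as Python's ssl module, though in this environment adjusting that file resolved the error.

```python
import os
import shutil
import ssl

# Locate the default CA bundle this Python reports; inside a venv it can
# live under the venv itself (e.g. ./envs/<venv>/ssl/cert.pem).
paths = ssl.get_default_verify_paths()
cafile = paths.openssl_cafile

# Hypothetical path to the internal CA's certificate in PEM format.
my_ca_pem = "path_to_my.pem"

# Append the private CA to the bundle so that verification of the
# internally signed MinIO endpoint can succeed.
if cafile and os.path.exists(cafile) and os.path.exists(my_ca_pem):
    with open(my_ca_pem) as ca, open(cafile, "a") as bundle:
        shutil.copyfileobj(ca, bundle)
```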

kou commented 5 months ago

Thanks.

It seems that we need to add TLS related options to https://github.com/apache/arrow/blob/e3b0bd1feb63d59cd6fb553af976449397b8348e/cpp/src/arrow/filesystem/s3fs.h#L102 and use it for internal AWS SDK for C++.

thinkORo commented 5 months ago

Thanks for double-checking. Should I close this issue now, or what's the best next step? Thank you for your support, and for developing and maintaining such a great framework.

kou commented 5 months ago

Please keep this issue open. The next step is opening a PR that improves S3Options.

kou commented 5 months ago

#37001 may be related.

sadum-vunet commented 2 months ago

Is there a workaround for this?