apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.58k stars 3.54k forks source link

[Python] AWS SDK 1.11 support in pyarrow wheels? #42154

Open johnkerl opened 5 months ago

johnkerl commented 5 months ago

Describe the bug, including details regarding any error messages, version, and platform.

With regard to https://github.com/apache/arrow/issues/40262, is there a plan to update pyarrow's AWS SDK dependency from 1.10 to 1.11? We believe from https://github.com/apache/arrow/blob/fe4d04f081e55ca2de7b1b67b10ad7dca96cfd9e/cpp/thirdparty/versions.txt#L54 that pyarrow is currently using 1.10:

ARROW_AWSSDK_BUILD_VERSION=1.10.55

It appears that a mitigation for https://github.com/apache/arrow/issues/40262 is in AWS SDK 1.11: https://github.com/aws/aws-sdk-cpp/pull/2710

(There's significant backstory on https://github.com/single-cell-data/TileDB-SOMA/pull/2692 and on https://github.com/TileDB-Inc/tiledbsoma-feedstock/pull/171, if backstory is desired. A repro is here: https://github.com/single-cell-data/TileDB-SOMA/pull/2692#issuecomment-2153450984.)

cc @pitrou

Component(s)

Python

johnkerl commented 5 months ago

Also cc @ihnorton @ivirshup @h-vetinari

h-vetinari commented 5 months ago

Conda-forge has been building against aws 1.11 for a long time already, and this also got synched back to the conda tests in arrow itself (which have bitrotted in the meantime, but there are efforts to revive them): https://github.com/apache/arrow/blob/801de2fbcf5bcbce0c019ed4b35ff3fc863b141b/dev/tasks/conda-recipes/.ci_support/linux_64_cuda_compiler_version11.2.yaml#L3-L4

In any case, we run the full test suite on the python side (not the C++ side yet, c.f. https://github.com/apache/arrow/issues/35587), in every feedstock build, and it passes on osx. So I don't see the immediate incompatibility, which I assume is restricted to some corner cases.

You should provide a stacktrace (or ideally, a reproducer) of what fails.

PS. In the past there was once something that kept arrow stuck on aws 1.8 for a long time (which might help for context): https://github.com/aws/aws-sdk-cpp/issues/1809

h-vetinari commented 5 months ago

Looks like https://github.com/TileDB-Inc/TileDB-Py/issues/1990 is relevant, but again, you should really provide an example where arrow crashes or does something wrong, not another downstream project. The fact that the import order seems to matter is already ground for suspecting that there's something else going on here.

ihnorton commented 4 months ago

The specific question here is if/when will arrow wheels update to AWS SDK 1.11? The reason for the question is to understand whether the mitigation for the issue described below will be available "soon", or we need to work around it (rename symbols, further patch the AWS SDK, etc?).

For more background on the issue:

Summarizing: AWS has released a mitigation for the abort, implemented here: https://github.com/aws/aws-sdk-cpp/pull/2710. The mitigation is available in AWS SDK 1.11. TileDB wheels have updated to AWS SDK 1.11, but AFAICT all packages need to be updated for the mitigation to work.

This issue will likely impact any other library that bundles the AWS SDK in a wheel and is loaded at the same time as pyarrow.

kou commented 4 months ago

PyArrow wheels don't use bundled AWS SDK for C++. It uses vcpkg's one:

https://github.com/ursacomputing/crossbow/actions/runs/9544500563/job/26303310398#step:7:559

-- Found AWS SDK for C++, Version: 1.11.201, Install Root:/opt/vcpkg/installed/amd64-linux-static-release, Platform Prefix:, Platform Dependent Libraries: pthread;crypto;ssl;z;curl
jorisvandenbossche commented 4 months ago

The VCPKG version is currently pinned at 2023.11.20:

https://github.com/apache/arrow/blob/eec6f17c8879b469dc3370dad4a7f68f11705a6b/.env#L92

(last updated in https://github.com/apache/arrow/pull/39622)

This could certainly use another update to a more recent vcpkg state (EDIT: this is currently being done in https://github.com/apache/arrow/pull/42171), but so that release (as @kou also showed from the logs) already included AWS SDK 1.11 (https://github.com/microsoft/vcpkg/releases/tag/2023.11.20, it updated it from 1.11.169#2 to 1.11.201)

jorisvandenbossche commented 4 months ago

FWIW, this also means that the latest pyarrow wheels for 16.0.0 should actually already include AWS SDK 1.11

When pyarrow and tiledb-py are installed from PyPI, and imported in the same process, making S3 requests (which happens via the AWS SDK) causes an abort on some platforms.

@ihnorton the crashes you see, is that with the latest pyarrow release from PyPI?

ihnorton commented 4 months ago

@ihnorton the crashes you see, is that with the latest pyarrow release from PyPI?

Yes:

pyarrow                   16.1.0                   pypi_0    pypi
python                    3.12.4          h30c5eda_0_cpython    conda-forge
readline                  8.2                  h92ec313_1    conda-forge
setuptools                70.0.0             pyhd8ed1ab_0    conda-forge
tiledb                    0.30.0                   pypi_0    pypi
ihnorton commented 4 months ago

This could certainly use another update to a more recent vcpkg state already included AWS SDK 1.11 (https://github.com/microsoft/vcpkg/releases/tag/2023.11.20, it updated it from 1.11.169#2 to 1.11.201)

@jorisvandenbossche thanks for the explanation. It looks like the commit I referenced didn't actually make it in to the SDK until 1.11.179: https://github.com/aws/aws-sdk-cpp/commit/1f49f91a97cdc6556c0441010662a3647a3e1480

(EDIT: this is currently being done in https://github.com/apache/arrow/pull/42171), but so that release (as @kou also showed from the logs)

Thanks for the pointer! We'll sit tight and try this again after wheels are released with that update. Much appreciated.

jorisvandenbossche commented 4 months ago

It looks like the commit I referenced didn't actually make it in to the SDK until 1.11.179: aws/aws-sdk-cpp@1f49f91

That should still mean this is included in the pyarrow 16.0.0 wheels, AFAIK (because it should have used 1.11.201)

johnkerl commented 4 months ago

We have tested 16.0 and 17.0-rc and we still see the issue observed in #40262 -- which appears to be waiting for user confirmation. I'll comment there to indicate we believe the referenced AWS SDK commit does not fix the issue.

teo-tsirpanis commented 3 months ago

Is it possible that the multiple AWS SDK confusion would be resolved if the AWS SDKs inside the wheels were compiled with -fvisibility=hidden?

teo-tsirpanis commented 3 months ago

Is it possible that the multiple AWS SDK confusion would be resolved if the AWS SDKs inside the wheels were compiled with -fvisibility=hidden?

The answer is almost definitely yes. Building a custom pyarrow wheel is quite hard for me, but I verified it with the opposite case by following these steps:

teo-tsirpanis commented 3 months ago

Turns out just updating TileDB fixes this issue, but it still will be good to update Arrow to hide the symbols from the AWS SDK, to avoid potentially clashing with another library in the future. I am not planning to do that.