Open johnkerl opened 5 months ago
Also cc @ihnorton @ivirshup @h-vetinari
Conda-forge has been building against aws 1.11 for a long time already, and this also got synched back to the conda tests in arrow itself (which have bitrotted in the meantime, but there are efforts to revive them): https://github.com/apache/arrow/blob/801de2fbcf5bcbce0c019ed4b35ff3fc863b141b/dev/tasks/conda-recipes/.ci_support/linux_64_cuda_compiler_version11.2.yaml#L3-L4
In any case, we run the full test suite on the python side (not the C++ side yet, c.f. https://github.com/apache/arrow/issues/35587), in every feedstock build, and it passes on osx. So I don't see the immediate incompatibility, which I assume is restricted to some corner cases.
You should provide a stacktrace (or ideally, a reproducer) of what fails.
PS. In the past there was once something that kept arrow stuck on aws 1.8 for a long time (which might help for context): https://github.com/aws/aws-sdk-cpp/issues/1809
Looks like https://github.com/TileDB-Inc/TileDB-Py/issues/1990 is relevant, but again, you should really provide an example where arrow crashes or does something wrong, not another downstream project. The fact that the import order seems to matter is already ground for suspecting that there's something else going on here.
The specific question here is if/when will arrow wheels update to AWS SDK 1.11? The reason for the question is to understand whether the mitigation for the issue described below will be available "soon", or we need to work around it (rename symbols, further patch the AWS SDK, etc?).
For more background on the issue:
When pyarrow and tiledb-py are installed from PyPI, and imported in the same process, making S3 requests (which happens via the AWS SDK) causes an abort on some platforms.
Standalone repro: https://gist.github.com/ihnorton/790b575944a5d09674e86a10700f1dab
The top of the stack trace is:
1 libarrow.1601.dylib 0x000000010e08b8d0 aws_fatal_assert + 80
2 libarrow.1601.dylib 0x000000010e08ab98 aws_mem_acquire + 64
3 libarrow.1601.dylib 0x000000010e09dd68 aws_string_new_from_cursor + 76
4 libarrow.1601.dylib 0x000000010e0975e4 aws_json_value_get_from_object + 44
5 libarrow.1601.dylib 0x000000010e083970 aws_endpoints_ruleset_new_from_string + 120
6 libarrow.1601.dylib 0x000000010e01d5a4 _ZN3Aws3Crt9Endpoints10RuleEngineC2ERK15aws_byte_cursorS5_P13aws_allocator + 48
7 libarrow.1601.dylib 0x000000010ddf7180 _ZN3Aws8Endpoint23DefaultEndpointProviderINS_2S321S3ClientConfigurationENS2_8Endpoint19S3BuiltInParametersENS4_25S3ClientContextParametersEEC2EPKcm + 116
8 libtiledb.dylib 0x0000000162bf367c _ZN3Aws2S38S3ClientC2ERKNS_6Client19ClientConfigurationENS2_15AWSAuthV4Signer20PayloadSigningPolicyEbNS0_34US_EAST_1_REGIONAL_ENDPOINT_OPTIONE + 980
9 libtiledb.dylib 0x000000016241a4bc _ZN6tiledb6common11make_sharedINS_2sm14TileDBS3ClientELi66EJRKNS2_12S3ParametersERN3Aws6Client19ClientConfigurationENS8_15AWSAuthV4Signer20PayloadSigningPolicyERKbEEENSt3__110shared_ptrIT_EERAT0__KcDpOT1_ + 92
10 libtiledb.dylib 0x0000000162403624 _ZNK6tiledb2sm2S311init_clientEv + 3428
This pre-existing arrow issue describes the underlying cause of the stack trace above: https://github.com/apache/arrow/issues/40262
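As a rough illustration of how to read the trace, the sketch below parses a few frames copied verbatim from it and reports where a call crosses from one shared library into another. The helper names (`parse_frames`, `library_crossings`) are made up for this example, and the regex assumes the macOS crash-report frame layout shown above. It only makes the library boundary visible; it is not a diagnosis on its own.

```python
import re

# Frames 5, 7, 8 and 10, copied from the stack trace above (innermost first).
TRACE = """\
5 libarrow.1601.dylib 0x000000010e083970 aws_endpoints_ruleset_new_from_string + 120
7 libarrow.1601.dylib 0x000000010ddf7180 _ZN3Aws8Endpoint23DefaultEndpointProviderINS_2S321S3ClientConfigurationENS2_8Endpoint19S3BuiltInParametersENS4_25S3ClientContextParametersEEC2EPKcm + 116
8 libtiledb.dylib 0x0000000162bf367c _ZN3Aws2S38S3ClientC2ERKNS_6Client19ClientConfigurationENS2_15AWSAuthV4Signer20PayloadSigningPolicyEbNS0_34US_EAST_1_REGIONAL_ENDPOINT_OPTIONE + 980
10 libtiledb.dylib 0x0000000162403624 _ZNK6tiledb2sm2S311init_clientEv + 3428
"""

# index, library, address, symbol, "+", offset
FRAME_RE = re.compile(r"^(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\S+)\s+\+\s+\d+$")


def parse_frames(trace):
    """Return (frame_index, library, symbol) for each line that looks like a frame."""
    frames = []
    for line in trace.strip().splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((int(match.group(1)), match.group(2), match.group(3)))
    return frames


def library_crossings(frames):
    """Return (caller_index, caller_lib, callee_index, callee_lib) where the library changes.

    Frames are ordered innermost first, so the next listed frame is the caller.
    """
    crossings = []
    for callee, caller in zip(frames, frames[1:]):
        if callee[1] != caller[1]:
            crossings.append((caller[0], caller[1], callee[0], callee[1]))
    return crossings


for caller_idx, caller_lib, callee_idx, callee_lib in library_crossings(parse_frames(TRACE)):
    print(f"frame {caller_idx} ({caller_lib}) calls into frame {callee_idx} ({callee_lib})")
```

On these frames the only crossing is between frame 8 and frame 7: the `Aws::S3::S3Client` constructor in libtiledb ends up in an `Aws::Endpoint` constructor that lives in libarrow, which is consistent with two copies of the SDK being present in one process.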
There is extensive prior discussion elsewhere on the AWS SDK repo, including https://github.com/aws/aws-sdk-cpp/issues/2699
This is unrelated to conda, where -- as @h-vetinari notes -- the AWS SDK version is up-to-date; additionally, there is only a single copy of the SDK in the process image when both are installed from conda-forge.
Summarizing: AWS has released a mitigation for the abort, implemented here: https://github.com/aws/aws-sdk-cpp/pull/2710. The mitigation is available in AWS SDK 1.11. TileDB wheels have updated to AWS SDK 1.11, but AFAICT all packages need to be updated for the mitigation to work.
This issue will likely impact any other library that bundles the AWS SDK in a wheel and is loaded at the same time as pyarrow.
PyArrow wheels don't use the bundled AWS SDK for C++. They use the one from vcpkg:
https://github.com/ursacomputing/crossbow/actions/runs/9544500563/job/26303310398#step:7:559
-- Found AWS SDK for C++, Version: 1.11.201, Install Root:/opt/vcpkg/installed/amd64-linux-static-release, Platform Prefix:, Platform Dependent Libraries: pthread;crypto;ssl;z;curl
The VCPKG version is currently pinned at 2023.11.20:
https://github.com/apache/arrow/blob/eec6f17c8879b469dc3370dad4a7f68f11705a6b/.env#L92
(last updated in https://github.com/apache/arrow/pull/39622)
This could certainly use another update to a more recent vcpkg state (EDIT: this is currently being done in https://github.com/apache/arrow/pull/42171), but that release (as @kou also showed from the logs) already included AWS SDK 1.11 (https://github.com/microsoft/vcpkg/releases/tag/2023.11.20, which updated it from 1.11.169#2 to 1.11.201).
FWIW, this also means that the latest pyarrow wheels for 16.0.0 should actually already include AWS SDK 1.11
@ihnorton the crashes you see, is that with the latest pyarrow release from PyPI?
Yes:
pyarrow 16.1.0 pypi_0 pypi
python 3.12.4 h30c5eda_0_cpython conda-forge
readline 8.2 h92ec313_1 conda-forge
setuptools 70.0.0 pyhd8ed1ab_0 conda-forge
tiledb 0.30.0 pypi_0 pypi
@jorisvandenbossche thanks for the explanation. It looks like the commit I referenced didn't actually make it in to the SDK until 1.11.179: https://github.com/aws/aws-sdk-cpp/commit/1f49f91a97cdc6556c0441010662a3647a3e1480
(EDIT: this is currently being done in https://github.com/apache/arrow/pull/42171)
Thanks for the pointer! We'll sit tight and try this again after wheels are released with that update. Much appreciated.
It looks like the commit I referenced didn't actually make it in to the SDK until 1.11.179: aws/aws-sdk-cpp@1f49f91
That should still mean this is included in the pyarrow 16.0.0 wheels, AFAIK (because it should have used 1.11.201)
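The version reasoning here can be sketched as a small check. The log line is the one quoted earlier in the thread from the crossbow build, and 1.11.179 is the version named above as the first to contain the referenced commit. The helper `sdk_version` is hypothetical, and the sketch assumes that log reflects what the released wheels were actually built with.

```python
import re

# Log line quoted earlier in this thread (from a crossbow wheel build).
LOG_LINE = (
    "-- Found AWS SDK for C++, Version: 1.11.201, "
    "Install Root:/opt/vcpkg/installed/amd64-linux-static-release, "
    "Platform Prefix:, Platform Dependent Libraries: pthread;crypto;ssl;z;curl"
)

# First SDK version said above to contain the referenced commit.
FIX_VERSION = (1, 11, 179)


def sdk_version(log_line):
    """Extract the AWS SDK version from a 'Found AWS SDK' log line as a tuple of ints."""
    match = re.search(r"Version:\s*(\d+)\.(\d+)\.(\d+)", log_line)
    if match is None:
        raise ValueError("no AWS SDK version found in log line")
    return tuple(int(part) for part in match.groups())


bundled = sdk_version(LOG_LINE)
# Tuples compare element by element, so this is a proper version comparison.
print(bundled, bundled >= FIX_VERSION)
```

Comparing integer tuples rather than strings avoids the usual pitfall where "1.11.99" would sort after "1.11.201" lexicographically.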
We have tested 16.0 and 17.0-rc and we still see the issue observed in #40262 -- which appears to be waiting for user confirmation. I'll comment there to indicate we believe the referenced AWS SDK commit does not fix the issue.
Is it possible that the multiple AWS SDK confusion would be resolved if the AWS SDKs inside the wheels were compiled with -fvisibility=hidden?
The answer is almost definitely yes. Building a custom pyarrow wheel is quite hard for me, but I verified it with the opposite case by following these steps:
1. Build tiledb wheels with the AWS SDK compiled with -fvisibility=hidden.
2. pip install stock tiledb and stock pyarrow.
3. pip uninstall tiledb and install the custom wheel built in the first step.
Turns out just updating TileDB fixes this issue, but it would still be good to update Arrow to hide the symbols from the AWS SDK, to avoid potentially clashing with another library in the future. I am not planning to do that.
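One rough way to check whether a given shared library still exposes AWS SDK symbols is to load it with `ctypes` and try to resolve a few of the C symbol names that appear in the stack trace. This is only a sketch: `visible_aws_symbols` and `exports_symbol` are hypothetical helpers, loading a library this way runs its initializers in the current process, and the lookup may also find symbols that come from the library's dependencies rather than the library itself.

```python
import ctypes

# C symbol names taken from the top frames of the stack trace in this thread.
AWS_SYMBOLS = ("aws_fatal_assert", "aws_mem_acquire", "aws_string_new_from_cursor")


def exports_symbol(lib, symbol):
    """Return True if `symbol` can be resolved through the loaded library handle."""
    try:
        getattr(lib, symbol)  # ctypes raises AttributeError when the lookup fails
    except AttributeError:
        return False
    return True


def visible_aws_symbols(lib_path):
    """Load the shared library at `lib_path` and list which AWS symbols resolve.

    Passing None inspects the symbols already visible in the current process
    (supported on Linux and macOS).
    """
    lib = ctypes.CDLL(lib_path)
    return [name for name in AWS_SYMBOLS if exports_symbol(lib, name)]


if __name__ == "__main__":
    # With no AWS-bundling library loaded, this is expected to print an empty list.
    print(visible_aws_symbols(None))
```

To inspect a wheel, pass the path of the shared library it installs (for example the libarrow or libtiledb file under site-packages; the exact path depends on the platform and wheel). If the SDK was built with -fvisibility=hidden, the expectation is that the list comes back empty for that library.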
Describe the bug, including details regarding any error messages, version, and platform.
With regard to https://github.com/apache/arrow/issues/40262, is there a plan to update pyarrow's AWS SDK dependency from 1.10 to 1.11? We believe from https://github.com/apache/arrow/blob/fe4d04f081e55ca2de7b1b67b10ad7dca96cfd9e/cpp/thirdparty/versions.txt#L54 that pyarrow is currently using 1.10.
It appears that a mitigation for https://github.com/apache/arrow/issues/40262 is in AWS SDK 1.11: https://github.com/aws/aws-sdk-cpp/pull/2710
(There's significant backstory on https://github.com/single-cell-data/TileDB-SOMA/pull/2692 and on https://github.com/TileDB-Inc/tiledbsoma-feedstock/pull/171, if backstory is desired. A repro is here: https://github.com/single-cell-data/TileDB-SOMA/pull/2692#issuecomment-2153450984.)
cc @pitrou
Component(s)
Python