TileDB-Inc / TileDB

The Universal Storage Engine
https://tiledb.com
MIT License

faketime tests stall when S3 support is enabled #4956

Open jdblischak opened 5 months ago

jdblischak commented 5 months ago

x-ref https://app.shortcut.com/tiledb-inc/story/47373


In my nightly builds, if I build with S3 support (TILEDB_S3=ON), the faketime tests added in https://github.com/TileDB-Inc/TileDB/pull/4883 stall. They pass when S3 support is disabled.

xref: https://github.com/jdblischak/centralized-tiledb-nightlies/issues/5#issuecomment-2096691606

Below is a reproducible example I created in an Ubuntu Docker image. Note that I had to install more dependencies than those listed in the Prerequisites of BUILD.md.

docker run --rm -it ubuntu:22.04

# Setup
apt-get update
apt-get install --yes cmake curl g++ gcc git pkg-config tar unzip zip
git clone https://github.com/TileDB-Inc/TileDB.git
cd TileDB
git log -n 1 --oneline
## 9a4b88517 (HEAD -> dev, origin/dev, origin/HEAD) Remove deprecated APIs from tests, part 2. (#4951)

# without S3 support
cmake -B build-libtiledb \
  -D TILEDB_WERROR=ON \
  -D TILEDB_SERIALIZATION=ON \
  -D CMAKE_BUILD_TYPE=Release \
  -D VCPKG_TARGET_TRIPLET=x64-linux-release

cmake --build build-libtiledb -j $(nproc) --config Release

cmake --build build-libtiledb --target check --config Release
## 2: 0.066 s: C API: Test consolidation, fragments/commits out of order
## 2: 0.000 s: - sparse array
## 2:
## 2: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## 2: tiledb_unit is a Catch2 v3.4.0 host application.
## 2: Run with -? for options
## 2:
## 2: -------------------------------------------------------------------------------
## 2: C API: Test vacuuming leaves array dir in a consistent state
## 2: -------------------------------------------------------------------------------
## 2: /TileDB/test/src/unit-capi-consolidation.cc:7378
## 2: ...............................................................................
## 2:
## 2: /TileDB/test/src/unit-capi-consolidation.cc:4656: FAILED:
## 2:   REQUIRE( rc == (expect_fail ? (-1) : (0)) )
## 2: with expansion:
## 2:   0 == -1
## 2:
## 2: terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error'
## 2:   what():  filesystem error: cannot set permissions: No such file or directory [/TileDB/test/tiledb_test/test_consolidate_sparse_array/__fragments/__1715102401556_1715102401556_5b878658b88da5efa423d8eb94514abe_21]
## 2: /TileDB/test/src/unit-capi-consolidation.cc:4656: FAILED:
## 2:   {Unknown expression after the reported line}
## 2: due to a fatal error condition:
## 2:   SIGABRT - Abort (abnormal termination) signal
## 2:
## 2: 0.000 s: C API: Test vacuuming leaves array dir in a consistent state
## 2: ===============================================================================
## 2: test cases:     341 |     340 passed | 1 failed
## 2: assertions: 2798035 | 2798033 passed | 2 failed
## 2:
## 2/3 Test #2: tiledb_unit ......................Subprocess aborted***Exception:  51.68 sec

# I don't know what the above test failure is about. I haven't observed it
# in my nightlies

# with S3 support
cmake -B build-libtiledb-s3 \
  -D TILEDB_WERROR=ON \
  -D TILEDB_SERIALIZATION=ON \
  -D CMAKE_BUILD_TYPE=Release \
  -D VCPKG_TARGET_TRIPLET=x64-linux-release \
  -D TILEDB_S3=ON

cmake --build build-libtiledb-s3 -j $(nproc) --config Release

cmake --build build-libtiledb-s3 --target check --config Release
## [100%] Built target tests
## UpdateCTestConfiguration  from :/TileDB/build-libtiledb-s3/tiledb/test/DartConfiguration.tcl
## UpdateCTestConfiguration  from :/TileDB/build-libtiledb-s3/tiledb/test/DartConfiguration.tcl
## Test project /TileDB/build-libtiledb-s3/tiledb/test
## Constructing a list of tests
## Done constructing a list of tests
## Updating test list for fixtures
## Added 0 tests to meet fixture requirements
## Checking test dependency graph...
## Checking test dependency graph end
## test 1
##     Start 1: tiledb_timing_unit
##
## 1: Test command: /TileDB/build-libtiledb-s3/tiledb/test/tiledb_unit "--durations=yes" "[sub-millisecond]"
## 1: Environment variable modifications:
## 1:  FAKETIME=set:2020-12-24 20:30:00
## 1:  LD_PRELOAD=set:/TileDB/build-libtiledb-s3/vcpkg_installed/x64-linux-release/lib/libfaketime.so
## 1: Test timeout computed to be: 10000000

# it stalls at this point and never completes

KiterLuc commented 5 months ago

@teo-tsirpanis can you take a look?

teo-tsirpanis commented 5 months ago

Will take a look tomorrow.

teo-tsirpanis commented 5 months ago

@jdblischak do you have credentials for S3 configured? It is likely that because S3 is enabled, tiledb_unit tries to connect and it stalls due to retrying. You can make the tests run on local files by passing the --vfs native option.

<opinion> I don't like this behavior. tiledb_unit should by default run only on local files (or maybe even better MemFS but there are some issues that currently prevent it) and running tests on cloud services should be opt-in. We should also remove support for running tests on many cloud services at once; it has few valid use cases (you can just run tiledb_unit many times) and its existence complicates some code. </opinion>

jdblischak commented 5 months ago

> do you have credentials for S3 configured?

No, and I'd prefer not to.

> It is likely that because S3 is enabled, tiledb_unit tries to connect and it stalls due to retrying.

That makes sense. Thanks for diagnosing the problem!

> You can make the tests run on local files by passing the --vfs native option.

Can I pass that flag directly to CMake? Here is how I am currently invoking the TileDB tests:

cmake --build build-libtiledb --target check --config Release

# Is this right?
cmake --build build-libtiledb --target check --config Release --vfs native

teo-tsirpanis commented 5 months ago

It's not currently possible within CMake; you will have to manually run ./build-libtiledb/tiledb/test/tiledb_unit --vfs native.
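
A minimal sketch of that manual run, assuming the test binary has already been built at the path quoted above. TILEDB_UNIT, LIMIT, and run_native are hypothetical names introduced here for illustration, and wrapping the call in GNU coreutils timeout is an addition (not something the thread prescribes) so that an unexpected stall ends with exit status 124 instead of hanging:

```shell
# Hypothetical variables: override them to match your own build directory.
TILEDB_UNIT="${TILEDB_UNIT:-./build-libtiledb/tiledb/test/tiledb_unit}"
LIMIT="${LIMIT:-30m}"

run_native() {
  # --vfs native restricts the tests to the local filesystem.
  # timeout(1) exits with status 124 if LIMIT elapses first.
  timeout "$LIMIT" "$TILEDB_UNIT" --vfs native "$@"
}

if [ -x "$TILEDB_UNIT" ]; then
  run_native
else
  echo "tiledb_unit not found at $TILEDB_UNIT; build the tests first" >&2
fi
```

Any extra arguments given to run_native are forwarded to tiledb_unit, so a Catch2 test spec can be appended to narrow the run.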

jdblischak commented 5 months ago

> you will have to manually run ./build-libtiledb/tiledb/test/tiledb_unit --vfs native

How can I manually build the tests in order to manually run them?

docker run --rm -it ubuntu:22.04

# Setup
apt-get update
apt-get install --yes cmake curl g++ gcc git pkg-config tar unzip zip
git clone https://github.com/TileDB-Inc/TileDB.git
cd TileDB
git log -n 1 --oneline
## 474fc1ef5 (HEAD -> dev, origin/dev, origin/HEAD) Migrate APIs out of StorageManager: array_get_encryption. (#4950)

# with S3 support
cmake -B build-libtiledb-s3 \
  -D TILEDB_WERROR=ON \
  -D TILEDB_SERIALIZATION=ON \
  -D CMAKE_BUILD_TYPE=Release \
  -D VCPKG_TARGET_TRIPLET=x64-linux-release \
  -D TILEDB_S3=ON

cmake --build build-libtiledb-s3 -j $(nproc) --config Release

./build-libtiledb/tiledb/test/tiledb_unit --vfs native
## bash: ./build-libtiledb/tiledb/test/tiledb_unit: No such file or directory

I tried searching the CMake files, but didn't have much luck. I found where the custom target check is defined, but it just calls cmake with --target check, which is circular:

https://github.com/TileDB-Inc/TileDB/blob/474fc1ef554bb954c1fc8b460b0a002bf8d98df0/cmake/TileDB-Superbuild.cmake#L176-L182

I also tried the targets tests and ordinary_unit_tests, but neither of those worked at all.

https://github.com/TileDB-Inc/TileDB/blob/474fc1ef554bb954c1fc8b460b0a002bf8d98df0/CMakeLists.txt#L550-L558

cmake --build build-libtiledb-s3 --target tests --config Release
## gmake: *** No rule to make target 'tests'.  Stop.

cmake --build build-libtiledb-s3 --target ordinary_unit_tests --config Release
## gmake: *** No rule to make target 'ordinary_unit_tests'.  Stop.

teo-tsirpanis commented 5 months ago

> it just calls cmake with --target check, which is circular

The check target is not circular: it invokes CMake in the tiledb subdirectory of the build directory. You will have to first run a build on the outer build directory, and then a second build on build/tiledb. Here's an example of how we do it in CI: https://github.com/TileDB-Inc/TileDB/blob/7387605811b82e994a896a051741e6afd5946b0e/.github/workflows/unit-test-runs.yml#L56-L64
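
The two-stage build described here can be sketched roughly as follows. This is an illustration rather than a copy of the CI workflow: BUILD_DIR and build_tests are hypothetical names, the directory is assumed to be already configured, and the tests target name is the one mentioned earlier in the thread.

```shell
# Hypothetical variable: the outer (superbuild) directory created by `cmake -B`.
BUILD_DIR="${BUILD_DIR:-build-libtiledb-s3}"

build_tests() {
  # Stage 1: build the outer directory (the superbuild).
  cmake --build "$BUILD_DIR" -j "$(nproc)" --config Release || return 1
  # Stage 2: build the test targets from the inner project under <build dir>/tiledb.
  cmake --build "$BUILD_DIR/tiledb" --target tests -j "$(nproc)" --config Release
}

# Only attempt the build if the directory has been configured.
if [ -d "$BUILD_DIR" ]; then
  build_tests
fi
```

Targets such as tests are defined in the inner project, which would explain why asking the outer directory for them fails with "No rule to make target".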

jdblischak commented 5 months ago

> You will have to first run a build on the outer build directory, and then to build/tiledb

@teo-tsirpanis Thanks for the explanation! I was able to build and execute the tests locally. However, the run aborted with one failed test case (two failed assertions). Is this a known failure?

docker run --rm -it ubuntu:22.04

# Setup
apt-get update
apt-get install --yes cmake curl g++ gcc git pkg-config tar unzip zip
git clone https://github.com/TileDB-Inc/TileDB.git
cd TileDB
git log -n 1 --oneline
## 8de7d1ca4 (HEAD -> dev, origin/dev, origin/HEAD) Add v2_23_0 arrays to backward compatibility matrix. (#4965)

# with S3 support
cmake -B build-libtiledb-s3 \
  -D TILEDB_WERROR=ON \
  -D TILEDB_SERIALIZATION=ON \
  -D CMAKE_BUILD_TYPE=Release \
  -D VCPKG_TARGET_TRIPLET=x64-linux-release \
  -D TILEDB_S3=ON

cmake --build build-libtiledb-s3 -j $(nproc) --config Release

# Build unit tests
make -C build-libtiledb-s3/tiledb tests -j $(nproc)
## [100%] Building CXX object test/CMakeFiles/tiledb_unit.dir/src/unit.cc.o
## [100%] Built target ordinary_unit_tests
## [100%] Built target all_unit_tests
## [100%] Linking CXX executable tiledb_unit
## [100%] Built target tiledb_unit
## [100%] Built target tests

# Run unit tests
./build-libtiledb-s3/tiledb/test/tiledb_unit --vfs native
## Randomness seeded to: 158965112
##
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## tiledb_unit is a Catch2 v3.4.0 host application.
## Run with -? for options
##
## -------------------------------------------------------------------------------
## C API: Test vacuuming leaves array dir in a consistent state
## -------------------------------------------------------------------------------
## /TileDB/test/src/unit-capi-consolidation.cc:7378
## ...............................................................................
##
## /TileDB/test/src/unit-capi-consolidation.cc:4656: FAILED:
##   REQUIRE( rc == (expect_fail ? (-1) : (0)) )
## with expansion:
##   0 == -1
##
## terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error'
##   what():  filesystem error: cannot set permissions: No such file or directory [/TileDB/tiledb_test/test_consolidate_sparse_array/__fragments/__1715616656780_1715616656780_71deef232b74224ba6dfd2553ec29ece_21]
## /TileDB/test/src/unit-capi-consolidation.cc:4656: FAILED:
##   {Unknown expression after the reported line}
## due to a fatal error condition:
##   SIGABRT - Abort (abnormal termination) signal
##
## ===============================================================================
## test cases:     341 |     340 passed | 1 failed
## assertions: 3084887 | 3084885 passed | 2 failed
##
## Aborted

jdblischak commented 5 months ago

Next I tried running the tests this way in my nightly setup.

Without S3 support, the tests passed:

All tests passed (13941361 assertions in 1427 test cases)

However, when I enabled S3 support, there were even more test failures than when I ran it locally above:

test cases:     1434 |     1409 passed | 25 failed
assertions: 13863516 | 13863491 passed | 25 failed

teo-tsirpanis commented 5 months ago

Apparently --vfs has no effect in certain newly added tests; see the footnote of https://github.com/TileDB-Inc/TileDB/pull/4126#issuecomment-1992664994 (tracked internally in SC-47373).

jdblischak commented 5 months ago

Ok, so it seems that currently it is not possible to run the test suite with TILEDB_S3=ON if you haven't configured S3 credentials.

jdblischak commented 5 months ago

Ok, for now I enabled S3 support in order to run the tiledbvcf-py tests and stopped running the libtiledb tests (https://github.com/jdblischak/centralized-tiledb-nightlies/commit/3febc7b5b1f60a92a13c9258d894495883305390). Ideally, in the future I can update this to run a subset of the tests that don't require S3 authentication.
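
One possible way to run such a subset is Catch2's tag filtering, which the check target already relies on when it passes "[sub-millisecond]" to tiledb_unit. The sketch below is only an assumption about how that might look: EXCLUDE_TAG defaults to a placeholder, [s3], that may not exist in the TileDB test suite, and the helper names are hypothetical. The tags the binary actually defines can be listed with Catch2's --list-tags option.

```shell
# Hypothetical variables: adjust the path and pick a real tag from --list-tags.
TILEDB_UNIT="${TILEDB_UNIT:-./build-libtiledb-s3/tiledb/test/tiledb_unit}"
EXCLUDE_TAG="${EXCLUDE_TAG:-[s3]}"

list_tags() {
  # Print every tag defined by the test binary (standard Catch2 option).
  "$TILEDB_UNIT" --list-tags
}

run_without_tag() {
  # In a Catch2 test spec, a leading ~ excludes tests matching the pattern.
  # The spec is quoted so the shell does not expand ~ or the brackets.
  "$TILEDB_UNIT" --vfs native "~$EXCLUDE_TAG"
}
```

Since some tests reportedly ignore --vfs, excluding a tag may still leave tests that try to reach S3, so the resulting subset would need to be checked against a build without credentials.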