TileDB-Inc / TileDB-VCF

Efficient variant-call data storage and retrieval library using the TileDB storage library.
https://tiledb-inc.github.io/TileDB-VCF/
MIT License
83 stars 13 forks source link

Import pyarrow-hotfix #669

Closed jdblischak closed 3 months ago

jdblischak commented 4 months ago

Until we can install pyarrow >=14.0.2 in the cloud conda environments, we should ensure that pyarrow-hotfix is installed and imported. From its PyPI Usage section:

pyarrow_hotfix must be imported in your application or library code for it to take effect

xref: https://github.com/TileDB-Inc/tiledb-vcf-feedstock/pull/115

Also, I noticed that pyarrow isn't included in setup_requires

https://github.com/TileDB-Inc/TileDB-VCF/blob/b3df71ea2796c878c3c8c34e3a393accc5cc6ea2/apis/python/setup.py#L284-L292

Which seems unlikely to be true given that pyarrow is imported in setup.py:

https://github.com/TileDB-Inc/TileDB-VCF/blob/b3df71ea2796c878c3c8c34e3a393accc5cc6ea2/apis/python/setup.py#L243

In general, is there a reason that setup_requires, install_requires, and test_requires aren't overly utilized in this repo? I expected to see at least pyarrow in install_requires and dask in test_requires.

jdblischak commented 4 months ago

Frustratingly linux-libtiledb-dev failed with a segmentation fault during the Python tests both here and on the run on my fork. I restarted both

jdblischak commented 4 months ago

linux-libtiledb-dev is consistently failing. Are any of the recent commits to libtiledb potentially affecting this?

Also I manually triggered a nightly build off of the main branch on my fork to see if these failures are at all related to my PR

jdblischak commented 4 months ago

Also I manually triggered a nightly build off of the main branch on my fork to see if these failures are at all related to my PR

Unfortunately the nightly on main passed 😭

jdblischak commented 4 months ago

Maybe it's the version of dask. The nightly on main installed dask 2023.9.3 from its cache. Whereas my PR is installing dask 2024.2.1

jdblischak commented 4 months ago

Restricting the upper bound for dask didn't fix it. Any ideas?

gspowley commented 3 months ago

Based on the test failures:

Current thread 0x000078a9fd4f9740 (most recent call first):
  File "/home/runner/micromamba/envs/py4vcf/lib/python3.9/site-packages/tiledb/ctx.py", line 345 in __init__
  File "/home/runner/micromamba/envs/py4vcf/lib/python3.9/site-packages/tiledb/ctx.py", line 545 in default_ctx
  File "/home/runner/micromamba/envs/py4vcf/lib/python3.9/site-packages/tiledb/highlevel.py", line 337 in _get_ctx
  File "/home/runner/micromamba/envs/py4vcf/lib/python3.9/site-packages/tiledb/highlevel.py", line 28 in open
  File "/home/runner/work/TileDB-VCF/TileDB-VCF/TileDB-VCF/apis/python/tests/test_tiledbvcf.py", line 975 in test_disable_ingestion_tasks

The failing test is trying to open an array (which doesn't exist) with tiledb-py: https://github.com/TileDB-Inc/TileDB-VCF/blob/b3df71ea2796c878c3c8c34e3a393accc5cc6ea2/apis/python/tests/test_tiledbvcf.py#L975-L976

I don't know why this causes a segfault, but the use of tiledb-py is complicating the test. We only need to check that the array path does not exist.

This patch simplifies the test:

diff --git a/apis/python/tests/test_tiledbvcf.py b/apis/python/tests/test_tiledbvcf.py
index 2e4fd3bd..87fd41b6 100755
--- a/apis/python/tests/test_tiledbvcf.py
+++ b/apis/python/tests/test_tiledbvcf.py
@@ -966,23 +966,12 @@ def test_disable_ingestion_tasks(tmp_path):
     if platform.system() != "Linux":
         return

-    # query allele_count array with TileDB
+    # ensure the allele_count and variant_stats arrays are not created
     ac_uri = os.path.join(tmp_path, "dataset", "allele_count")
+    assert not os.path.exists(ac_uri)

-    contig = "1"
-    region = slice(69896)
-    with pytest.raises(Exception):
-        with tiledb.open(ac_uri) as A:
-            df = A.query(attrs=["alt", "count"], dims=["pos"]).df[contig, region]
-
-    # query variant_stats array with TileDB
     vs_uri = os.path.join(tmp_path, "dataset", "variant_stats")
-
-    contig = "1"
-    region = slice(12140)
-    with pytest.raises(Exception):
-        with tiledb.open(vs_uri) as A:
-            df = A.query(attrs=["allele", "ac"], dims=["pos"]).df[contig, region]
+    assert not os.path.exists(vs_uri)

 def test_ingestion_tasks(tmp_path):
gspowley commented 3 months ago

The next failure is also related to using tiledb-py. There must be an unexpected interaction somewhere.

jdblischak commented 3 months ago

The next failure is also related to using tiledb-py. There must be an unexpected interaction somewhere.

Doesn't that suggest there is a compatibility problem between the release version of tiledb-py and the dev release of libtiledb?

Also, I've started getting errors about missing the package dask-expr when importing tiledbvcf-py. It's bizarre though because only some the runs failed. Most bizarre is the macOS workflow. Both the python-standalone job and the Python+external libtiledbvcf jobs installed dask 2024.3.0, but the former failed due to missing dask-expr while the latter passed.

https://github.com/TileDB-Inc/TileDB-VCF/actions/runs/8250806005/job/22566300642?pr=669#step:4:13 https://github.com/TileDB-Inc/TileDB-VCF/actions/runs/8250806005/job/22567028956?pr=669#step:5:14

gspowley commented 3 months ago

Doesn't that suggest there is a compatibility problem between the release version of tiledb-py and the dev release of libtiledb?

Yes, I believe so. I couldn't explain why the regular nightly is passing, but the regular nightly may be configured differently.

jdblischak commented 3 months ago

Quick update. I fixed the dask import errors in #673. However, the linux-libtiledb-dev build is still segfault'ing during the tiledbvcf-py tests, just as before. Also, a new error while running the libtiledbvcf unit tests on macOS.

-------------------------------------------------------------------------------
TileDB-VCF: Test Resume Ingest and Export With Contig Merge
-------------------------------------------------------------------------------
/Users/runner/work/TileDB-VCF/TileDB-VCF/libtiledbvcf/test/src/unit-vcf-export.cc:1744
...............................................................................

/Users/runner/work/TileDB-VCF/TileDB-VCF/libtiledbvcf/test/src/unit-vcf-export.cc:1744: FAILED:
  {Unknown expression after the reported line}
due to unexpected exception with message:
  Error loading metadata; 'version' field has invalid value.

===============================================================================
test cases:   78 |   77 passed | 1 failed
assertions: 6790 | 6789 passed | 1 failed

This PR doesn't touch the libtiledbvcf source code, so I'm hoping this is spurious. I restarted it

jdblischak commented 3 months ago

This PR doesn't touch the libtiledbvcf source code, so I'm hoping this is spurious. I restarted it

Confirmed. It was spurious

jdblischak commented 3 months ago

Still failing the tiledbvcf-py tests:

File "/home/runner/work/TileDB-VCF/TileDB-VCF/TileDB-VCF/apis/python/tests/test_tiledbvcf.py", line 40 in check_if_compatible
File "/home/runner/work/TileDB-VCF/TileDB-VCF/TileDB-VCF/apis/python/tests/test_tiledbvcf.py", line 992 in test_ingestion_tasks

https://github.com/TileDB-Inc/TileDB-VCF/blob/28c5e5d7e920b83b24759915a739e683fc685fc7/apis/python/tests/test_tiledbvcf.py#L38-L40 https://github.com/TileDB-Inc/TileDB-VCF/blob/28c5e5d7e920b83b24759915a739e683fc685fc7/apis/python/tests/test_tiledbvcf.py#L1003

Currently tiledb-py 0.26.0 is getting installed because the conda env is cached. My next idea is to delete the cache to install a more recent tiledb-py

jdblischak commented 3 months ago

Note that the nightly linux-libtiledb-dev failed on my fork because I deleted the GitHub Actions cache

https://github.com/jdblischak/TileDB-VCF/actions/runs/8353245587/job/22864662362

In other words, I think the only reason it is passing on the main repo is because the conda env is cached

jdblischak commented 3 months ago

Rebased onto #679

jdblischak commented 3 months ago

I fixed the segfault (caused by running tiledb.open() with tiledb-py 0.26.0 and libtiledb 2.22 from its dev branch) by building TileDB-Py from source (#679).

To minimize this PR as much as possible (since we can remove pyarrow-hotfix once we can install pyarrow 14+ in TileDB Cloud), I only import pyarrow-hotfix once during the initialization of pyarrow when tiledbvcf-py is imported. I assume this is sufficient to apply the hotfix.

jdblischak commented 3 months ago

No Azure build was triggered for my last commit. Closing and reopening

jdblischak commented 3 months ago

Reopening successfully triggered a new Azure build https://dev.azure.com/TileDB-Inc/CI/_build/results?buildId=38579&view=results