Closed bkmartinjr closed 1 year ago
All api/python/cellxgene_census/tests/experimental/ml/test_pytorch.py unit tests passed on GHA for Python 3.9, Ubuntu latest (22.04).
Successfully installed MarkupSafe-2.1.3 aiobotocore-2.5.0 aiohttp-3.8.4 aioitertools-0.11.0 aiosignal-1.3.1 anndata-0.9.1 async-timeout-4.0.2 attrs-23.1.0 botocore-1.29.76 cellxgene-census-0.1.dev1+g61d9497 cmake-3.26.3 contourpy-1.0.7 cycler-0.11.0 filelock-3.12.0 fonttools-4.39.4 frozenlist-1.3.3 fsspec-2023.5.0 h5py-3.8.0 importlib-resources-5.12.0 jinja2-3.1.2 jmespath-1.0.1 joblib-1.2.0 kiwisolver-1.4.4 lit-16.0.5.post0 llvmlite-0.39.1 matplotlib-3.7.1 mpmath-1.3.0 multidict-6.0.4 natsort-8.3.1 networkx-3.1 numba-0.56.4 numpy-1.23.5 nvidia-cublas-cu11-11.10.3.66 nvidia-cuda-cupti-cu11-11.7.101 nvidia-cuda-nvrtc-cu11-11.7.99 nvidia-cuda-runtime-cu11-11.7.99 nvidia-cudnn-cu11-8.5.0.96 nvidia-cufft-cu11-10.9.0.58 nvidia-curand-cu11-10.2.10.91 nvidia-cusolver-cu11-11.4.0.1 nvidia-cusparse-cu11-11.7.4.91 nvidia-nccl-cu11-2.14.3 nvidia-nvtx-cu11-11.7.91 pandas-2.0.2 patsy-0.5.3 pillow-9.5.0 pyarrow-12.0.0 pynndescent-0.5.10 pyparsing-3.0.9 python-dateutil-2.8.2 pytz-2023.3 s3fs-2023.5.0 scanpy-1.9.3 scikit-learn-1.2.2 scikit-misc-0.2.0 scipy-1.10.1 seaborn-0.12.2 session-info-1.0.0 somacore-1.0.3 statsmodels-0.14.0 stdlib_list-0.8.0 sympy-1.12 threadpoolctl-3.1.0 tiledb-0.21.4 tiledbsoma-1.2.5 torch-2.0.1 torchdata-0.6.1 tqdm-4.65.0 triton-2.0.0 tzdata-2023.3 umap-learn-0.5.3 urllib3-1.26.16 wrapt-1.15.0 yarl-1.9.2
I wonder if it is a package version issue? I can configure a venv if you want to provide a spec, and retry.
I don't consider this a fix, but perturbing how the ExperimentDataPipe object is initialized prior to multiprocessing changes the behavior. Calling dp.shape, which transitively calls the same _init() function as obs_encoders(), allows the test to pass. Alternatively, replacing the obs_encoders() call with iter(dp), which also calls the same _init() function, later causes the workers to segfault instead of hanging. In all cases, the code runs up until the DataLoader spawns child processes that then attempt to use the serialized ExperimentDataPipe object. There be gremlins herein...
--- a/api/python/cellxgene_census/tests/experimental/ml/test_pytorch.py
+++ b/api/python/cellxgene_census/tests/experimental/ml/test_pytorch.py
@@ -364,7 +364,7 @@ def test_experiment_dataloader__multiprocess_pickling(soma_experiment: Experimen
obs_column_names=["label"],
)
dl = experiment_dataloader(dp, num_workers=2)
- dp.obs_encoders() # trigger query building
+ dp.shape # trigger query building
row = next(iter(dl)) # trigger multiprocessing
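For anyone following along, here is a minimal sketch of the general pattern in play: state that is created lazily in the parent process gets carried into the serialized copy unless it is explicitly dropped. This is not the ExperimentDataPipe implementation. The class LazyPipe, its _init() body, and the lock standing in for a process-local handle are all hypothetical and exist only for illustration.

```python
import pickle
import threading


class LazyPipe:
    """Hypothetical stand-in for a data pipe with lazy initialization."""

    def __init__(self, n_rows):
        self.n_rows = n_rows
        self._initialized = False
        self._handle = None  # placeholder for a process-local resource

    def _init(self):
        if self._initialized:
            return
        # Stand-in for opening something that must not cross process
        # boundaries; a threading.Lock cannot be pickled at all.
        self._handle = threading.Lock()
        self._initialized = True

    @property
    def shape(self):
        self._init()  # any accessor that calls _init() has the same side effect
        return (self.n_rows,)

    def __getstate__(self):
        # Drop process-local state so a deserialized copy re-runs _init().
        state = self.__dict__.copy()
        state["_handle"] = None
        state["_initialized"] = False
        return state


pipe = LazyPipe(n_rows=3)
_ = pipe.shape  # initialize in the parent, as the test does before iterating

# Roughly what a worker receives when the object is pickled across processes.
clone = pickle.loads(pickle.dumps(pipe))
print(clone._initialized)  # the copy starts uninitialized
print(clone.shape)         # and lazily initializes itself on first use
```

Note that this only covers the pickling path. With the fork start method, worker processes inherit the parent's objects directly and __getstate__ is never called, so initialized state would cross over untouched.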
Describe the bug
The test_experiment_dataloader__multiprocess_pickling unit test will hang when run on Linux/Python 3.9. I have let it sit for 12+ hours with no change. After a keyboard interrupt (with --full-trace enabled):
Environment
From tiledbsoma.show_package_versions():