OmicsML / dance

DANCE: a deep learning library and benchmark platform for single-cell analysis
https://pydance.readthedocs.io
BSD 2-Clause "Simplified" License

Random `segfault` when running pipeline tuning with `wandb` on certain machines #407

Open RemyLau opened 7 months ago

RemyLau commented 7 months ago

When running the tuning examples recently introduced in #398 (and #406), the process occasionally segfaults, seemingly at random. The issue was later found to be machine-specific: so far I have only hit it on MSU ICER HPCC (Python 3.8.16), while the same example script runs on papermachine without segfaulting.

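Looking at the core dump file (using pystack), the issue appears to be related to Python's threading, more specifically to calling sklearn's `randomized_svd` function (and possibly similar functions in other packages) from a thread started by the `wandb` agent. In the dump, the thread holding the GIL is inside `scipy.linalg.lu`, reached through `randomized_svd`. See the detailed core dump log below.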

(dance) bash-4.2$ pystack core core.113134
Using executable found in the core file: /mnt/home/liurenmi/software/anaconda3/envs/dance/bin/python

Core file information:
state: D zombie: True niceness: 0
pid: 113134 ppid: 112816 sid: 112816
uid: 790872 gid: 2362 pgrp: 113134
executable: python arguments: python main.py

The process died due a segmentation fault accessing address: 0xffffffffffffff70
Traceback for thread 114928 [] (most recent call last):
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 890, in _bootstrap
        self._bootstrap_inner()
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 932, in _bootstrap_inner
        self.run()
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 870, in run
        self._target(*self._args, **self._kwargs)
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 300, in check_internal_messages
        self._loop_check_status(
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 251, in _loop_check_status
        join_requested = self._join_event.wait(timeout=wait_time)
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 558, in wait
        signaled = self._cond.wait(timeout)
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 306, in wait
        gotit = waiter.acquire(True, timeout)

Traceback for thread 114927 [] (most recent call last):
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 890, in _bootstrap
        self._bootstrap_inner()
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 932, in _bootstrap_inner
        self.run()
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 870, in run
        self._target(*self._args, **self._kwargs)
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 268, in check_network_status
        self._loop_check_status(
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 251, in _loop_check_status
        join_requested = self._join_event.wait(timeout=wait_time)
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 558, in wait
        signaled = self._cond.wait(timeout)
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 306, in wait
        gotit = waiter.acquire(True, timeout)

Traceback for thread 114926 [] (most recent call last):
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 890, in _bootstrap
        self._bootstrap_inner()
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 932, in _bootstrap_inner
        self.run()
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 870, in run
        self._target(*self._args, **self._kwargs)
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 286, in check_stop_status
        self._loop_check_status(
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 251, in _loop_check_status
        join_requested = self._join_event.wait(timeout=wait_time)
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 558, in wait
        signaled = self._cond.wait(timeout)
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 306, in wait
        gotit = waiter.acquire(True, timeout)

Traceback for thread 114874 [] (most recent call last):
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 890, in _bootstrap
        self._bootstrap_inner()
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 932, in _bootstrap_inner
        self.run()
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 870, in run
        self._target(*self._args, **self._kwargs)
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/interface/router.py", line 70, in message_loop
        msg = self._read_message()
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/interface/router_sock.py", line 27, in _read_message
        resp = self._sock_client.read_server_response(timeout=1)
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 285, in read_server_response
        data = self._read_packet_bytes(timeout=timeout)
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 255, in _read_packet_bytes
        data = self._sock.recv(self._bufsize)

Traceback for thread 114845 [Has the GIL] (most recent call last):
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 890, in _bootstrap
        self._bootstrap_inner()
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 932, in _bootstrap_inner
        self.run()
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 870, in run
        self._target(*self._args, **self._kwargs)
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 298, in _run_job
        self._function()
    (Python) File "main.py", line 83, in evaluate_pipeline
        preprocessing_pipeline(data)
    (Python) File "/mnt/ufs18/home-026/liurenmi/repo/dance/dance/pipeline.py", line 238, in __call__
        func(*args, **kwargs)
    (Python) File "/mnt/ufs18/home-026/liurenmi/repo/dance/dance/transforms/cell_feature.py", line 56, in __call__
        gene_feat = gene_pca.fit_transform(feat.T)  # decompose into gene features
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/sklearn/utils/_set_output.py", line 157, in wrapped
        data_to_wrap = f(self, X, *args, **kwargs)
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/sklearn/base.py", line 1152, in wrapper
        return fit_method(estimator, *args, **kwargs)
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/sklearn/decomposition/_pca.py", line 460, in fit_transform
        U, S, Vt = self._fit(X)
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/sklearn/decomposition/_pca.py", line 512, in _fit
        return self._fit_truncated(X, n_components, self._fit_svd_solver)
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/sklearn/decomposition/_pca.py", line 616, in _fit_truncated
        U, S, Vt = randomized_svd(
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/sklearn/utils/extmath.py", line 449, in randomized_svd
        Q = randomized_range_finder(
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/sklearn/utils/extmath.py", line 277, in randomized_range_finder
        Q, _ = linalg.lu(safe_sparse_dot(A, Q), permute_l=True)
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/scipy/linalg/_decomp_lu.py", line 220, in lu
        p, l, u, info = flu(a1, permute_l=permute_l, overwrite_a=overwrite_a)

Traceback for thread 114844 [] (most recent call last):
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 890, in _bootstrap
        self._bootstrap_inner()
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 932, in _bootstrap_inner
        self.run()
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 870, in run
        self._target(*self._args, **self._kwargs)
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 178, in _heartbeat
        time.sleep(5)

Traceback for thread 113134 [] (most recent call last):
    (Python) File "main.py", line 108, in <module>
        wandb.agent(sweep_id, function=evaluate_pipeline, count=3)
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/wandb_agent.py", line 581, in agent
        return pyagent(sweep_id, function, entity, project, count)
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 348, in pyagent
        agent.run()
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 326, in run
        self._run_jobs_from_queue()
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 220, in _run_jobs_from_queue
        thread.join()
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 1011, in join
        self._wait_for_tstate_lock()
    (Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
        elif lock.acquire(block, timeout):
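
As a lighter-weight complement to the core dump above, Python's standard `faulthandler` module can write the Python-level traceback of every thread at the moment a fatal signal such as SIGSEGV arrives. The sketch below is only an illustration; the helper name `enable_crash_tracebacks` and the log path are hypothetical and not part of dance or wandb.

```python
import faulthandler


def enable_crash_tracebacks(path):
    """Dump the tracebacks of all threads to `path` on a fatal signal.

    The returned file object must stay open for the lifetime of the
    process, because faulthandler writes to its file descriptor directly.
    """
    log = open(path, "w")
    # all_threads=True reports every thread, not only the one that crashed.
    faulthandler.enable(file=log, all_threads=True)
    return log
```

Calling `enable_crash_tracebacks("crash_traceback.log")` near the top of `main.py` would be one way to use it. The output contains less detail than a pystack report on a core file, but it needs no core dump to be written.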

More sysinfo below.

Machine that failed:

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Machine that did not fail:

NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
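
To make the comparison between the two machines explicit, the os-release dumps above can be parsed into dictionaries and diffed. This is a minimal sketch: `parse_os_release` and `diff_fields` are hypothetical helper names, and the sample strings reuse only a few of the fields shown above.

```python
def parse_os_release(text):
    """Parse KEY=VALUE lines in /etc/os-release style into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')
    return info


def diff_fields(a, b, keys=("NAME", "VERSION_ID")):
    """Return {key: (value_in_a, value_in_b)} for the keys that differ."""
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


failed = parse_os_release('NAME="CentOS Linux"\nVERSION_ID="7"\nID="centos"')
worked = parse_os_release('NAME="Ubuntu"\nVERSION_ID="20.04"\nID=ubuntu')
print(diff_fields(failed, worked))
```

The OS release alone does not identify the cause; fields such as the BLAS/LAPACK build or the glibc version may matter more, but they are not recorded in this report.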

Skipping this for now, but I might come back to fix it later if it turns out to affect users other than myself.
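
Until the root cause is known, one way to keep a crash from taking down the whole tuning process is to run each trial in a fresh interpreter and inspect how it exited. The sketch below is an illustration under stated assumptions, not something dance or wandb does: `run_isolated` is a hypothetical helper, the crashing snippet only simulates a segfault, and the negative return code convention applies to POSIX systems.

```python
import signal
import subprocess
import sys


def run_isolated(code, timeout=None):
    """Run a Python snippet in a fresh interpreter.

    Returns (returncode, signal_name). signal_name is None unless the
    child was killed by a signal, which POSIX reports as a negative
    return code.
    """
    proc = subprocess.run([sys.executable, "-c", code], timeout=timeout)
    if proc.returncode < 0:
        try:
            name = signal.Signals(-proc.returncode).name
        except ValueError:
            # Signal number without a named entry in the enum.
            name = "signal %d" % -proc.returncode
        return proc.returncode, name
    return proc.returncode, None


# Simulated crash: the child sends SIGSEGV to itself.
CRASH = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"

if __name__ == "__main__":
    print(run_isolated("pass"))
    print(run_isolated(CRASH))
```

A caller could retry or skip a trial whose signal name is `SIGSEGV` instead of losing the whole run. This only contains the crash; it does not explain or fix it.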