When running the tuning examples recently introduced in #398 (and #406), there is a random chance of hitting a segfault. The issue was later observed to be machine specific: I have only been getting this random segfault on MSU ICER HPCC (Python 3.8.16). Running the same example script on papermachine does not trigger it.
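Looking at the core dump file (using pystack), the issue appears to be related to Python's threading, more specifically to calling sklearn's randomized SVD function from the worker thread started by the wandb agent (other similar packages may be affected as well). In the dump, the thread holding the GIL is inside scipy's `linalg.lu`, reached through sklearn's `randomized_svd`. See the detailed core dump log below.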
(dance) bash-4.2$ pystack core core.113134
Using executable found in the core file: /mnt/home/liurenmi/software/anaconda3/envs/dance/bin/python
Core file information:
state: D zombie: True niceness: 0
pid: 113134 ppid: 112816 sid: 112816
uid: 790872 gid: 2362 pgrp: 113134
executable: python arguments: python main.py
The process died due a segmentation fault accessing address: 0xffffffffffffff70
Traceback for thread 114928 [] (most recent call last):
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 890, in _bootstrap
self._bootstrap_inner()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 300, in check_internal_messages
self._loop_check_status(
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 251, in _loop_check_status
join_requested = self._join_event.wait(timeout=wait_time)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 558, in wait
signaled = self._cond.wait(timeout)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 306, in wait
gotit = waiter.acquire(True, timeout)
Traceback for thread 114927 [] (most recent call last):
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 890, in _bootstrap
self._bootstrap_inner()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 268, in check_network_status
self._loop_check_status(
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 251, in _loop_check_status
join_requested = self._join_event.wait(timeout=wait_time)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 558, in wait
signaled = self._cond.wait(timeout)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 306, in wait
gotit = waiter.acquire(True, timeout)
Traceback for thread 114926 [] (most recent call last):
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 890, in _bootstrap
self._bootstrap_inner()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 286, in check_stop_status
self._loop_check_status(
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 251, in _loop_check_status
join_requested = self._join_event.wait(timeout=wait_time)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 558, in wait
signaled = self._cond.wait(timeout)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 306, in wait
gotit = waiter.acquire(True, timeout)
Traceback for thread 114874 [] (most recent call last):
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 890, in _bootstrap
self._bootstrap_inner()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/interface/router.py", line 70, in message_loop
msg = self._read_message()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/interface/router_sock.py", line 27, in _read_message
resp = self._sock_client.read_server_response(timeout=1)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 285, in read_server_response
data = self._read_packet_bytes(timeout=timeout)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 255, in _read_packet_bytes
data = self._sock.recv(self._bufsize)
Traceback for thread 114845 [Has the GIL] (most recent call last):
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 890, in _bootstrap
self._bootstrap_inner()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 298, in _run_job
self._function()
(Python) File "main.py", line 83, in evaluate_pipeline
preprocessing_pipeline(data)
(Python) File "/mnt/ufs18/home-026/liurenmi/repo/dance/dance/pipeline.py", line 238, in __call__
func(*args, **kwargs)
(Python) File "/mnt/ufs18/home-026/liurenmi/repo/dance/dance/transforms/cell_feature.py", line 56, in __call__
gene_feat = gene_pca.fit_transform(feat.T) # decompose into gene features
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/sklearn/utils/_set_output.py", line 157, in wrapped
data_to_wrap = f(self, X, *args, **kwargs)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/sklearn/base.py", line 1152, in wrapper
return fit_method(estimator, *args, **kwargs)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/sklearn/decomposition/_pca.py", line 460, in fit_transform
U, S, Vt = self._fit(X)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/sklearn/decomposition/_pca.py", line 512, in _fit
return self._fit_truncated(X, n_components, self._fit_svd_solver)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/sklearn/decomposition/_pca.py", line 616, in _fit_truncated
U, S, Vt = randomized_svd(
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/sklearn/utils/extmath.py", line 449, in randomized_svd
Q = randomized_range_finder(
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/sklearn/utils/extmath.py", line 277, in randomized_range_finder
Q, _ = linalg.lu(safe_sparse_dot(A, Q), permute_l=True)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/scipy/linalg/_decomp_lu.py", line 220, in lu
p, l, u, info = flu(a1, permute_l=permute_l, overwrite_a=overwrite_a)
Traceback for thread 114844 [] (most recent call last):
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 890, in _bootstrap
self._bootstrap_inner()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 178, in _heartbeat
time.sleep(5)
Traceback for thread 113134 [] (most recent call last):
(Python) File "main.py", line 108, in <module>
wandb.agent(sweep_id, function=evaluate_pipeline, count=3)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/wandb_agent.py", line 581, in agent
return pyagent(sweep_id, function, entity, project, count)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 348, in pyagent
agent.run()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 326, in run
self._run_jobs_from_queue()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 220, in _run_jobs_from_queue
thread.join()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 1011, in join
self._wait_for_tstate_lock()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
More sysinfo below.
Machine that failed:
Machine that did not fail:
Skipping this for now, but I might come back later to fix it if it turns out to affect users other than myself.
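For anyone who runs into the same crash and wants to narrow it down without a core dump, here is a minimal sketch using only the standard library. It enables `faulthandler` so that a fatal signal such as SIGSEGV writes the Python tracebacks of all threads to a log file. It then runs a job in a separate thread and joins it, roughly mirroring the thread-per-job pattern visible in the dump (`_run_job` in a worker thread, `thread.join()` in the main thread). The names `run_in_worker_thread` and `placeholder_job`, and the log file name, are hypothetical. The placeholder would need to be swapped for the call under suspicion (e.g. the PCA `fit_transform` step). This is a diagnostic aid under those assumptions, not a fix.

```python
import faulthandler
import os
import tempfile
import threading


def run_in_worker_thread(job, *args, **kwargs):
    """Run ``job`` in a separate thread and wait for it to finish.

    This only imitates the structure seen in the dump (a worker thread
    runs the job while the main thread blocks in ``join``); it is not
    wandb's actual implementation.
    """
    result = {}

    def target():
        result["value"] = job(*args, **kwargs)

    thread = threading.Thread(target=target)
    thread.start()
    thread.join()
    return result.get("value")


def placeholder_job(n):
    # Hypothetical stand-in: replace with the suspicious call,
    # e.g. the PCA fit_transform step from the pipeline.
    return sum(i * i for i in range(n))


# Hypothetical log location; the file must stay open while
# faulthandler is enabled.
log_path = os.path.join(tempfile.gettempdir(), "segfault_traceback.log")
log_file = open(log_path, "w")

# On a fatal signal, dump the tracebacks of all threads to the log file.
faulthandler.enable(file=log_file, all_threads=True)

value = run_in_worker_thread(placeholder_job, 10)
```

If the crash only shows up when the job runs through `run_in_worker_thread` and not when it is called directly from the main thread, that would support the threading hypothesis. If it crashes either way, the problem is more likely in the underlying linear algebra libraries on that machine.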