aertslab / pycisTopic

pycisTopic is a Python module to simultaneously identify cell states and cis-regulatory topics from single cell epigenomics data.

Ray processes failed to startup during `run_cgs_models` #71

Open ehcilc opened 1 year ago

ehcilc commented 1 year ago

Describe the bug: When I run pycisTopic topic models with `run_cgs_models`, the Ray processes fail to start up.

To Reproduce

ray.shutdown()
models = run_cgs_models(cistopic_obj, 
                       n_topics = [5,10,15,20,25,30,35],
                       n_cpu = 200,
                       n_iter = 200,
                       random_state = 666,
                       alpha = 50,
                       alpha_by_topic = True,
                       eta = 0.1,
                       eta_by_topic = False,
                       save_path = None,
                       _temp_dir = tmpDir)

Error output

TimeoutError                              Traceback (most recent call last)
File ~/anaconda3/envs/scenicplus/lib/python3.8/site-packages/ray/_private/node.py:312, in Node.__init__(self, ray_params, head, shutdown_at_exit, spawn_reaper, connect_only)
    311 try:
--> 312     ray._private.services.wait_for_node(
    313         self.redis_address,
    314         self.gcs_address,
    315         self._plasma_store_socket_name,
    316         self.redis_password,
    317     )
    318 except TimeoutError:

File ~/anaconda3/envs/scenicplus/lib/python3.8/site-packages/ray/_private/services.py:385, in wait_for_node(redis_address, gcs_address, node_plasma_store_socket_name, redis_password, timeout)
    384         time.sleep(0.1)
--> 385 raise TimeoutError("Timed out while waiting for node to startup.")

TimeoutError: Timed out while waiting for node to startup.

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
Cell In[17], line 2
      1 ray.shutdown()
----> 2 models = run_cgs_models(cistopic_obj, 
      3                        n_topics = [5,10,15,20,25,30,35],
      4                        n_cpu = 20,
      5                        n_iter = 200,
      6                        random_state = 666,
      7                        alpha = 50,
      8                        alpha_by_topic = True,
      9                        eta = 0.1,
     10                        eta_by_topic = False,
     11                        save_path = None,
     12                        _temp_dir = tmpDir)

File ~/Desktop/WHY/softwares/scenicplus_series/pycisTopic/pycisTopic/lda_models.py:156, in run_cgs_models(cistopic_obj, n_topics, n_cpu, n_iter, random_state, alpha, alpha_by_topic, eta, eta_by_topic, top_topics_coh, save_path, **kwargs)
    154 region_names = cistopic_obj.region_names
    155 cell_names = cistopic_obj.cell_names
--> 156 ray.init(num_cpus=n_cpu, **kwargs)
    157 model_list = ray.get(
    158     [
    159         run_cgs_model.remote(
   (...)
    174     ]
    175 )
    176 ray.shutdown()

File ~/anaconda3/envs/scenicplus/lib/python3.8/site-packages/ray/_private/client_mode_hook.py:105, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
    103     if func.__name__ != "init" or is_client_mode_enabled_by_default:
    104         return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)

File ~/anaconda3/envs/scenicplus/lib/python3.8/site-packages/ray/_private/worker.py:1420, in init(address, num_cpus, num_gpus, resources, object_store_memory, local_mode, ignore_reinit_error, include_dashboard, dashboard_host, dashboard_port, job_config, configure_logging, logging_level, logging_format, log_to_driver, namespace, runtime_env, storage, **kwargs)
   1378     ray_params = ray._private.parameter.RayParams(
   1379         node_ip_address=node_ip_address,
   1380         raylet_ip_address=raylet_ip_address,
   (...)
   1414         node_name=_node_name,
   1415     )
   1416     # Start the Ray processes. We set shutdown_at_exit=False because we
   1417     # shutdown the node in the ray.shutdown call that happens in the atexit
   1418     # handler. We still spawn a reaper process in case the atexit handler
   1419     # isn't called.
-> 1420     _global_node = ray._private.node.Node(
   1421         head=True, shutdown_at_exit=False, spawn_reaper=True, ray_params=ray_params
   1422     )
   1423 else:
   1424     # In this case, we are connecting to an existing cluster.
   1425     if num_cpus is not None or num_gpus is not None:

File ~/anaconda3/envs/scenicplus/lib/python3.8/site-packages/ray/_private/node.py:319, in Node.__init__(self, ray_params, head, shutdown_at_exit, spawn_reaper, connect_only)
    312     ray._private.services.wait_for_node(
    313         self.redis_address,
    314         self.gcs_address,
    315         self._plasma_store_socket_name,
    316         self.redis_password,
    317     )
    318 except TimeoutError:
--> 319     raise Exception(
    320         "The current node has not been updated within 30 "
    321         "seconds, this could happen because of some of "
    322         "the Ray processes failed to startup."
    323     )
    324 node_info = ray._private.services.get_node_to_connect_for_driver(
    325     self.redis_address,
    326     self.gcs_address,
    327     self._raylet_ip_address,
    328     redis_password=self.redis_password,
    329 )
    330 if self._ray_params.node_manager_port == 0:

Exception: The current node has not been updated within 30 seconds, this could happen because of some of the Ray processes failed to startup.

Version (please complete the following information):

Additional context: I searched before asking, but there's no clear solution to this issue. Thanks for your help in advance!
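Two things that might be worth ruling out before re-running (these are assumptions on my part, not confirmed causes): `n_cpu = 200` may exceed the machine's actual core count, and a long `tmpDir` path can push the Unix-domain-socket paths Ray creates under `_temp_dir` past the OS limit (about 107 bytes on Linux), either of which can keep Ray processes from starting. A quick pure-Python sanity check, with `n_cpu` and `tmp_dir` as stand-ins for the values passed to `run_cgs_models`:

```python
import os
import tempfile

# Stand-in values mirroring the call above; substitute your own.
n_cpu = 200
tmp_dir = tempfile.gettempdir()  # stand-in for tmpDir

# 1. Requesting more CPUs than exist oversubscribes the machine.
available = os.cpu_count()
if n_cpu > available:
    print(f"n_cpu={n_cpu} exceeds available cores ({available}); consider lowering it.")

# 2. Ray places Unix domain sockets under _temp_dir; on Linux the full
#    socket path must stay under ~107 bytes, so a deeply nested temp dir
#    can silently break node startup.
socket_path = os.path.join(tmp_dir, "session_latest", "sockets", "plasma_store")
if len(socket_path.encode()) > 107:
    print(f"Temp dir path is too long for Unix sockets: {socket_path}")
```

If either check prints a warning, lowering `n_cpu` or pointing `_temp_dir` at a short path such as `/tmp` would be the first thing to try.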

SeppeDeWinter commented 1 year ago

This issue looks similar to https://github.com/aertslab/pycisTopic/issues/80.

Could you try the steps I outlined in that issue?

Best,

Seppe