gubertoli opened 2 years ago
I ran tests with flwr 0.19.0. When trying to execute two consecutive simulations (fl.simulation.start_simulation), a similar error (exhausting computational resources) arises:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [25], in <cell line: 13>()
4 strategy=fl.server.strategy.FedAvg(
5 fraction_fit=0.5, # Sample 50% of available clients for training
6 fraction_eval=0.1, # Sample 10% of available clients for evaluation
(...)
9 min_available_clients=int(len(clients_list_wo) * 0.75), # Wait until at least 75% clients are available
10 )
12 # Start simulation
---> 13 fl.simulation.start_simulation(
14 client_fn=client_fn,
15 num_clients=len(clients_list_wo),
16 clients_ids=clients_list_wo,
17 num_rounds=NUM_ROUNDS,
18 strategy=strategy,
19 )
File ~/fl/venv/lib/python3.9/site-packages/flwr/simulation/app.py:154, in start_simulation(client_fn, num_clients, clients_ids, client_resources, num_rounds, strategy, client_manager, ray_init_args, keep_initialised)
151 ray.shutdown()
153 # Initialize Ray
--> 154 ray.init(**ray_init_args)
155 log(
156 INFO,
157 "Ray initialized with resources: %s",
158 ray.cluster_resources(),
159 )
161 # Initialize server and server config
File ~/fl/venv/lib/python3.9/site-packages/ray/_private/client_mode_hook.py:105, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
103 if func.__name__ != "init" or is_client_mode_enabled_by_default:
104 return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)
File ~/fl/venv/lib/python3.9/site-packages/ray/worker.py:933, in init(address, num_cpus, num_gpus, resources, object_store_memory, local_mode, ignore_reinit_error, include_dashboard, dashboard_host, dashboard_port, job_config, configure_logging, logging_level, logging_format, log_to_driver, namespace, runtime_env, _enable_object_reconstruction, _redis_max_memory, _plasma_directory, _node_ip_address, _driver_object_store_memory, _memory, _redis_password, _temp_dir, _metrics_export_port, _system_config, _tracing_startup_hook, **kwargs)
895 ray_params = ray._private.parameter.RayParams(
896 node_ip_address=node_ip_address,
897 raylet_ip_address=raylet_ip_address,
(...)
927 metrics_export_port=_metrics_export_port,
928 tracing_startup_hook=_tracing_startup_hook)
929 # Start the Ray processes. We set shutdown_at_exit=False because we
930 # shutdown the node in the ray.shutdown call that happens in the atexit
931 # handler. We still spawn a reaper process in case the atexit handler
932 # isn't called.
--> 933 _global_node = ray.node.Node(
934 head=True,
935 shutdown_at_exit=False,
936 spawn_reaper=True,
937 ray_params=ray_params)
938 else:
939 # In this case, we are connecting to an existing cluster.
940 if num_cpus is not None or num_gpus is not None:
File ~/fl/venv/lib/python3.9/site-packages/ray/node.py:274, in Node.__init__(self, ray_params, head, shutdown_at_exit, spawn_reaper, connect_only)
268 self.get_gcs_client().internal_kv_put(
269 b"tracing_startup_hook",
270 ray_params.tracing_startup_hook.encode(), True,
271 ray_constants.KV_NAMESPACE_TRACING)
273 if not connect_only:
--> 274 self.start_ray_processes()
275 # we should update the address info after the node has been started
276 try:
File ~/fl/venv/lib/python3.9/site-packages/ray/node.py:1114, in Node.start_ray_processes(self)
1110 # Make sure we don't call `determine_plasma_store_config` multiple
1111 # times to avoid printing multiple warnings.
1112 resource_spec = self.get_resource_spec()
1113 plasma_directory, object_store_memory = \
-> 1114 ray._private.services.determine_plasma_store_config(
1115 resource_spec.object_store_memory,
1116 plasma_directory=self._ray_params.plasma_directory,
1117 huge_pages=self._ray_params.huge_pages
1118 )
1119 self.start_raylet(plasma_directory, object_store_memory)
1120 if self._ray_params.include_log_monitor:
File ~/fl/venv/lib/python3.9/site-packages/ray/_private/services.py:1899, in determine_plasma_store_config(object_store_memory, plasma_directory, huge_pages)
1895 plasma_directory = "/dev/shm"
1896 elif (not os.environ.get("RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE")
1897 and object_store_memory >
1898 ray_constants.REQUIRE_SHM_SIZE_THRESHOLD):
-> 1899 raise ValueError(
1900 "The configured object store size ({} GB) exceeds "
1901 "/dev/shm size ({} GB). This will harm performance. "
1902 "Consider deleting files in /dev/shm or increasing its "
1903 "size with "
1904 "--shm-size in Docker. To ignore this warning, "
1905 "set RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE=1.".format(
1906 object_store_memory / 1e9, shm_avail / 1e9))
1907 else:
1908 plasma_directory = ray._private.utils.get_user_temp_dir()
ValueError: The configured object store size (30.562349056 GB) exceeds /dev/shm size (30.562349056 GB). This will harm performance. Consider deleting files in /dev/shm or increasing its size with --shm-size in Docker. To ignore this warning, set RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE=1.
It works if, after the first simulation, I restart the notebook kernel and then run the second one.
This is because two independent simulations aren't aware of each other running... To fix this you need to pass the resources you want to use when calling ray.init() (which by default "sees" all the resources in your system). Alternatively, if you mostly care about the GPU running out of memory, you could set CUDA_VISIBLE_DEVICES
to a different set of GPUs for each simulation.
For example:
CUDA_VISIBLE_DEVICES="0,1,2,3" python <your_experiment>.py # first simulation
CUDA_VISIBLE_DEVICES="4,5,6,7" python <your_experiment>.py # second simulation
This will run two simulations in parallel just fine...
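If you want to stay on a single machine without relying on environment variables, a rough sketch of the first suggestion would be to cap what each simulation asks for via ray_init_args, which Flower forwards to ray.init(); the numbers below are placeholders you would adapt to your machine:

import flwr as fl

# Sketch: cap the resources this simulation claims when Flower calls
# ray.init(), so another simulation (or a second consecutive run) does not
# try to grab the whole machine. The values are placeholders.
fl.simulation.start_simulation(
    client_fn=client_fn,                      # as defined in your notebook
    num_clients=len(clients_list_wo),
    clients_ids=clients_list_wo,
    num_rounds=NUM_ROUNDS,
    strategy=strategy,
    client_resources={"num_cpus": 1},         # resources reserved per client
    ray_init_args={
        "num_cpus": 8,                        # CPUs visible to this simulation
        "num_gpus": 0,                        # raise this if you do use GPUs
        "object_store_memory": 4 * 1024**3,   # ~4 GB object store instead of the default
    },
)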
Could you give this a go, @gubertoli @mofanv?
@jafermarq I am not running two independent simulations simultaneously, but running one and, once it finishes, trying to run another one. Also, my configuration does not have GPUs. Do you think your suggestion still applies to my case?
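To make it concrete, what I run is roughly the following (same names as in the traceback above), twice in the same kernel, one call after the other:

import flwr as fl

# Second, identical call after the first one finishes; the ray.init() inside
# start_simulation then fails with the /dev/shm ValueError shown above.
fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=len(clients_list_wo),
    clients_ids=clients_list_wo,
    num_rounds=NUM_ROUNDS,
    strategy=strategy,
)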
Describe the bug
I have been working with the simulation package fl.simulation.start_simulation, and after tweaking some parameters and re-running the code it seems (judging by the error message) to exhaust the computational resources. Is some kind of garbage collection required? During the try-outs, it seems that after a while it works again (probably some garbage collection by the OS?).
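The error message itself points to an escape hatch (an environment variable set before the next run); a minimal sketch, although as far as I understand it this only silences the /dev/shm check rather than freeing anything:

import os

# Suggested by the ValueError text above: only disables the check that the
# object store fits in /dev/shm; it does not release memory held by a
# previous simulation.
os.environ["RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE"] = "1"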
Steps/Code to Reproduce
I am using as reference the sim.ipynb notebook with a different dataset (bigger than MNIST) and a different DNN model.
Expected Results
Output of INFO, DEBUG, launch_and_fit, launch_and_evaluate messages.
Actual Results