adap / flower

Flower: A Friendly Federated AI Framework
https://flower.ai
Apache License 2.0

Exhausting computational resources with multiple fl.simulation.start_simulation runs (Ray) #1152

Open gubertoli opened 2 years ago

gubertoli commented 2 years ago

Describe the bug

I have been working with the simulation package fl.simulation.start_simulation, and after tweaking some parameters and re-running the code, it seems (based on the error message) to exhaust the computational resources. Perhaps some kind of garbage collection is required?

During the try-outs, it seems that after a while it works again (probably some garbage collection by the OS?).

Steps/Code to Reproduce

I am using the sim.ipynb notebook as a reference, with a different dataset (larger than MNIST) and a different DNN model.

# Create FedAvg strategy
strategy = fl.server.strategy.FedAvg(
        fraction_fit=0.1,  # Sample 10% of available clients for training
        fraction_eval=0.05,  # Sample 5% of available clients for evaluation
        min_fit_clients=5,  # Never sample fewer than 5 clients for training
        min_eval_clients=5,  # Never sample fewer than 5 clients for evaluation
        min_available_clients=int(NUM_CLIENTS * 0.75),  # Wait until at least 75% of clients are available
)

# Start simulation
fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=NUM_CLIENTS,
    clients_ids=clients_list,
    num_rounds=5,
    strategy=strategy,
)

Expected Results

Output of the INFO and DEBUG log messages from launch_and_fit and launch_and_evaluate.

Actual Results

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [78], in <cell line: 13>()
      4 strategy=fl.server.strategy.FedAvg(
      5         fraction_fit=0.1,  # Sample 10% of available clients for training
      6         fraction_eval=0.05,  # Sample 5% of available clients for evaluation
   (...)
      9         min_available_clients=int(NUM_CLIENTS * 0.75),  # Wait until at least 75% clients are available
     10 )
     12 # Start simulation
---> 13 fl.simulation.start_simulation(
     14     client_fn=client_fn,
     15     num_clients=NUM_CLIENTS,
     16     clients_ids=clients_list,
     17     num_rounds=5,
     18     strategy=strategy,
     19 )

File ~/fl/venv/lib/python3.9/site-packages/flwr/simulation/app.py:143, in start_simulation(client_fn, num_clients, clients_ids, client_resources, num_rounds, strategy, ray_init_args)
    140     ray.shutdown()
    142 # Initialize Ray
--> 143 ray.init(**ray_init_args)
    144 log(
    145     INFO,
    146     "Ray initialized with resources: %s",
    147     ray.cluster_resources(),
    148 )
    150 # Initialize server and server config

File ~/fl/venv/lib/python3.9/site-packages/ray/_private/client_mode_hook.py:105, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
    103     if func.__name__ != "init" or is_client_mode_enabled_by_default:
    104         return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)

File ~/fl/venv/lib/python3.9/site-packages/ray/worker.py:933, in init(address, num_cpus, num_gpus, resources, object_store_memory, local_mode, ignore_reinit_error, include_dashboard, dashboard_host, dashboard_port, job_config, configure_logging, logging_level, logging_format, log_to_driver, namespace, runtime_env, _enable_object_reconstruction, _redis_max_memory, _plasma_directory, _node_ip_address, _driver_object_store_memory, _memory, _redis_password, _temp_dir, _metrics_export_port, _system_config, _tracing_startup_hook, **kwargs)
    895     ray_params = ray._private.parameter.RayParams(
    896         node_ip_address=node_ip_address,
    897         raylet_ip_address=raylet_ip_address,
   (...)
    927         metrics_export_port=_metrics_export_port,
    928         tracing_startup_hook=_tracing_startup_hook)
    929     # Start the Ray processes. We set shutdown_at_exit=False because we
    930     # shutdown the node in the ray.shutdown call that happens in the atexit
    931     # handler. We still spawn a reaper process in case the atexit handler
    932     # isn't called.
--> 933     _global_node = ray.node.Node(
    934         head=True,
    935         shutdown_at_exit=False,
    936         spawn_reaper=True,
    937         ray_params=ray_params)
    938 else:
    939     # In this case, we are connecting to an existing cluster.
    940     if num_cpus is not None or num_gpus is not None:

File ~/fl/venv/lib/python3.9/site-packages/ray/node.py:274, in Node.__init__(self, ray_params, head, shutdown_at_exit, spawn_reaper, connect_only)
    268         self.get_gcs_client().internal_kv_put(
    269             b"tracing_startup_hook",
    270             ray_params.tracing_startup_hook.encode(), True,
    271             ray_constants.KV_NAMESPACE_TRACING)
    273 if not connect_only:
--> 274     self.start_ray_processes()
    275     # we should update the address info after the node has been started
    276     try:

File ~/fl/venv/lib/python3.9/site-packages/ray/node.py:1114, in Node.start_ray_processes(self)
   1110 # Make sure we don't call `determine_plasma_store_config` multiple
   1111 # times to avoid printing multiple warnings.
   1112 resource_spec = self.get_resource_spec()
   1113 plasma_directory, object_store_memory = \
-> 1114     ray._private.services.determine_plasma_store_config(
   1115         resource_spec.object_store_memory,
   1116         plasma_directory=self._ray_params.plasma_directory,
   1117         huge_pages=self._ray_params.huge_pages
   1118     )
   1119 self.start_raylet(plasma_directory, object_store_memory)
   1120 if self._ray_params.include_log_monitor:

File ~/fl/venv/lib/python3.9/site-packages/ray/_private/services.py:1899, in determine_plasma_store_config(object_store_memory, plasma_directory, huge_pages)
   1895     plasma_directory = "/dev/shm"
   1896 elif (not os.environ.get("RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE")
   1897       and object_store_memory >
   1898       ray_constants.REQUIRE_SHM_SIZE_THRESHOLD):
-> 1899     raise ValueError(
   1900         "The configured object store size ({} GB) exceeds "
   1901         "/dev/shm size ({} GB). This will harm performance. "
   1902         "Consider deleting files in /dev/shm or increasing its "
   1903         "size with "
   1904         "--shm-size in Docker. To ignore this warning, "
   1905         "set RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE=1.".format(
   1906             object_store_memory / 1e9, shm_avail / 1e9))
   1907 else:
   1908     plasma_directory = ray._private.utils.get_user_temp_dir()

ValueError: The configured object store size (33.928609792 GB) exceeds /dev/shm size (33.928609792 GB). This will harm performance. Consider deleting files in /dev/shm or increasing its size with --shm-size in Docker. To ignore this warning, set RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE=1.
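
One possible workaround, not verified here: keep Ray's object store smaller than /dev/shm by forwarding ray_init_args to start_simulation (the parameter appears in the traceback above, and object_store_memory is a documented ray.init() argument). The memory value below is only a placeholder assumption, not a recommendation:

import flwr as fl

# Sketch: cap the Ray object store so it fits inside /dev/shm.
# 8 GB is an arbitrary placeholder; adjust it to the machine.
ray_init_args = {
    "ignore_reinit_error": True,
    "object_store_memory": 8 * 1024**3,  # in bytes
}

fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=NUM_CLIENTS,
    clients_ids=clients_list,
    num_rounds=5,
    strategy=strategy,
    ray_init_args=ray_init_args,
)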
gubertoli commented 2 years ago

I ran tests with flwr 0.19.0. When trying to execute two consecutive simulations (fl.simulation.start_simulation), a similar error (exhausting computational resources) arises:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [25], in <cell line: 13>()
      4 strategy=fl.server.strategy.FedAvg(
      5         fraction_fit=0.5,  # Sample 50% of available clients for training
      6         fraction_eval=0.1,  # Sample 10% of available clients for evaluation
   (...)
      9         min_available_clients=int(len(clients_list_wo) * 0.75),  # Wait until at least 75% clients are available
     10 )
     12 # Start simulation
---> 13 fl.simulation.start_simulation(
     14     client_fn=client_fn,
     15     num_clients=len(clients_list_wo),
     16     clients_ids=clients_list_wo,
     17     num_rounds=NUM_ROUNDS,
     18     strategy=strategy,
     19 )

File ~/fl/venv/lib/python3.9/site-packages/flwr/simulation/app.py:154, in start_simulation(client_fn, num_clients, clients_ids, client_resources, num_rounds, strategy, client_manager, ray_init_args, keep_initialised)
    151     ray.shutdown()
    153 # Initialize Ray
--> 154 ray.init(**ray_init_args)
    155 log(
    156     INFO,
    157     "Ray initialized with resources: %s",
    158     ray.cluster_resources(),
    159 )
    161 # Initialize server and server config

File ~/fl/venv/lib/python3.9/site-packages/ray/_private/client_mode_hook.py:105, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
    103     if func.__name__ != "init" or is_client_mode_enabled_by_default:
    104         return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)

File ~/fl/venv/lib/python3.9/site-packages/ray/worker.py:933, in init(address, num_cpus, num_gpus, resources, object_store_memory, local_mode, ignore_reinit_error, include_dashboard, dashboard_host, dashboard_port, job_config, configure_logging, logging_level, logging_format, log_to_driver, namespace, runtime_env, _enable_object_reconstruction, _redis_max_memory, _plasma_directory, _node_ip_address, _driver_object_store_memory, _memory, _redis_password, _temp_dir, _metrics_export_port, _system_config, _tracing_startup_hook, **kwargs)
    895     ray_params = ray._private.parameter.RayParams(
    896         node_ip_address=node_ip_address,
    897         raylet_ip_address=raylet_ip_address,
   (...)
    927         metrics_export_port=_metrics_export_port,
    928         tracing_startup_hook=_tracing_startup_hook)
    929     # Start the Ray processes. We set shutdown_at_exit=False because we
    930     # shutdown the node in the ray.shutdown call that happens in the atexit
    931     # handler. We still spawn a reaper process in case the atexit handler
    932     # isn't called.
--> 933     _global_node = ray.node.Node(
    934         head=True,
    935         shutdown_at_exit=False,
    936         spawn_reaper=True,
    937         ray_params=ray_params)
    938 else:
    939     # In this case, we are connecting to an existing cluster.
    940     if num_cpus is not None or num_gpus is not None:

File ~/fl/venv/lib/python3.9/site-packages/ray/node.py:274, in Node.__init__(self, ray_params, head, shutdown_at_exit, spawn_reaper, connect_only)
    268         self.get_gcs_client().internal_kv_put(
    269             b"tracing_startup_hook",
    270             ray_params.tracing_startup_hook.encode(), True,
    271             ray_constants.KV_NAMESPACE_TRACING)
    273 if not connect_only:
--> 274     self.start_ray_processes()
    275     # we should update the address info after the node has been started
    276     try:

File ~/fl/venv/lib/python3.9/site-packages/ray/node.py:1114, in Node.start_ray_processes(self)
   1110 # Make sure we don't call `determine_plasma_store_config` multiple
   1111 # times to avoid printing multiple warnings.
   1112 resource_spec = self.get_resource_spec()
   1113 plasma_directory, object_store_memory = \
-> 1114     ray._private.services.determine_plasma_store_config(
   1115         resource_spec.object_store_memory,
   1116         plasma_directory=self._ray_params.plasma_directory,
   1117         huge_pages=self._ray_params.huge_pages
   1118     )
   1119 self.start_raylet(plasma_directory, object_store_memory)
   1120 if self._ray_params.include_log_monitor:

File ~/fl/venv/lib/python3.9/site-packages/ray/_private/services.py:1899, in determine_plasma_store_config(object_store_memory, plasma_directory, huge_pages)
   1895     plasma_directory = "/dev/shm"
   1896 elif (not os.environ.get("RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE")
   1897       and object_store_memory >
   1898       ray_constants.REQUIRE_SHM_SIZE_THRESHOLD):
-> 1899     raise ValueError(
   1900         "The configured object store size ({} GB) exceeds "
   1901         "/dev/shm size ({} GB). This will harm performance. "
   1902         "Consider deleting files in /dev/shm or increasing its "
   1903         "size with "
   1904         "--shm-size in Docker. To ignore this warning, "
   1905         "set RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE=1.".format(
   1906             object_store_memory / 1e9, shm_avail / 1e9))
   1907 else:
   1908     plasma_directory = ray._private.utils.get_user_temp_dir()

ValueError: The configured object store size (30.562349056 GB) exceeds /dev/shm size (30.562349056 GB). This will harm performance. Consider deleting files in /dev/shm or increasing its size with --shm-size in Docker. To ignore this warning, set RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE=1.

It works if, after the first simulation, I restart the notebook kernel and then run the second simulation.
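
A related sketch, an untested assumption rather than a confirmed fix: explicitly shutting down Ray at the end of the first run, instead of restarting the kernel, so the object store is released before the next simulation:

import ray
import flwr as fl

# First simulation (same arguments as above).
fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=len(clients_list_wo),
    clients_ids=clients_list_wo,
    num_rounds=NUM_ROUNDS,
    strategy=strategy,
)

# Sketch: free Ray's workers and object store before the next run,
# instead of restarting the notebook kernel.
ray.shutdown()

# ...the second simulation can then be started in the same kernel.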

jafermarq commented 2 years ago

This is because two independent simulations aren't aware of each other... to fix this you need to pass the resources you want to use when calling ray.init() (which by default "sees" all the resources on your system). Alternatively, if you mostly care about the GPU running out of memory, you could set CUDA_VISIBLE_DEVICES to different sets of GPUs.

For example:

CUDA_VISIBLE_DEVICES="0,1,2,3" python <your_experiment>.py # first simulation
CUDA_VISIBLE_DEVICES="4,5,6,7" python <your_experiment>.py # second simulation

This will run two simulations in parallel just fine...
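
A minimal sketch of the first suggestion, with placeholder limits (the right values depend on your machine): forward num_cpus / num_gpus to ray.init() through start_simulation's ray_init_args so each simulation only claims part of the system:

# Sketch: restrict the resources this simulation may use.
# The numbers are placeholders, not recommendations.
fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=NUM_CLIENTS,
    clients_ids=clients_list,
    num_rounds=5,
    strategy=strategy,
    ray_init_args={"num_cpus": 8, "num_gpus": 2},
)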

Could you give this a go, @gubertoli @mofanv?

gubertoli commented 2 years ago

@jafermarq I am not running two independent simulations simultaneously; I run one and, once it stops, try to run another one. Also, my configuration does not have any GPUs. Do you think your suggestion still applies to my case?