adap / flower

Flower: A Friendly Federated Learning Framework
https://flower.ai
Apache License 2.0

RuntimeError: Simulation crashed. #2758

Open · buaaYYC opened this issue 8 months ago

buaaYYC commented 8 months ago

What is your question?

```python
# With a dictionary, you tell Flower's VirtualClientEngine that each
# client needs exclusive access to these many resources in order to run
client_resources = {"num_cpus": 1, "num_gpus": 0.0}

# Let's disable tqdm progress bar in the main thread (used by the server)
disable_progress_bar()
```

Running the code below:

```python
history = fl.simulation.start_simulation(
    client_fn=client_fn_callback,  # a callback to construct a client
    num_clients=NUM_CLIENTS,  # total number of clients in the experiment
    config=fl.server.ServerConfig(num_rounds=10),  # let's run for 10 rounds
    strategy=strategy,  # the strategy that will orchestrate the whole FL pipeline
    client_resources=client_resources,
    actor_kwargs={
        "on_actor_init_fn": disable_progress_bar  # disable tqdm on each actor/process spawning virtual clients
    },
)
```

produces the following error:

```
INFO flwr 2023-12-26 17:11:14,661 | app.py:178 | Starting Flower simulation, config: ServerConfig(num_rounds=10, round_timeout=None)
2023-12-26 17:11:17,056 WARNING utils.py:585 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set RAY_USE_MULTIPROCESSING_CPU_COUNT=1 as an env var before starting Ray. Set the env var: RAY_DISABLE_DOCKER_CPU_WARNING=1 to mute this warning.
2023-12-26 17:11:18,167 INFO worker.py:1621 -- Started a local Ray instance.
INFO flwr 2023-12-26 17:11:19,197 | app.py:213 | Flower VCE: Ray initialized with resources: {'CPU': 12.0, 'node:__internal_head__': 1.0, 'accelerator_type:G': 1.0, 'GPU': 1.0, 'object_store_memory': 27794835456.0, 'node:172.17.0.5': 1.0, 'memory': 55589670912.0}
INFO flwr 2023-12-26 17:11:19,199 | app.py:219 | Optimize your simulation with Flower VCE: https://flower.dev/docs/framework/how-to-run-simulations.html
INFO flwr 2023-12-26 17:11:19,200 | app.py:242 | Flower VCE: Resources for each Virtual Client: {'num_cpus': 1, 'num_gpus': 0.0}
INFO flwr 2023-12-26 17:11:19,268 | app.py:288 | Flower VCE: Creating VirtualClientEngineActorPool with 12 actors
INFO flwr 2023-12-26 17:11:19,270 | server.py:89 | Initializing global parameters
INFO flwr 2023-12-26 17:11:19,272 | server.py:276 | Requesting initial parameters from one random client
ERROR flwr 2023-12-26 17:11:24,866 | ray_client_proxy.py:145 | Traceback (most recent call last):
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py", line 138, in _submit_job
    res = self.actor_pool.get_client_result(self.cid, timeout)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 414, in get_client_result
    return self._fetch_future_result(cid)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 300, in _fetch_future_result
    res_cid, res = ray.get(future)  # type: (str, ClientRes)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/ray/_private/worker.py", line 2524, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ClientException): ray::DefaultActor.run() (pid=15387, ip=172.17.0.5, actor_id=ba3372c4e219f3601da6569901000000, repr=<flwr.simulation.ray_transport.ray_actor.DefaultActor object at 0x7f8398162fe0>)
  File "/tmp/ipykernel_13261/2759444332.py", line 15, in client_fn
  File "/root/autodl-tmp/flower/examples/simulation-pytorch/fl_dataset.py", line 53, in load_partition
    return partitioner.load_partition(node_id)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr_datasets/partitioner/iid_partitioner.py", line 50, in load_partition
    return self.dataset.shard(
AttributeError: 'RandomSampler' object has no attribute 'shard'

The above exception was the direct cause of the following exception:

ray::DefaultActor.run() (pid=15387, ip=172.17.0.5, actor_id=ba3372c4e219f3601da6569901000000, repr=<flwr.simulation.ray_transport.ray_actor.DefaultActor object at 0x7f8398162fe0>)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 84, in run
    raise ClientException(str(message)) from ex
flwr.simulation.ray_transport.ray_actor.ClientException:

A ClientException occurred.('\n\tSomething went wrong when running your client workload.\n\tClient 66 crashed when the DefaultActor was running its workload.\n\tException triggered on the client side: Traceback (most recent call last):\n File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 70, in run\n client = check_clientfn_returns_client(client_fn(cid))\n File "/tmp/ipykernel_13261/2759444332.py", line 15, in client_fn\n File "/root/autodl-tmp/flower/examples/simulation-pytorch/fl_dataset.py", line 53, in load_partition\n return partitioner.load_partition(node_id)\n File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr_datasets/partitioner/iid_partitioner.py", line 50, in load_partition\n return self.dataset.shard(\nAttributeError: \'RandomSampler\' object has no attribute \'shard\'\n',)

ERROR flwr 2023-12-26 17:11:24,868 | ray_client_proxy.py:146 | ray::DefaultActor.run() (pid=15387, ip=172.17.0.5, actor_id=ba3372c4e219f3601da6569901000000, repr=<flwr.simulation.ray_transport.ray_actor.DefaultActor object at 0x7f8398162fe0>)
  File "/tmp/ipykernel_13261/2759444332.py", line 15, in client_fn
  File "/root/autodl-tmp/flower/examples/simulation-pytorch/fl_dataset.py", line 53, in load_partition
    return partitioner.load_partition(node_id)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr_datasets/partitioner/iid_partitioner.py", line 50, in load_partition
    return self.dataset.shard(
AttributeError: 'RandomSampler' object has no attribute 'shard'

The above exception was the direct cause of the following exception:

ray::DefaultActor.run() (pid=15387, ip=172.17.0.5, actor_id=ba3372c4e219f3601da6569901000000, repr=<flwr.simulation.ray_transport.ray_actor.DefaultActor object at 0x7f8398162fe0>)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 84, in run
    raise ClientException(str(message)) from ex
flwr.simulation.ray_transport.ray_actor.ClientException:

A ClientException occurred.('\n\tSomething went wrong when running your client workload.\n\tClient 66 crashed when the DefaultActor was running its workload.\n\tException triggered on the client side: Traceback (most recent call last):\n File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 70, in run\n client = check_clientfn_returns_client(client_fn(cid))\n File "/tmp/ipykernel_13261/2759444332.py", line 15, in client_fn\n File "/root/autodl-tmp/flower/examples/simulation-pytorch/fl_dataset.py", line 53, in load_partition\n return partitioner.load_partition(node_id)\n File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr_datasets/partitioner/iid_partitioner.py", line 50, in load_partition\n return self.dataset.shard(\nAttributeError: \'RandomSampler\' object has no attribute \'shard\'\n',)
ERROR flwr 2023-12-26 17:11:24,869 | app.py:313 | ray::DefaultActor.run() (pid=15387, ip=172.17.0.5, actor_id=ba3372c4e219f3601da6569901000000, repr=<flwr.simulation.ray_transport.ray_actor.DefaultActor object at 0x7f8398162fe0>)
  File "/tmp/ipykernel_13261/2759444332.py", line 15, in client_fn
  File "/root/autodl-tmp/flower/examples/simulation-pytorch/fl_dataset.py", line 53, in load_partition
    return partitioner.load_partition(node_id)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr_datasets/partitioner/iid_partitioner.py", line 50, in load_partition
    return self.dataset.shard(
AttributeError: 'RandomSampler' object has no attribute 'shard'

The above exception was the direct cause of the following exception:

ray::DefaultActor.run() (pid=15387, ip=172.17.0.5, actor_id=ba3372c4e219f3601da6569901000000, repr=<flwr.simulation.ray_transport.ray_actor.DefaultActor object at 0x7f8398162fe0>)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 84, in run
    raise ClientException(str(message)) from ex
flwr.simulation.ray_transport.ray_actor.ClientException:

A ClientException occurred.('\n\tSomething went wrong when running your client workload.\n\tClient 66 crashed when the DefaultActor was running its workload.\n\tException triggered on the client side: Traceback (most recent call last):\n File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 70, in run\n client = check_clientfn_returns_client(client_fn(cid))\n File "/tmp/ipykernel_13261/2759444332.py", line 15, in client_fn\n File "/root/autodl-tmp/flower/examples/simulation-pytorch/fl_dataset.py", line 53, in load_partition\n return partitioner.load_partition(node_id)\n File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr_datasets/partitioner/iid_partitioner.py", line 50, in load_partition\n return self.dataset.shard(\nAttributeError: \'RandomSampler\' object has no attribute \'shard\'\n',)
ERROR flwr 2023-12-26 17:11:24,872 | app.py:314 | Traceback (most recent call last):
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/app.py", line 308, in start_simulation
    hist = run_fl(
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/server/app.py", line 225, in run_fl
    hist = server.fit(num_rounds=config.num_rounds, timeout=config.round_timeout)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/server/server.py", line 90, in fit
    self.parameters = self._get_initial_parameters(timeout=timeout)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/server/server.py", line 279, in _get_initial_parameters
    get_parameters_res = random_client.get_parameters(ins=ins, timeout=timeout)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py", line 180, in get_parameters
    res = self._submit_job(get_parameters, timeout)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py", line 147, in _submit_job
    raise ex
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py", line 138, in _submit_job
    res = self.actor_pool.get_client_result(self.cid, timeout)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 414, in get_client_result
    return self._fetch_future_result(cid)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 300, in _fetch_future_result
    res_cid, res = ray.get(future)  # type: (str, ClientRes)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/ray/_private/worker.py", line 2524, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ClientException): ray::DefaultActor.run() (pid=15387, ip=172.17.0.5, actor_id=ba3372c4e219f3601da6569901000000, repr=<flwr.simulation.ray_transport.ray_actor.DefaultActor object at 0x7f8398162fe0>)
  File "/tmp/ipykernel_13261/2759444332.py", line 15, in client_fn
  File "/root/autodl-tmp/flower/examples/simulation-pytorch/fl_dataset.py", line 53, in load_partition
    return partitioner.load_partition(node_id)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr_datasets/partitioner/iid_partitioner.py", line 50, in load_partition
    return self.dataset.shard(
AttributeError: 'RandomSampler' object has no attribute 'shard'

The above exception was the direct cause of the following exception:

ray::DefaultActor.run() (pid=15387, ip=172.17.0.5, actor_id=ba3372c4e219f3601da6569901000000, repr=<flwr.simulation.ray_transport.ray_actor.DefaultActor object at 0x7f8398162fe0>)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 84, in run
    raise ClientException(str(message)) from ex
flwr.simulation.ray_transport.ray_actor.ClientException:

A ClientException occurred.('\n\tSomething went wrong when running your client workload.\n\tClient 66 crashed when the DefaultActor was running its workload.\n\tException triggered on the client side: Traceback (most recent call last):\n File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 70, in run\n client = check_clientfn_returns_client(client_fn(cid))\n File "/tmp/ipykernel_13261/2759444332.py", line 15, in client_fn\n File "/root/autodl-tmp/flower/examples/simulation-pytorch/fl_dataset.py", line 53, in load_partition\n return partitioner.load_partition(node_id)\n File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr_datasets/partitioner/iid_partitioner.py", line 50, in load_partition\n return self.dataset.shard(\nAttributeError: \'RandomSampler\' object has no attribute \'shard\'\n',)

ERROR flwr 2023-12-26 17:11:24,873 | app.py:315 | Your simulation crashed :(. This could be because of several reasons. The most common are:
   > Your system couldn't fit a single VirtualClient: try lowering client_resources.
   > All the actors in your pool crashed. This could be because:
      - You clients hit an out-of-memory (OOM) error and actors couldn't recover from it. Try launching your simulation with more generous client_resources setting (i.e. it seems {'num_cpus': 1, 'num_gpus': 0.0} is not enough for your workload). Use fewer concurrent actors.
      - You were running a multi-node simulation and all worker nodes disconnected. The head node might still be alive but cannot accommodate any actor with resources: {'num_cpus': 1, 'num_gpus': 0.0}.

RayTaskError(ClientException)             Traceback (most recent call last)
File ~/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/app.py:308, in start_simulation(client_fn, num_clients, clients_ids, client_resources, server, config, strategy, client_manager, ray_init_args, keep_initialised, actor_type, actor_kwargs, actor_scheduling)
    306 try:
    307     # Start training
--> 308     hist = run_fl(
    309         server=initialized_server,
    310         config=initialized_config,
    311     )
    312 except Exception as ex:

File ~/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/server/app.py:225, in run_fl(server, config)
    224 """Train a model on the given server and return the History object."""
--> 225 hist = server.fit(num_rounds=config.num_rounds, timeout=config.round_timeout)
    226 log(INFO, "app_fit: losses_distributed %s", str(hist.losses_distributed))

File ~/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/server/server.py:90, in Server.fit(self, num_rounds, timeout)
     89 log(INFO, "Initializing global parameters")
---> 90 self.parameters = self._get_initial_parameters(timeout=timeout)
     91 log(INFO, "Evaluating initial parameters")

File ~/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/server/server.py:279, in Server._get_initial_parameters(self, timeout)
    278 ins = GetParametersIns(config={})
--> 279 get_parameters_res = random_client.get_parameters(ins=ins, timeout=timeout)
    280 log(INFO, "Received initial parameters from one random client")

File ~/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py:180, in RayActorClientProxy.get_parameters(self, ins, timeout)
    175     return maybe_call_get_parameters(
    176         client=client,
    177         get_parameters_ins=ins,
    178     )
--> 180 res = self._submit_job(get_parameters, timeout)
    182 return cast(
    183     common.GetParametersRes,
    184     res,
    185 )

File ~/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py:147, in RayActorClientProxy._submit_job(self, job_fn, timeout)
    146     log(ERROR, ex)
--> 147     raise ex
    149 return res

File ~/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py:138, in RayActorClientProxy._submit_job(self, job_fn, timeout)
    134     self.actor_pool.submit_client_job(
    135         lambda a, c_fn, j_fn, cid: a.run.remote(c_fn, j_fn, cid),
    136         (self.client_fn, job_fn, self.cid),
    137     )
--> 138     res = self.actor_pool.get_client_result(self.cid, timeout)
    140 except Exception as ex:

File ~/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py:414, in VirtualClientEngineActorPool.get_client_result(self, cid, timeout)
    413 # Fetch result belonging to the VirtualClient calling this method
--> 414 return self._fetch_future_result(cid)

File ~/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py:300, in VirtualClientEngineActorPool._fetch_future_result(self, cid)
    299     future: ObjectRef[Any] = self._cid_to_future[cid]["future"]  # type: ignore
--> 300     res_cid, res = ray.get(future)  # type: (str, ClientRes)
    301 except ray.exceptions.RayActorError as ex:

File ~/miniconda3/envs/flower/lib/python3.10/site-packages/ray/_private/auto_init_hook.py:24, in wrap_auto_init.<locals>.auto_init_wrapper(*args, **kwargs)
     23 auto_init_ray()
---> 24 return fn(*args, **kwargs)

File ~/miniconda3/envs/flower/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:103, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
    102     return getattr(ray, func.__name__)(*args, **kwargs)
--> 103 return func(*args, **kwargs)

File ~/miniconda3/envs/flower/lib/python3.10/site-packages/ray/_private/worker.py:2524, in get(object_refs, timeout)
   2523     if isinstance(value, RayTaskError):
-> 2524         raise value.as_instanceof_cause()
   2525     else:

RayTaskError(ClientException): ray::DefaultActor.run() (pid=15387, ip=172.17.0.5, actor_id=ba3372c4e219f3601da6569901000000, repr=<flwr.simulation.ray_transport.ray_actor.DefaultActor object at 0x7f8398162fe0>)
  File "/tmp/ipykernel_13261/2759444332.py", line 15, in client_fn
  File "/root/autodl-tmp/flower/examples/simulation-pytorch/fl_dataset.py", line 53, in load_partition
    return partitioner.load_partition(node_id)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr_datasets/partitioner/iid_partitioner.py", line 50, in load_partition
    return self.dataset.shard(
AttributeError: 'RandomSampler' object has no attribute 'shard'

The above exception was the direct cause of the following exception:

ray::DefaultActor.run() (pid=15387, ip=172.17.0.5, actor_id=ba3372c4e219f3601da6569901000000, repr=<flwr.simulation.ray_transport.ray_actor.DefaultActor object at 0x7f8398162fe0>)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 84, in run
    raise ClientException(str(message)) from ex
flwr.simulation.ray_transport.ray_actor.ClientException:

A ClientException occurred.('\n\tSomething went wrong when running your client workload.\n\tClient 66 crashed when the DefaultActor was running its workload.\n\tException triggered on the client side: Traceback (most recent call last):\n File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 70, in run\n client = check_clientfn_returns_client(client_fn(cid))\n File "/tmp/ipykernel_13261/2759444332.py", line 15, in client_fn\n File "/root/autodl-tmp/flower/examples/simulation-pytorch/fl_dataset.py", line 53, in load_partition\n return partitioner.load_partition(node_id)\n File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr_datasets/partitioner/iid_partitioner.py", line 50, in load_partition\n return self.dataset.shard(\nAttributeError: \'RandomSampler\' object has no attribute \'shard\'\n',)

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
Cell In[20], line 8
      5 # Let's disable tqdm progress bar in the main thread (used by the server)
      6 disable_progress_bar()
----> 8 history = fl.simulation.start_simulation(
      9     client_fn=client_fn_callback,  # a callback to construct a client
     10     num_clients=NUM_CLIENTS,  # total number of clients in the experiment
     11     config=fl.server.ServerConfig(num_rounds=10),  # let's run for 10 rounds
     12     strategy=strategy,  # the strategy that will orchestrate the whole FL pipeline
     13     client_resources=client_resources,
     14     actor_kwargs={
     15         "on_actor_init_fn": disable_progress_bar  # disable tqdm on each actor/process spawning virtual clients
     16     },
     17 )

File ~/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/app.py:332, in start_simulation(client_fn, num_clients, clients_ids, client_resources, server, config, strategy, client_manager, ray_init_args, keep_initialised, actor_type, actor_kwargs, actor_scheduling)
    314 log(ERROR, traceback.format_exc())
    315 log(
    316     ERROR,
    317     "Your simulation crashed :(. This could be because of several reasons."
    (...)
    330     client_resources,
    331 )
--> 332 raise RuntimeError("Simulation crashed.") from ex
    334 finally:
    335     # Stop time monitoring resources in cluster
    336     f_stop.set()

RuntimeError: Simulation crashed.
```

How can this problem be solved?

yan-gao-GY commented 7 months ago

Hi, it seems the issue comes from the data partitioning process. Could you share the related code where you use Flower Datasets?
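
For reference, the traceback ends in `iid_partitioner.py` calling `self.dataset.shard(...)`, so whatever the partitioner is holding as its dataset must be a Hugging Face `datasets.Dataset` (which provides `.shard()`), not a PyTorch `RandomSampler` or `DataLoader`. Below is a minimal sketch of the expected wiring, loosely based on the `simulation-pytorch` example; the dataset name, the `NUM_CLIENTS` value, and the `load_partition` helper are placeholders, not your actual code:

```python
from flwr_datasets import FederatedDataset

NUM_CLIENTS = 100  # placeholder; use the same number of clients as in start_simulation

# FederatedDataset builds an IidPartitioner over a Hugging Face datasets.Dataset,
# which is the object that actually provides the .shard() method used by
# IidPartitioner.load_partition().
fds = FederatedDataset(dataset="mnist", partitioners={"train": NUM_CLIENTS})


def load_partition(node_id: int):
    # Returns a datasets.Dataset containing only this client's samples
    return fds.load_partition(node_id, "train")


# The AttributeError above suggests that a torch sampler (e.g. RandomSampler)
# ended up as the partitioner's dataset instead of a datasets.Dataset, so
# .shard() does not exist on it.
```

If you construct the `IidPartitioner` yourself instead of going through `FederatedDataset`, make sure the dataset you hand it is a `datasets.Dataset` rather than anything from `torch.utils.data`; the conversion to PyTorch `DataLoader`s should happen after `load_partition`, inside your `client_fn`.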

weeebdev commented 6 months ago

> Hi, it seems the issue comes from the data partitioning process. Could you share the related code where you use Flower Datasets?

Same problem, could you help me?