adap / flower

Flower: A Friendly Federated Learning Framework
https://flower.ai
Apache License 2.0

RuntimeError: could not create a primitive descriptor for a matmul primitive #3761

Open Klein920116 opened 1 month ago

Klein920116 commented 1 month ago

Describe the bug

When I run the vertical-fl demo, it fails with the error shown below.

Steps/Code to Reproduce

python simulation.py

Expected Results

The simulation completes and returns the vertical federated learning result.

Actual Results

INFO : Starting Flower simulation, config: num_rounds=1000, no round_timeout
2024-07-10 17:31:23,446 INFO worker.py:1752 -- Started a local Ray instance.
INFO : Flower VCE: Ray initialized with resources: {'node:__internal_head__': 1.0, 'node:10.162.xx.xx': 1.0, 'CPU': 8.0, 'memory': 8653651968.0, 'object_store_memory': 4326825984.0}
INFO : Optimize your simulation with Flower VCE: https://flower.ai/docs/framework/how-to-run-simulations.html
INFO : No client_resources specified. Using minimal resources for clients.
INFO : Flower VCE: Resources for each Virtual Client: {'num_cpus': 1, 'num_gpus': 0.0}
INFO : Flower VCE: Creating VirtualClientEngineActorPool with 8 actors
INFO : [INIT]
INFO : Using initial global parameters provided by strategy
INFO : Evaluating initial global parameters
INFO :
INFO : [ROUND 1]
INFO : configure_fit: strategy sampled 3 clients (out of 3)
INFO : aggregate_fit: received 3 results and 0 failures
ERROR : could not create a primitive descriptor for a matmul primitive

jafermarq commented 1 month ago

Hi @Klein920116, I was able to run the example without any issues. I followed these steps:

  1. Create a new virtual environment (I used Python 3.10.12).
  2. Activate the environment, then run pip install -r requirements.txt
  3. Run the simulation: python simulation.py

Could you try those steps on your side? Which platform are you using (e.g. macOS, Ubuntu, Windows)?
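To make the environments easier to compare, a small script along these lines could print the platform and package versions. This is only a sketch: the package names flwr, torch and ray are assumed from the traceback and log output in this thread, and environment_report is a hypothetical helper, not part of Flower.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("flwr", "torch", "ray")):
    """Collect OS, Python and package versions for a bug report.

    The default package names are assumptions taken from the logs in
    this thread; adjust them to match requirements.txt.
    """
    report = {
        "platform": platform.platform(),
        "python": sys.version.split()[0],
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is missing from the active environment.
            report[name] = "not installed"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Running it inside the activated environment and pasting the output here would show whether the two setups differ in OS or library versions.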

Klein920116 commented 1 month ago

Hi @jafermarq, thanks for your response.

I ran the demo on Ubuntu.

However, I ran the example following your steps and the same error still occurs.

INFO : Flower VCE: Ray initialized with resources: {'CPU': 8.0, 'node:__internal_head__': 1.0, 'node:10.100.132.142': 1.0, 'memory': 9369661440.0, 'object_store_memory': 4684830720.0}
INFO : Optimize your simulation with Flower VCE: https://flower.ai/docs/framework/how-to-run-simulations.html
INFO : No client_resources specified. Using minimal resources for clients.
INFO : Flower VCE: Resources for each Virtual Client: {'num_cpus': 1, 'num_gpus': 0.0}
INFO : Flower VCE: Creating VirtualClientEngineActorPool with 8 actors
INFO : [INIT]
INFO : Using initial global parameters provided by strategy
INFO : Evaluating initial global parameters
INFO :
INFO : [ROUND 1]
INFO : configure_fit: strategy sampled 3 clients (out of 3)
INFO : aggregate_fit: received 3 results and 0 failures
ERROR : could not create a primitive descriptor for a matmul primitive
ERROR : Traceback (most recent call last):
  File "/home/weikang/miniconda3/envs/newenv/lib/python3.10/site-packages/flwr/simulation/app.py", line 323, in start_simulation
    hist = run_fl(
  File "/home/weikang/miniconda3/envs/newenv/lib/python3.10/site-packages/flwr/server/server.py", line 490, in run_fl
    hist, elapsed_time = server.fit(
  File "/home/weikang/miniconda3/envs/newenv/lib/python3.10/site-packages/flwr/server/server.py", line 113, in fit
    res_fit = self.fit_round(
  File "/home/weikang/miniconda3/envs/newenv/lib/python3.10/site-packages/flwr/server/server.py", line 249, in fit_round
    ] = self.strategy.aggregate_fit(server_round, results, failures)
  File "/home/weikang/lwk/vertical-fl/strategy.py", line 76, in aggregate_fit
    output = self.model(embedding_server)
  File "/home/weikang/miniconda3/envs/newenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/weikang/miniconda3/envs/newenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/weikang/lwk/vertical-fl/strategy.py", line 15, in forward
    x = self.fc(x)
  File "/home/weikang/miniconda3/envs/newenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/weikang/miniconda3/envs/newenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/weikang/miniconda3/envs/newenv/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: could not create a primitive descriptor for a matmul primitive

ERROR : Your simulation crashed :(. This could be because of several reasons. The most common are:

Sometimes, issues in the simulation code itself can cause crashes. It's always a good idea to double-check your code for any potential bugs or inconsistencies that might be contributing to the problem. For example:

  • You might be using a class attribute in your clients that hasn't been defined.
  • There could be an incorrect method call to a 3rd party library (e.g., PyTorch).
  • The return types of methods in your clients/strategies might be incorrect.

Your system couldn't fit a single VirtualClient: try lowering client_resources.

All the actors in your pool crashed. This could be because:

  • You clients hit an out-of-memory (OOM) error and actors couldn't recover from it. Try launching your simulation with more generous client_resources setting (i.e. it seems {'num_cpus': 1, 'num_gpus': 0.0} is not enough for your run). Use fewer concurrent actors.
  • You were running a multi-node simulation and all worker nodes disconnected. The head node might still be alive but cannot accommodate any actor with resources: {'num_cpus': 1, 'num_gpus': 0.0}.

Take a look at the Flower simulation examples for guidance https://flower.ai/docs/framework/how-to-run-simulations.html.

Traceback (most recent call last):
  File "/home/weikang/miniconda3/envs/newenv/lib/python3.10/site-packages/flwr/simulation/app.py", line 323, in start_simulation
    hist = run_fl(
  File "/home/weikang/miniconda3/envs/newenv/lib/python3.10/site-packages/flwr/server/server.py", line 490, in run_fl
    hist, elapsed_time = server.fit(
  File "/home/weikang/miniconda3/envs/newenv/lib/python3.10/site-packages/flwr/server/server.py", line 113, in fit
    res_fit = self.fit_round(
  File "/home/weikang/miniconda3/envs/newenv/lib/python3.10/site-packages/flwr/server/server.py", line 249, in fit_round
    ] = self.strategy.aggregate_fit(server_round, results, failures)
  File "/home/weikang/lwk/vertical-fl/strategy.py", line 76, in aggregate_fit
    output = self.model(embedding_server)
  File "/home/weikang/miniconda3/envs/newenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/weikang/miniconda3/envs/newenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/weikang/lwk/vertical-fl/strategy.py", line 15, in forward
    x = self.fc(x)
  File "/home/weikang/miniconda3/envs/newenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/weikang/miniconda3/envs/newenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/weikang/miniconda3/envs/newenv/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: could not create a primitive descriptor for a matmul primitive

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/weikang/lwk/vertical-fl/simulation.py", line 16, in <module>
    hist = fl.simulation.start_simulation(
  File "/home/weikang/miniconda3/envs/newenv/lib/python3.10/site-packages/flwr/simulation/app.py", line 359, in start_simulation
    raise RuntimeError("Simulation crashed.") from ex
RuntimeError: Simulation crashed.
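The traceback ends in F.linear, called from the server model's forward in strategy.py, so one way to narrow this down might be to check whether a bare linear layer runs on this machine outside of Flower. The sketch below assumes PyTorch is installed in the same environment; linear_smoke_test and the layer sizes are hypothetical and are not taken from the example's code.

```python
import importlib.util


def linear_smoke_test(in_features=12, out_features=1, batch=4):
    """Run one nn.Linear forward pass on CPU and return a status string.

    The sizes are placeholders, not the ones used in strategy.py.
    """
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch

    layer = torch.nn.Linear(in_features, out_features)
    x = torch.randn(batch, in_features)
    try:
        out = layer(x)
    except RuntimeError as err:
        # The same matmul error here would point at the PyTorch/CPU
        # setup rather than at the Flower simulation itself.
        return f"failed: {err}"
    return f"ok: output shape {tuple(out.shape)}, dtype {out.dtype}"


print(linear_smoke_test())
```

If this passes, the next thing to inspect would be the shape and dtype of embedding_server just before line 76 of strategy.py.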

Klein920116 commented 1 month ago

Could anyone help with this question? Thanks a lot.

jafermarq commented 1 month ago

@Klein920116 Searching online for your error (RuntimeError: could not create a primitive descriptor for a matmul primitive) suggests that it can be caused by a lack of memory. From the logs, it seems your machine has ~8 GB of memory available for the simulation engine.
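The ~8 GB figure comes from the 'memory' value in the "Ray initialized with resources" log line from the first report. A minimal sketch of that conversion follows; memory_gib is a hypothetical helper, and the value is only what Ray reported to the simulation engine, which is not necessarily the machine's total RAM.

```python
import re

# Log line copied from the first report in this thread.
LOG_LINE = (
    "INFO : Flower VCE: Ray initialized with resources: "
    "{'node:__internal_head__': 1.0, 'node:10.162.xx.xx': 1.0, 'CPU': 8.0, "
    "'memory': 8653651968.0, 'object_store_memory': 4326825984.0}"
)


def memory_gib(log_line, key="memory"):
    """Extract a byte count for `key` from the log line and convert to GiB."""
    match = re.search(rf"'{key}': ([0-9.]+)", log_line)
    if match is None:
        raise ValueError(f"{key!r} not found in log line")
    return float(match.group(1)) / 2**30


print(f"memory: {memory_gib(LOG_LINE):.2f} GiB")
print(f"object store: {memory_gib(LOG_LINE, 'object_store_memory'):.2f} GiB")
```

For this log line the script reports about 8.06 GiB of memory and 4.03 GiB of object store.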

Klein920116 commented 1 month ago

> @Klein920116 checking online for your error: RuntimeError: could not create a primitive descriptor for a matmul primitive suggests that it can be caused by lack of memory. From the logs it seems your machine has ~8GB of memory available for the simulation engine.

@jafermarq That does not seem to be the cause: the error still occurs even though my machine has more than 60 GB of memory available.

Klein920116 commented 4 weeks ago

Could you please list the full configuration needed to run the vertical federated learning example?

Thank you in advance. @jafermarq