Open cchanzl opened 3 years ago
The line you mention should not cause a problem - it's just multiplying a fairly small tensor by an integer. I just ran it on a fresh install and it ran as expected on a Mac.
I will run on a Ubuntu machine to confirm.
In the meantime, can you also try printing out the parameters to `agg_params` (`key`, `state_dicts`, `update_sizes`) in the first line of `agg_params` when `key == 'fc1.weight'` (this corresponds to the shape that you mention)? It would be useful to see what's actually getting passed to it.
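The suggested debug print might look something like this - a sketch only, since `agg_params`'s exact signature inside dc-federated is assumed from the discussion, and numpy arrays stand in for torch tensors:

```python
import numpy as np

def agg_params(key, state_dicts, update_sizes):
    # Debug print for the problematic layer only (sketch; the actual
    # dc-federated function body is omitted here).
    if key == 'fc1.weight':
        print(f"key: {key}")
        print(f"num state_dicts: {len(state_dicts)}")
        print(f"shapes: {[sd[key].shape for sd in state_dicts]}")
        print(f"update_sizes: {update_sizes}")
    # ... rest of the aggregation logic ...

# Example call with dummy data mimicking fc1.weight of shape (128, 9216)
dummy = [{'fc1.weight': np.zeros((128, 9216))} for _ in range(3)]
agg_params('fc1.weight', dummy, [640, 640, 640])
```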
Also, I noticed you ran with --digit-class 0 on multiple workers. Ideally you should run with --digit-class 0, 1 and 2 for the three different workers - this results in the workers using different datasets that together cover all the digits. Otherwise the server-side test performance will forever remain suboptimal.
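To illustrate the kind of per-worker partitioning that --digit-class produces (a toy example with a hypothetical `worker_subset` helper, not dc-federated's actual API):

```python
# Toy dataset of (image, digit-label) pairs; each digit 0-9 appears 3 times.
dataset = [(f"img_{i}", i % 10) for i in range(30)]

def worker_subset(dataset, digit_class):
    """Return only the samples whose label matches digit_class."""
    return [(img, lbl) for img, lbl in dataset if lbl == digit_class]

# Each worker trains on a different slice of the label space.
for d in (0, 1, 2):
    sub = worker_subset(dataset, d)
    print(f"worker with --digit-class {d}: {len(sub)} samples")
```

Running every worker with the same digit class would give them identical data distributions, so the aggregated model never sees the remaining digits.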
Thanks for your quick reply.
Terminal output from server.txt
Thanks for your terminal output.
I noticed you are using python 3.8 - the library and its dependencies in requirements.txt have so far been tested only with 3.7. It appears that there may be a compatibility issue between python 3.8 and torch 1.4.0 and torchvision 0.5.0 (which are the pinned versions in dc-federated):
Can you try recreating the virtual environment with python 3.7?
Or alternatively you might try installing the torch versions compatible with python 3.8 (in the above link) in the venv - however, I can't guarantee that would work.
Thanks for spotting that. I have now used python 3.7. The issue was related to pip not using shims correctly, as described in the issue linked below. That is now fixed.
https://github.com/pyenv/pyenv/issues/1342
However, the issue persists and I am still unable to get past "Updating the global model".
An updated terminal output is attached.
Thank you so much for your help. New terminal output.txt
I can now reproduce the error on Ubuntu 20.04. My current assessment is that the error is caused by an incompatibility between the gevent library (used to support multiple requests in parallel) and pytorch's multiprocessing capability on Ubuntu. gevent modifies ('monkey-patches') some standard libraries at run-time to get really good performance - it seems this modification does not quite work with pytorch's own multiprocessing facilities. I am looking into potential workarounds.
Thanks for your help. Hope to hear from you soon. In the meantime, is there an alternative OS that you would recommend to get this up and running?
It should work as is on Mac/OSX.
@cchanzl Ok so I have found a workaround - I will create a branch and post the update soon. I have also created a minimal example showing where the problem comes from (and the work-around) and created bug reports in both pytorch issues and gunicorn issues. Hopefully, we will be able to get to the bottom of this. This is a really weird problem.
Hi @hassan-digicatapult , thanks for the support.
I look forward to the workaround and to implementing dc-federated's framework soon.
@cchanzl I have pushed the fix in this branch
I've tested this on my local Ubuntu 18.04 and AWS EC2 18.04 machines and it works on both - it will be interesting to see if it works for you on 20.04.
The changes are minimal - however I would like to understand the issue better before I make a PR into develop. The discussion is ongoing in the pytorch issue I linked to above
Thanks @hassan-digicatapult - I confirm that the MNIST test example works on my end based on the branch you created.
Great @cchanzl ! I'll keep this issue open for now to see if we can resolve the underlying issue in a more satisfying manner in the pytorch issues.
@hassan-digicatapult @digigarlab @jc-digicatapult @nathandigicatapult
Description
Global model update does not complete when running the MNIST FedAvg example.
Steps to Reproduce
Expected behaviour:
Global model should be updated and each worker should be notified of the change.
Actual behaviour:
Nothing happens, process is stuck at "Updating the global model."
Reproduces how often:
100%
System: I am running WSL Ubuntu 20.04, and it appears that this might be the reason why it is so slow, particularly at the line `agg_val = state_dicts[0][key] * update_sizes[0]` in fed_avg_server.py for torch.Size([128, 9216]) * 640.
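The line in question is the first step of a FedAvg-style weighted average. A minimal numpy sketch of what that aggregation computes (assuming `update_sizes` holds per-worker sample counts; fed_avg_server.py's actual implementation may differ in detail):

```python
import numpy as np

def fed_avg(state_dicts, update_sizes, key):
    # Weighted sum of each worker's tensor for this key...
    agg_val = state_dicts[0][key] * update_sizes[0]
    for sd, n in zip(state_dicts[1:], update_sizes[1:]):
        agg_val = agg_val + sd[key] * n
    # ...normalised by the total number of samples across workers.
    return agg_val / sum(update_sizes)

# Two workers with small stand-in tensors for fc1.weight
sds = [{'fc1.weight': np.full((2, 3), 1.0)},
       {'fc1.weight': np.full((2, 3), 3.0)}]
avg = fed_avg(sds, [10, 30], 'fc1.weight')
print(avg[0][0])  # → 2.5, i.e. (1.0*10 + 3.0*30) / 40
```

As noted earlier in the thread, this is just an elementwise tensor-by-integer multiply, so the operation itself is cheap even at shape (128, 9216); the hang comes from the gevent/pytorch interaction rather than the arithmetic.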