digicatapult / dc-federated

A python library for federated learning supporting consortium scale deployment.
Apache License 2.0

Global model does not finish updating when running MNIST example #7

Open cchanzl opened 3 years ago

cchanzl commented 3 years ago

@hassan-digicatapult @digigarlab @jc-digicatapult @nathandigicatapult

Description

Global model does not finish updating when running the MNIST FedAvg example.

Steps to Reproduce

  1. run "mnist_fed_avg_server.py"
  2. run "mnist_fed_avg_worker.py" 3 times, once for each of the 3 workers

Expected behaviour:

The global model should be updated and each worker should be notified of the change.

Actual behaviour:

Nothing happens; the process is stuck at "Updating the global model."

Reproduces how often:

100%

System: I am running WSL Ubuntu 20.04 and it appears that this might be the reason why it is so slow, particularly at the line agg_val = state_dicts[0][key] * update_sizes[0] in fed_avg_server.py (tensor shape torch.Size([128, 9216]), update size 640).

[Screenshot: dc_federated]

hassan-digicatapult commented 3 years ago

The line you mention should not cause a problem - it's just multiplying a fairly small tensor by an integer. I just ran it on a fresh install and it ran as expected on a Mac.
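
For reference, a weighted-average aggregation of that shape typically looks something like the sketch below - a generic FedAvg sketch built around the quoted line, not a copy of the actual fed_avg_server.py code.

```python
# Generic FedAvg-style aggregation sketch (not the actual dc-federated code):
# each parameter is a weighted average of the workers' parameters, weighted
# by each worker's update size.
import torch


def aggregate(state_dicts, update_sizes):
    total_size = sum(update_sizes)
    aggregated = {}
    for key in state_dicts[0]:
        # start with the first worker's tensor scaled by its update size ...
        agg_val = state_dicts[0][key] * update_sizes[0]
        # ... then accumulate the contributions from the remaining workers
        for sd, size in zip(state_dicts[1:], update_sizes[1:]):
            agg_val = agg_val + sd[key] * size
        aggregated[key] = agg_val / total_size
    return aggregated


if __name__ == "__main__":
    # three fake workers, each contributing an fc1.weight of the shape above
    sds = [{"fc1.weight": torch.randn(128, 9216)} for _ in range(3)]
    print(aggregate(sds, [640, 640, 640])["fc1.weight"].shape)
```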

[Screenshot of the successful run, 2021-07-05]

I will run on a Ubuntu machine to confirm.

In the meantime, can you please share the output of pip freeze from your venv and of top while the server is running, along with your machine specs (CPU and RAM)?

Can you also try printing out the parameters to agg_params (key, state_dicts, update_sizes) in the first line of agg_params when key == 'fc1.weight' (this corresponds to the shape you mention)? It would be useful to see what's actually getting passed to it.
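
For instance, a minimal version of that logging could look like the sketch below; the signature follows the names used in this thread and may not match the library exactly.

```python
# Minimal debug-logging sketch for the request above. The signature
# agg_params(key, state_dicts, update_sizes) is taken from the discussion;
# the real aggregation body in fed_avg_server.py is elided here.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def agg_params(key, state_dicts, update_sizes):
    if key == 'fc1.weight':  # the torch.Size([128, 9216]) parameter discussed above
        logger.info("key: %s", key)
        logger.info("worker tensor shapes: %s", [sd[key].shape for sd in state_dicts])
        logger.info("update_sizes: %s", update_sizes)
    # ... original aggregation logic continues here ...
```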

Also, I noticed you ran with digit-class 0 on multiple workers - ideally you should run with --digit-class 0, 1 and 2 for the three different workers. This results in the workers using different datasets that together cover all the digits; otherwise the server-side test performance will forever remain suboptimal.
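
For example, a hypothetical launcher along the lines below would start the three workers with distinct digit classes. The --digit-class flag comes from the example discussed here; any other options the worker script needs (e.g. the server address) are assumptions and omitted.

```python
# Hypothetical launcher: start the three MNIST example workers with
# different --digit-class values so their datasets differ. Flags the
# worker script may additionally require (e.g. server host/port) are omitted.
import subprocess

procs = [
    subprocess.Popen(["python", "mnist_fed_avg_worker.py", "--digit-class", str(d)])
    for d in (0, 1, 2)
]
for proc in procs:
    proc.wait()
```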

cchanzl commented 3 years ago

Thanks for your quick reply.

  1. I was running on the main branch for the screenshot above, but I have since forked it to add logger.info() calls for easier debugging. That is how I discovered that the model is stuck at that particular line (see the attached complete terminal printout, which includes the requested information).

Terminal output from server.txt

  2. See below for the pip freeze output
  3. See below for the top output
  4. Intel(R) Core(TM) i5-7300HQ 2.5GHz with 16GB RAM

[Screenshots attached: pip freeze and top output]

hassan-digicatapult commented 3 years ago

Thanks for your terminal output.

I noticed you are using python 3.8 - the library and its dependencies in requirements.txt have so far been tested only with 3.7. It appears that there may be a compatibility issue between python 3.8 and torch 1.4.0 / torchvision 0.5.0 (which are the pinned versions in dc-federated):

https://stackoverflow.com/questions/60137572/issues-installing-pytorch-1-4-no-matching-distribution-found-for-torch-1-4

Can you try recreating the virtual environment with python 3.7?

Or alternatively you might try installing the torch versions compatible with python 3.8 (in the above link) in the venv - however, I can't guarantee that would work.
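
A quick way to confirm what a given venv actually resolves to is a small sanity check like the one below (just a check snippet, not part of dc-federated):

```python
# Sanity-check snippet (not part of dc-federated): print the interpreter
# and library versions to confirm the venv matches the pinned requirements
# (python 3.7, torch 1.4.0, torchvision 0.5.0).
import sys

import torch
import torchvision

print("python     :", sys.version.split()[0])
print("torch      :", torch.__version__)
print("torchvision:", torchvision.__version__)
```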

cchanzl commented 3 years ago

Thanks for catching that. I am now using python 3.7. The issue was related to pip not using shims correctly, as described in the issue linked below. That is now fixed.

https://github.com/pyenv/pyenv/issues/1342

However, the issue persists and I am still unable to get past "Updating the global model".

An updated terminal output is attached.

Thank you so much for your help. New terminal output.txt

hassan-digicatapult commented 3 years ago

I can now reproduce the error on Ubuntu 20.04. My current assessment is that the error is being caused by an incompatibility between the gevent library (used to support multiple requests in parallel) and pytorch's multiprocessing capability on Ubuntu. gevent modifies ('monkey-patches') some standard libraries at run-time to get really good performance - it seems this modification does not quite work with pytorch's own multi-processing facilities. I am looking into potential work-arounds.
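
As an illustration of the kind of interaction being described - and only an illustration, not the minimal reproduction itself - the toy below applies gevent's monkey-patching before importing torch and then performs the same tensor-by-integer multiply the server stalls on; whether it actually hangs will depend on the platform and library versions.

```python
# Illustrative toy of the gevent/pytorch interaction described above. This
# is NOT the library's server code, and it may or may not hang outside the
# gunicorn/gevent server context, depending on platform and versions.
from gevent import monkey

monkey.patch_all()  # replaces standard threading/socket primitives at import time

import torch  # imported after patching, as happens inside a patched server process

a = torch.randn(128, 9216)  # same shape as the parameter from the MNIST example
b = a * 640                 # the tensor-by-integer multiply the server stalls on
print("done:", b.shape)
```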

cchanzl commented 3 years ago

Thanks for your help. Hope to hear from you soon. In the meantime, is there an alternative OS that you would recommend to get this up and running?

hassan-digicatapult commented 3 years ago

It should work as is on Mac/OSX.

hassan-digicatapult commented 3 years ago

@cchanzl Ok, so I have found a workaround - I will create a branch and post the update soon. I have also created a minimal example showing where the problem comes from (and the workaround) and filed bug reports in both the pytorch and gunicorn issue trackers. Hopefully, we will be able to get to the bottom of this. This is a really weird problem.

https://github.com/pytorch/pytorch/issues/61660

https://github.com/benoitc/gunicorn/issues/2608

cchanzl commented 3 years ago

Hi @hassan-digicatapult , thanks for the support.

I look forward to the workaround and to implementing dc-federated's framework soon.

hassan-digicatapult commented 3 years ago

@cchanzl I have pushed the fix in this branch

I've tested this on my local Ubuntu 18.04 and AWS EC2 18.04 machines and it works on both - it will be interesting to see if it works for you on 20.04.

[Screenshot: gevent-running]

The changes are minimal - however, I would like to understand the issue better before I make a PR into develop. The discussion is ongoing in the pytorch issue I linked to above.

cchanzl commented 3 years ago

Thanks @hassan-digicatapult, I confirm that the MNIST example works on my end using the branch you created.

[Screenshot of the working run attached]

hassan-digicatapult commented 3 years ago

Great @cchanzl ! I'll keep this issue open for now to see if we can resolve the underlying issue in a more satisfying manner in the pytorch issues.