adap / flower

Flower: A Friendly Federated AI Framework
https://flower.ai

Memory consumption rising during simulation #700

Open ipremuzic opened 3 years ago

ipremuzic commented 3 years ago

Hi! I tried to run your Single-Machine Simulation of Federated Learning Systems example and encountered something that looks like a memory leak: memory usage during the training simulation just keeps rising.

I cloned the example (no changes were made to the example code), installed the project dependencies using Poetry as recommended in the instructions, my Python version is 3.6.9, and the package versions are the same as in pyproject.toml:

python = "^3.6.2"
flwr = "^0.15.0"  # For development: { path = "../../", develop = true }
tensorflow-cpu = "^2.4.1"

I'm running Ubuntu 18.04.5 on a machine with 48 GB of RAM. I tried to run 100 clients for 100 rounds, but after just a few rounds my system ran out of memory. On a second attempt I monitored memory consumption and observed that it rises quite quickly during training. The same behaviour occurs with fewer clients (e.g. 10 clients), but memory consumption rises more slowly, so training actually manages to finish before all the memory is used up. I don't think this is intended behaviour. Do you have any information on this issue?

Thanks for all the help.

tanertopal commented 3 years ago

Hi @ipremuzic, I'll look into it and give you feedback soon.

tanertopal commented 3 years ago

@ipremuzic I looked into this. Fortunately, I could not find a memory leak: my setup always maxed out on memory usage after a few dozen rounds. TensorFlow is quite a memory-hungry beast, so the best I could do after a few optimizations was 40 clients on 45 GB. If you are interested, I can adjust the simulation example to be slightly more memory efficient.
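
To give a concrete idea of the kind of optimization I mean (a rough sketch, not the exact change I would make to the example): the main lever is releasing TensorFlow's accumulated graph state between local training runs.

import gc

import tensorflow as tf


def fit_client(build_model, weights, x_train, y_train):
    """Run one client's local training, then release TensorFlow state.

    `build_model` is an illustrative callable that constructs a fresh Keras model.
    """
    model = build_model()
    model.set_weights(weights)
    model.fit(x_train, y_train, epochs=1, batch_size=32, verbose=0)
    new_weights = model.get_weights()
    del model
    tf.keras.backend.clear_session()  # drop accumulated Keras/TF graph state
    gc.collect()                      # encourage Python to free memory right away
    return new_weights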

We already have on our roadmap the possibility of simulating systems with an arbitrarily large number of clients, as long as a single client fits into memory.
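
Conceptually, the idea is a client factory: a client is only instantiated when it is sampled for a round and released afterwards, so memory is bounded by the number of concurrently active clients rather than the total number of simulated clients. A very rough illustration of the pattern (illustrative sketch only, not the planned API):

class VirtualClient:
    """Stand-in client that loads its data only when constructed."""

    def __init__(self, cid: str):
        self.cid = cid
        self.data = list(range(10))  # placeholder for "load partition `cid`"


def client_fn(cid: str) -> VirtualClient:
    # Called only when client `cid` is sampled for a round; the returned
    # object is dropped after the round, so memory stays bounded by the
    # number of concurrently active clients.
    return VirtualClient(cid)


# e.g. a round that samples clients 0 and 7 would instantiate only those two:
round_clients = [client_fn(str(cid)) for cid in (0, 7)]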

ipremuzic commented 3 years ago

@tanertopal That is strange, because I see rising memory consumption throughout the training process. Yes, after a few rounds memory usage rises at a slower rate, but it keeps rising until the end. The rate also depends on the number of active clients and on the number of clients sampled each round. If I run a simulation with 40 clients for 200 rounds and fraction_fit set to 0.5 (20 clients sampled every round), it never finishes: all memory is used up before the end of training. The same setup for 100 rounds manages to finish with a few GB of free memory left. A simulation with 40 clients for 200 rounds but with fraction_fit set to 0.01 (2 clients sampled every round) also manages to finish training.
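
(For context, fraction_fit here is the FedAvg strategy argument that controls what fraction of the available clients is sampled for training each round. The server-side setup I am varying looks roughly like this; argument names follow flwr's FedAvg and may differ slightly between versions:)

import flwr as fl

# Sketch of the strategy configuration being varied (not the exact example code).
strategy = fl.server.strategy.FedAvg(
    fraction_fit=0.5,          # sample 50% of available clients for training each round
    min_fit_clients=2,         # never train on fewer than this many clients
    min_available_clients=40,  # wait until all 40 simulated clients are connected
)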

I am interested in a more memory-efficient simulation example if you can provide it.

We already have on our roadmap the possibility of simulating systems with an arbitrarily large number of clients, as long as a single client fits into memory.

If I understand correctly, in the future it will be possible to simulate systems with many clients as long as just one client fits into memory? If so, do you have an idea when this will be available?

Also, could you provide more details about the setup you used when testing this issue (Python version, etc.)?

tanertopal commented 3 years ago

@ipremuzic I will create a PR this week and then we can discuss the impact you observe.

If I understand correctly, in the future it will be possible to simulate systems with many clients as long as just one client fits into memory? If so, do you have an idea when this will be available?

There is no specific timeline, but I will discuss it and see how we can prioritize it.

Also, could you provide more details about the setup you used when testing this issue (Python version, etc.)?

I have run all code changes with the run.sh script in the example. The script runs the example in Docker; the Docker container uses Ubuntu 20.04. My machine has 64 GB of memory. The host OS is Windows, although I am using WSL with Ubuntu 18.04 as the host for Docker itself.

tanertopal commented 3 years ago

@ipremuzic Can you have a look at the code in the draft PR and try running it? Does it improve things on your side?

ipremuzic commented 3 years ago

@tanertopal I tried running the memory-improved version of the example; below are two graphs showing available memory during the simulation. I ran both the regular and the improved example inside Docker using the run.sh script.

[Two graphs attached (graph_1000_10_0 5_2, graph_500_40_0 1_2) showing available system memory over the course of each simulation]

The y-axis shows available system memory in megabytes. As you can see, there is a slight improvement, but not much, and the overall trend of rising memory consumption is still present. When running other examples (advanced_tensorflow_example) I see the same behaviour: memory consumption keeps rising. So I don't think the problem is in the code of a particular example.

tanertopal commented 3 years ago

@ipremuzic Was this also running on the following specs?

Ubuntu 18.04.5 on a machine with 48 GB of RAM

Could you provide me with the exact code/setup you used to track memory and plot it? I will try to replicate your setup and use it as a basis when looking for the issue. I'll remove the ML-framework-related code and replace the weights with a NumPy array of similar size; that way we can measure the memory impact of Flower alone.
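
Roughly what I have in mind for isolating Flower's own footprint (a sketch against the NumPyClient interface; the exact method signatures differ slightly between flwr versions):

import flwr as fl
import numpy as np

# A fake "model": a single float32 array of roughly model-like size (~4 MB here).
FAKE_WEIGHTS = [np.random.rand(1000, 1000).astype(np.float32)]


class DummyClient(fl.client.NumPyClient):
    """Client that trains nothing and just echoes NumPy arrays back."""

    def get_parameters(self):
        return FAKE_WEIGHTS

    def fit(self, parameters, config):
        # Pretend we trained on 100 examples and return the unchanged "weights".
        return FAKE_WEIGHTS, 100, {}

    def evaluate(self, parameters, config):
        # Pretend we evaluated on 100 examples with a loss of 0.0.
        return 0.0, 100, {}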

ipremuzic commented 3 years ago

@tanertopal Yes, the specs are still the same as before.

For tracking memory I used free -ms 2 >> mem_usage.log, so the free utility gets called every 2 seconds and appends its output to a file. That gave me the amount of available and used memory during the simulation. I decided to plot available memory, so for easier parsing I deleted all other columns and also deleted the first row. In the end the log looked like this (a sketch of an equivalent way to produce such a log directly follows the excerpt):

   42088

available
    41742

available
    41696
...
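
(An equivalent log containing only the available column could also be produced directly with something like the sketch below; it assumes the default free layout in which "available" is the last column of the "Mem:" line. I did the clean-up by hand instead.)

import subprocess
import time

# Append the available-memory value (in MB) to mem_usage.log every 2 seconds.
with open('mem_usage.log', 'w') as log:
    while True:
        output = subprocess.check_output(['free', '-m'], universal_newlines=True)
        mem_line = next(line for line in output.splitlines() if line.startswith('Mem:'))
        log.write(mem_line.split()[-1] + '\n')  # last column is "available"
        log.flush()
        time.sleep(2)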

After that I parsed the log file in Python, stored the available-memory values in a list, and plotted them with matplotlib:

import matplotlib.pyplot as plt

# Every third line of the cleaned-up log (lines 0, 3, 6, ...) holds an
# available-memory value in MB; the other lines are blanks and headers.
with open('mem_usage.log', 'r') as f:
    available_mem = []
    for i, line in enumerate(f):
        if i % 3 == 0:
            available_mem.append(int(line))

plt.plot(available_mem, 'g', label='available mem')
plt.legend()
plt.show()

tanertopal commented 3 years ago

Perfect, thanks! I'll try to replicate your system, see if I can reproduce your results, and debug from there.