AIforGoodSimulator / model-server

MIT License

Running code using Dask Distributed #136

Closed kariso2000 closed 3 years ago

kariso2000 commented 3 years ago

We will require code changes for us to use distributed servers:

Errors that I have seen during testing:

1 - The worker host needs to have a copy of the ai4good code. We can do this by zipping up the code and sending it to the workers:

>>> from dask.distributed import Client
>>> c = Client('10.0.2.15:8786')
>>> c.upload_file('ai4good.zip')

However, when this is done I get the following error from gunicorn:

2020-11-11 23:45:41,421 - ai4good.webapp.model_runner - ERROR - Model run ('compartmental-model', 'better_hygiene_six_month', 'Moria') failed: ['  File "/home/vagrant/dask-worker-space/dask-worker-space/worker-chwgbw86/ai4good.zip/ai4good/webapp/model_runner.py", line 218, in _sync_run_model\n', '  File "/home/vagrant/dask-worker-space/dask-worker-space/worker-chwgbw86/ai4good.zip/ai4good/models/model_registry.py", line 31, in create_params\n', '  File "/home/vagrant/.local/lib/python3.6/site-packages/typeguard/__init__.py", line 822, in wrapper\n    retval = func(*args, **kwargs)\n', '  File "/home/vagrant/dask-worker-space/dask-worker-space/worker-chwgbw86/ai4good.zip/ai4good/params/param_store.py", line 80, in get_params\n', '  File "/home/vagrant/.local/lib/python3.6/site-packages/typeguard/__init__.py", line 822, in wrapper\n    retval = func(*args, **kwargs)\n', '  File "/home/vagrant/dask-worker-space/dask-worker-space/worker-chwgbw86/ai4good.zip/ai4good/params/param_store.py", line 122, in _read_csv\n', '  File "/home/vagrant/dask-worker-space/dask-worker-space/worker-chwgbw86/ai4good.zip/ai4good/utils/path_utils.py", line 21, in params_path\n', '  File "/home/vagrant/dask-worker-space/dask-worker-space/worker-chwgbw86/ai4good.zip/ai4good/utils/path_utils.py", line 56, in _path\n', '  File "/usr/lib/python3.6/os.py", line 210, in makedirs\n    makedirs(head, mode, exist_ok)\n', '  File "/usr/lib/python3.6/os.py", line 210, in makedirs\n    makedirs(head, mode, exist_ok)\n', '  File "/usr/lib/python3.6/os.py", line 210, in makedirs\n    makedirs(head, mode, exist_ok)\n', '  [Previous line repeated 2 more times]\n', '  File "/usr/lib/python3.6/os.py", line 220, in makedirs\n    mkdir(name, mode)\n']

and the following error on the worker:

2020-11-11 23:45:41,366 - ai4good.webapp.model_runner - INFO - Running compartmental-model model with better_hygiene_six_month profile
distributed.worker - WARNING -  Compute Failed
Function:  _sync_run_model
args:      (<ai4good.runner.facade.Facade object at 0x7f75092e22b0>, 'compartmental-model', 'better_hygiene_six_month', 'Moria')
kwargs:    {}
Exception: NotADirectoryError(20, 'Not a directory')
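
The traceback above ends in `os.makedirs` on a path that runs through `ai4good.zip`. That is consistent with `path_utils` deriving a directory from the module location and then trying to create it inside the uploaded archive, although this reading is an assumption because `path_utils.py` is not shown here. A minimal sketch that reproduces the same exception on a POSIX system, using made-up file and package names:

```python
import os
import tempfile
import zipfile


def try_makedirs_inside_zip() -> str:
    """Try to create a directory 'inside' a zip file and report the exception name."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for the uploaded ai4good.zip.
        zip_path = os.path.join(tmp, "code.zip")
        with zipfile.ZipFile(zip_path, "w") as zf:
            zf.writestr("pkg/__init__.py", "")

        # A module imported from the zip reports a __file__ such as
        # <zip_path>/pkg/utils.py, so any path derived from it passes
        # through the zip, which is a regular file and not a directory.
        target = os.path.join(zip_path, "fs", "params")
        try:
            os.makedirs(target, exist_ok=True)
        except OSError as exc:
            return type(exc).__name__
        return "created"


if __name__ == "__main__":
    print(try_makedirs_inside_zip())
```

On Linux this reports `NotADirectoryError`, the same exception the worker logged.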

2 - daemonic processes are not allowed to have children

gunicorn:

2020-11-11 23:55:17,963 - ai4good.webapp.model_runner - ERROR - Model run ('compartmental-model', 'only_remove_high_risk', 'Moria') failed: ['  File "/home/vagrant/Projects/model-server/ai4good/webapp/model_runner.py", line 221, in _sync_run_model\n    mr = _mdl.run(params)\n', '  File "/home/vagrant/.local/lib/python3.6/site-packages/typeguard/__init__.py", line 822, in wrapper\n    retval = func(*args, **kwargs)\n', '  File "/home/vagrant/Projects/model-server/ai4good/models/cm/cm_model.py", line 30, in run\n    p.control_dict[\'numberOfIterations\'], p.control_dict[\'t_sim\'],  p.control_dict[\'nProcesses\'])\n', '  File "/home/vagrant/Projects/model-server/ai4good/models/cm/simulator.py", line 364, in simulate_over_parameter_range_parallel\n    sols = dask.compute(*lazy_sols)\n', '  File "/home/vagrant/.local/lib/python3.6/site-packages/dask/base.py", line 452, in compute\n    results = schedule(dsk, keys, **kwargs)\n', '  File "/home/vagrant/.local/lib/python3.6/site-packages/dask/multiprocessing.py", line 196, in get\n    pool = context.Pool(num_workers, initializer=initialize_worker_process)\n', '  File "/usr/lib/python3.6/multiprocessing/context.py", line 119, in Pool\n    context=self.get_context())\n', '  File "/usr/lib/python3.6/multiprocessing/pool.py", line 174, in __init__\n    self._repopulate_pool()\n', '  File "/usr/lib/python3.6/multiprocessing/pool.py", line 239, in _repopulate_pool\n    w.start()\n', '  File "/usr/lib/python3.6/multiprocessing/process.py", line 103, in start\n    \'daemonic processes are not allowed to have children\'\n']

worker:

2020-11-11 23:55:17,475 - ai4good.webapp.model_runner - INFO - Running compartmental-model model with only_remove_high_risk profile
2020-11-11 23:55:17,591 - ai4good.webapp.model_runner - INFO - Running model for camp Moria
2020-11-11 23:55:17,592 - ai4good.models.cm.simulator - INFO - Running parallel simulation with 4 processes
distributed.worker - WARNING -  Compute Failed
Function:  _sync_run_model
args:      (<ai4good.runner.facade.Facade object at 0x7f2f8a62e3c8>, 'compartmental-model', 'only_remove_high_risk', 'Moria')
kwargs:    {}
Exception: AssertionError('daemonic processes are not allowed to have children',)

kariso2000 commented 3 years ago

Item 2 can be fixed by updating ai4good/models/cm/simulator.py and changing:

        with dask.config.set(scheduler='processes', num_workers=n_processes):

to

        with dask.config.set(scheduler='single-threaded', num_workers=1):

However, we need to understand the impact of this change.
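
One way to limit that impact is to fall back to the single-threaded scheduler only when the code runs inside a daemonic process (which is what the assertion in the worker log reports) and to keep the process pool everywhere else. A minimal sketch using only the standard library; `local_scheduler_settings` and its `daemonic` argument are hypothetical names, not part of ai4good or Dask:

```python
import multiprocessing
from typing import Optional


def local_scheduler_settings(n_processes: int, daemonic: Optional[bool] = None) -> dict:
    """Choose keyword arguments for dask.config.set based on the current process.

    `daemonic` can be passed explicitly for testing; by default it is read
    from the process that calls this function.
    """
    if daemonic is None:
        daemonic = multiprocessing.current_process().daemon
    if daemonic:
        # A daemonic process may not start children, so a process pool
        # would fail with the AssertionError shown in the logs.
        return {"scheduler": "single-threaded", "num_workers": 1}
    return {"scheduler": "processes", "num_workers": n_processes}
```

In simulator.py this would be used as `with dask.config.set(**local_scheduler_settings(n_processes)):`, so local runs keep their parallelism while runs on a daemonic worker avoid the error.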

kariso2000 commented 3 years ago

@pardf @billlyzhaoyh - Do you know who can assist with item 1?

billlyzhaoyh commented 3 years ago

That was my fix for item 2 as well, but I don't know whether it will slow down the compute. For item 1, it looks like either the CSV param files are not copied across correctly, or path utils is not pointing to the right folders to look for parameters.
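
If the CSV files are shipped inside the zip next to the code, one option is to read them through the import system instead of building file system paths. The sketch below uses `pkgutil.get_data` from the standard library, which also works for packages imported from a zip archive. The package name `demo_params_pkg`, the file `params.csv`, and the helper names are invented for illustration; this is not how `param_store.py` currently reads its files.

```python
import csv
import io
import os
import pkgutil
import sys
import tempfile
import zipfile


def read_packaged_csv(package: str, resource: str) -> list:
    """Return the rows of a CSV file shipped inside `package`, zipped or not."""
    data = pkgutil.get_data(package, resource)
    if data is None:
        raise FileNotFoundError(f"{resource} not found in package {package}")
    return list(csv.reader(io.StringIO(data.decode("utf-8"))))


def demo() -> list:
    """Build a throwaway zip with a package and a CSV, then read the CSV from it."""
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = os.path.join(tmp, "demo_code.zip")
        with zipfile.ZipFile(zip_path, "w") as zf:
            zf.writestr("demo_params_pkg/__init__.py", "")
            zf.writestr("demo_params_pkg/params.csv", "name,value\nbeta,0.5\n")
        # Putting the zip on sys.path makes its packages importable.
        sys.path.insert(0, zip_path)
        try:
            return read_packaged_csv("demo_params_pkg", "params.csv")
        finally:
            sys.path.remove(zip_path)


if __name__ == "__main__":
    print(demo())
```

This only covers reading; anything the model writes would still need a real directory or an external store.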

pardf commented 3 years ago

@kariso2000 For item 1, the correct directory of the file needs to be specified using path_util, as discussed in Slack.

pardf commented 3 years ago

@kariso2000 For item 2, we should be running at least Python 3.7 because this was required for another process. Could you check that is the case for the workers as well please? @billlyzhaoyh The change substantially slowed down the running of CM models.

@kariso2000 @billlyzhaoyh we could try this: https://stackoverflow.com/questions/6974695/python-process-pool-non-daemonic

kariso2000 commented 3 years ago

> @kariso2000 For item 2, we should be running at least Python 3.7 because this was required for another process. Could you check that is the case for the workers as well please? @billlyzhaoyh The change substantially slowed down the running of CM models.
>
> @kariso2000 @billlyzhaoyh we could try this: https://stackoverflow.com/questions/6974695/python-process-pool-non-daemonic

Yes, the non-daemonic code (multithread vs multiprocess) would work. Let's first see how Dask Distributed performs, and we can keep this as a future performance enhancement. I believe we also need to chunk our model processing to make the best use of Dask Distributed.

kariso2000 commented 3 years ago

> @kariso2000 For item 1, the correct directory of the file needs to be specified using path_util, as discussed in Slack.

We need to remove all reading from and writing to the file system. Ideally this should be pushed into the database or Redis.

kariso2000 commented 3 years ago

I've created two issues so we can pick up at a later date.

kariso2000 commented 3 years ago

@billlyzhaoyh @pardf closing