Communication_Manager & Optimizer

kiddyboots216 commented 6 years ago

This is a big one. Tons of changes. @georgymh rewrote communication_manager and fedavgoptimizer pretty much entirely Updated Scheduler and Runner for DMLResults flow Communication and Averaging jobs are currently nonfunctional; I'm going to make separate PRs for those The biggest test is the first one that runs in tests/test_communication_manager because...well, just take a look. I basically can't unit-test, not sure why, initializing the model will hang.

georgymh commented 6 years ago

FYI, I'm making some changes right now but I'm unable to test locally because my local python environment is broken. This means I may fail some tests. I'm trying to fix my environment while I'm making the changes but right now I'm solely relying on Travis to help me test until it's fixed.

georgymh commented 6 years ago

One more thing:

We need to test the correctness of the DMLResult returned by the Runners for each job type

georgymh commented 6 years ago

Also, running the Scheduler's cron job on the main thread may cause unexpected behavior, and we can't instantiate new threads during tests. For the Communication Manager tests, let's manually run the next job by using the proper routine in the Scheduler.

kiddyboots216 commented 6 years ago

I'm not able to replicate the desired behavior using scheduler.runners_run_next_job() and I'm not convinced that multiprocessing works in the Scheduler since (a) there's a decent amount of documentation suggesting that proper multiprocessing for Keras looks different than what we have and (b) There are no tests for multiprocessing, so I'd argue that the behavior we see from running the Scheduler's cron job is more of "what we expect" then just iteratively calling scheduler.runners_run_next_job() with variable-length sleep() in-between.

kiddyboots216 commented 6 years ago

Optimizer is tested in the communication manager tests -separating out the tests is not really worth the effort; it's an incredibly simple class. Addressed all other requests by @georgymh

georgymh commented 6 years ago

@kiddyboots216: I disagree that the optimizer is an "incredibly simple class." I think the Dataset Manager, Scheduler, and Runner are lot less complex than the Optimizer and they all have tests. Also, what you have are integration tests between the Communication Manager and the Optimizer. We need unit tests.

Some ideas of what to unit test:

kickoff returns what's expected
all different types of asks (individually, IOW, touching all possible level 1 and level 2 combinations) return what's expected

georgymh commented 5 years ago

Yes: writing the tests for the optimizers, writing a "train but not initialize" test for the communication manager, possibly breaking up RawEventTypes further for clarity, and updating docstrings if necessary.

kiddyboots216 commented 5 years ago

No "train but not initialize" test for communication manager (communication manager should never have any say over that logic; it relies on the optimizer, and the optimizer's kickoff will overwrite the job_type of the job it gets to INIT, then replace the weights in the job that it got -this is to allow the optimizer to receive initialization info from ANY node not just the Notebook, good for p2p down the road etc.) Added tests for the optimizers Don't think RawEventTypes needs to be broken up until after Gateway PR is merged and we see what the Enums look like with MessageTypes in

georgymh commented 5 years ago

Will take a look and decide on those later.

georgymh commented 5 years ago

Alright, I made the changes I was going to make (reflected in the Trello card "Finalize PR #26).

There are some commented code I left because I'm not sure if @kiddyboots216 backed it up and I didn't want to make him redo his work.

@neeleshdodda44, please review and I will make any necessary changes ASAP.

georgymh / decentralized-ml

Communication_Manager & Optimizer #26