IBM / federated-learning-lib

A library for federated learning (a distributed machine learning process) in an enterprise environment.

Monitoring Tool #73

Open fernando080 opened 3 years ago

fernando080 commented 3 years ago

Good morning from Spain. I started using your tool last week, and I would like to know whether there is a way to monitor the training, fusion, and synchronization of the models. Thank you for your work and for your research in this area.

Have a nice day, Fernando...

ch4174nya commented 3 years ago

Hi @fernando080, thank you for trying out the platform! I wanted to clarify your comment. Are you asking if there's a way in the platform to monitor the training process as it happens? If so, I believe the experiment manager dashboard should be able to do that for you. You can go over the usage guide for the same. It allows choosing various (hyper) parameters before a run, running an FL experiment, and monitoring through real-time progress bars.

We're also adding more features to it currently, such as supporting custom datasets.

fernando080 commented 3 years ago

Thank you for your comment and the quick response. I need to elaborate on my previous comment. By monitoring I meant:

1- What is the failure tolerance of the tool? By failure tolerance I mean: if one of the party servers shuts down, does the whole process stop, or do the remaining training processes continue normally? If they continue and only one training run fails, will that cause a problem in the model fusion step?

2- If the training process fails (because of a problem with the data or because of a server shutdown), is it possible to retry the training process automatically after a specified time interval?

3- Does the software provide any logs (feedback) about the nature of the problem that potentially caused those failures, or information about the training process (hyperparameters used during training, warnings, accuracy, p-values of the fundamental hypothesis in LR...)? I know that we can get the metrics with the EVAL command, but the first part of this point is really important for me.

Thank you again, Fernando...

Yi-Zoey commented 3 years ago

Hi @fernando080, we have quorum control in IBM FL where you can specify perc_quorum in the aggregator's config file. See here for detailed explanations. Basically, if the specified quorum is reached, the training process will continue even if not every party replies back. Regarding 2, we don't have auto-restarting for now. Regarding 3, IBM FL will print the error message via logger.
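To make the quorum idea concrete, here is a minimal sketch of how such a check could work. It is not IBM FL's actual implementation: the function name `quorum_reached`, the assumption that `perc_quorum` is a fraction between 0 and 1, and the choice to round the required count up are all illustrative.

```python
import math


def quorum_reached(num_replies: int, num_registered: int, perc_quorum: float) -> bool:
    """Return True when enough parties replied to satisfy the quorum.

    Hypothetical helper: assumes perc_quorum is a fraction in (0, 1] and
    that the required number of replies is rounded up.
    """
    if not 0.0 < perc_quorum <= 1.0:
        raise ValueError("perc_quorum must be in the interval (0, 1]")
    if num_registered <= 0:
        return False
    required = math.ceil(perc_quorum * num_registered)
    return num_replies >= required
```

Under these assumptions, with four registered parties and a quorum of 0.75, a round would proceed once three parties have replied, so a single party shutting down would not block fusion.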

chalianwar commented 3 years ago

Regarding 2, you can manually relaunch the party and have it rejoin the training process by issuing the REGISTER command.
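Since auto-restarting is not built in, one option is to wrap the relaunch step in your own retry loop. The sketch below is generic Python and does not call any IBM FL API: `retry` and its parameters are hypothetical, and `task` stands for whatever callable you use to relaunch the party and issue REGISTER.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    task: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying after a fixed delay when it raises.

    Hypothetical helper: the last failure is re-raised once the
    number of attempts is exhausted.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == attempts:
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay_seconds}s")
            sleep(delay_seconds)
    raise RuntimeError("unreachable")
```

The `sleep` parameter is injectable only so the loop can be exercised without waiting; in normal use the default `time.sleep` is enough.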

fernando080 commented 3 years ago

Thank you for your quick responses; they are very helpful to me.

Sorry to reuse this thread, but I have discovered the UI tool thanks to @ch4174nya. I am trying to run the basic example locally, so I selected the corresponding option, as you can see in the images below.

image image

I created a folder for the staging dir and entered the conda env where I installed the FL tool in the virtual env field. But I do not know exactly which path on my system I have to use for the IBMFL dir. I am currently using ~/Desktop/IBM-FL/federated-learning-lib/examples (Ubuntu Focal). The conf files are generated correctly and everything is saved in the staging_dir.

image

But then I get an error when I continue with the experiment.

image

What am I doing wrong? I think the problem is with the IBMFL dir; is that right? Do I have to specify a different dir? The documentation doesn't make clear which dirs I have to use for staging and IBMFL, so I suppose it must be something simple...

Again, thank you very much for all your help!

ch4174nya commented 3 years ago

Hi @fernando080 , thanks for providing the details on the issue you're facing. About the IBMFL dir, that should point to the project root directory. So that'd be the directory where this repo was cloned -- <some_path_on_your_machine>/federated-learning-lib. The staging directory gets created at runtime if it doesn't exist and it can be anywhere as long as adequate permissions are in place.
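If it helps, a quick way to sanity-check a candidate IBMFL dir before entering it in the dashboard is sketched below. The helper `looks_like_repo_root` is hypothetical and relies on a single assumption taken from this thread: the cloned repo contains an `examples` folder directly under its root.

```python
from pathlib import Path


def looks_like_repo_root(path_text: str) -> bool:
    """Return True if the path is a directory with an 'examples' subfolder.

    Hypothetical check: it only tests for the 'examples' folder mentioned
    in this thread, not for anything else the dashboard may require.
    """
    root = Path(path_text).expanduser()
    return root.is_dir() and (root / "examples").is_dir()
```

Under that assumption, a path ending in federated-learning-lib/examples would fail the check, while its parent directory would pass.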

I tried reproducing this on my end, and if I choose venv (as opposed to conda) on the screen that asks Use conda?, it works fine for me. Would it be possible for you to try using virtual environments for the time being?

I tried using conda but I'd need to check with a colleague of mine on that front. We had tested this using conda but it appears that running conda activate from within scripts isn't as seamless