IBM / federated-learning-lib

A library for federated learning (a distributed machine learning process) in an enterprise environment.

Trying to run your framework on Azure Virtual Machine #53

Open Martin-Stevlik opened 3 years ago

Martin-Stevlik commented 3 years ago

Hello everyone,

I have created a virtual machine with a public IP (40.85.164.96) and a private IP (10.1.1.4), and I am trying to run the aggregator on this VM. The problem appears when I connect to it from my party PC. When I put the private IP into the party configuration file, it gives me an error.

[Screenshots: vm_problem4 0, vm_probem4 1]

When I switch to the public IP, it runs okay, but the IP address of the party machine is not the same, and this leads to a problem in the training stage. [Screenshots: vm_problem2, vm_problem3] Any idea what I'm doing wrong?

PS: I am using federated-learning-lib-1.0.2

chalianwar commented 3 years ago

Hi,

Thank you for trying IBM FL. Please specify the public IP for both the aggregator and the parties, and make sure you are able to access the ip+port on both machines. Depending on the setup, you may need to open the port on the machines. Otherwise, the party and the aggregator may not be able to establish a connection.
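A quick way to check whether an ip+port is reachable from the other machine is a plain TCP connection attempt. The sketch below uses only the Python standard library; the helper name `can_connect` is illustrative and is not part of IBM FL.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run it in both directions: from the party against the aggregator's ip+port, and from the aggregator against the party's ip+port, for example `can_connect("40.85.164.96", 5000)`, where the port is a placeholder for whatever your configuration file uses. The target process must already be listening for the check to succeed.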

Martin-Stevlik commented 3 years ago

Thank you for your fast reply. I made sure that all ports are available. If I put the public IP into the aggregator file, this error comes out:

[Screenshot: vm_problem]

chalianwar commented 3 years ago

All parties' ip+port should be accessible to the aggregator. For the aggregator, you can try changing the network settings of your VM to allow incoming/outgoing connections to the public ip+port. If your parties are not on the same network, how are they able to connect to the private IP of the Azure VM?

Martin-Stevlik commented 3 years ago

I have no idea :D But I did some experiments. [Screenshot: ibm_agregator] I created the aggregator on the VM with the 0.0.0.0 IP, then tried to connect to it from my PC, which has the public IP 78.99.213.115. If I put that public IP in the party configuration file, it fails. So I did this: [Screenshot: ibm_party] As you can see, I connected to the aggregator and it recognized my PC's public IP (red line), but when I use the training method it still tries to establish a connection with host 0.0.0.0 (blue line). [Screenshots: ibm_agregator, ibm_agregator2]

Martin-Stevlik commented 3 years ago

@chalianwar any idea?

chalianwar commented 3 years ago

Using 0.0.0.0 is not the issue here. The problem is that the aggregator is not able to reach the party. In your case the party is able to reach the aggregator using the ip+port when it registers. However, the reverse is not working: when the aggregator tries to reach the party to send the training request, it cannot establish a connection. Where is your party running? Is it also on Azure? Is it on the same network? How did you make your party IP public?

Please note, if your party does not have a static IP, the aggregator will not be able to connect to the party.

Martin-Stevlik commented 3 years ago

Well, my aggregator runs on the VM and the party on my PC. I just checked the public IP of my PC on this website: https://whatismyipaddress.com/

Martin-Stevlik commented 3 years ago

Okay, so I have tried different VMs from Microsoft Azure and Amazon Web Services. I even set up port forwarding rules on my router so that ports 8085 and 8086 go to my PC. Still the same issue. Then I tried running the aggregator on one VM and connecting to it from another VM as a party, but the same thing happened during training ...

olegov99 commented 3 years ago

@chalianwar Unfortunately, I have faced a similar issue. I tried to run the aggregator and a party on different virtual machines on Google Cloud. However, after specifying the appropriate external (public) IPs in the configurations, Python throws the error 'OSError: [Errno 99] Cannot assign requested address'. Then I tried using 0.0.0.0 instead of the public IP of the current machine, and it worked. But during training the aggregator cannot properly parse the party's response, treats the response as not received, and then stops by timeout. The problem is with using the 0.0.0.0 IP, because with two locally connected laptops everything works fine. Please help with the issue. Thank you.

P.S. I checked public IPs with telnet and they are accessible.
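For context on the `[Errno 99]` error: a socket can only bind to an address that is assigned to one of the machine's own network interfaces. On cloud VMs the public IP is typically mapped to the instance by NAT and does not appear on any interface, so binding to it fails, while 0.0.0.0 (all local interfaces) succeeds. The sketch below reproduces that behaviour with the standard library only; `try_bind` is a hypothetical helper, not an IBM FL function.

```python
import errno
import socket


def try_bind(host: str, port: int = 0) -> str:
    """Try to bind a TCP socket to host:port and describe the outcome.

    Port 0 asks the operating system to pick any free port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            # On Linux, EADDRNOTAVAIL is errno 99 ("Cannot assign requested address").
            name = errno.errorcode.get(exc.errno, "unknown")
            return f"bind failed: {name}"
        return "bind ok"
```

Calling `try_bind("0.0.0.0")` on the VM should succeed, whereas `try_bind("<the VM's public IP>")` is expected to fail with `EADDRNOTAVAIL` when that address is not configured on a local interface.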

chalianwar commented 3 years ago

@olegov99 Thanks for trying IBMFL. We will investigate the issue and get back to you soon.

XinyiYS commented 3 years ago

I can confirm what @olegov99 reported.

In the config file for the aggregator, set the IP to 0.0.0.0, then run the aggregator script on a machine with a public IP. Use this public IP address in the config file for the party as the IP of the aggregator, while keeping 0.0.0.0 as the IP of the party. Everything works up until training begins, after which the server stops by timeout.

I wonder if there is an update on this?

olegov99 commented 3 years ago

@XinyiYS Hi Michael, In order to let it work I did a quick hack:

  1. For every party, specify its external IP after registering. [Screenshot: image]

  2. After all parties are registered, do the following for the aggregator: [Screenshot: image] where PARTIES_IP_LIST contains all parties' external IPs, and the last line contains the aggregator's external IP.

Hope it helps you.

XinyiYS commented 3 years ago

@olegov99 Thank you, I will give it a look!

By the way, did you manage to successfully run the entire operation, including training and evaluation? I ran into some trouble with copying the stdout{}.txt files from remote to local: the remote did not seem to create them at all, and it raised an error. But this is a separate issue.

ladi-pomsar commented 3 years ago

@XinyiYS @olegov99 Any news on this issue? Did you manage to run a truly distributed setup, or did you give up eventually?

XinyiYS commented 3 years ago

@ladi-pomsar Unfortunately, I didn't manage to completely resolve the issue and went with an alternative framework as the solution. Nevertheless, many thanks to @olegov99 for looking into this and providing helpful guides.

olegov99 commented 3 years ago

@ladi-pomsar Yes, I ran a completely distributed setup. Try the following hack https://github.com/IBM/federated-learning-lib/issues/53#issuecomment-886659388

ladi-pomsar commented 3 years ago

@XinyiYS Thank you for your answer. @olegov99 Thank you, I will look at it.

vital-dhaveloose commented 2 years ago

The problem, I think, is that the party IP is used for two distinct purposes:

So when the party runs on a separate network, the values for these two roles cannot be the same, but there is only one parameter, so one of the two roles will fail.
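Judging from the rest of the thread, the two roles are presumably (1) the address the party's own server binds to and (2) the address the aggregator dials to reach the party; this reading is an assumption. A minimal sketch of keeping them as separate settings follows. `PartyEndpoint`, `listen_host`, and `advertised_host` are hypothetical names for illustration and are not IBM FL configuration keys.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PartyEndpoint:
    """Hypothetical split of the single party IP into its two roles."""

    listen_host: str      # address the party's local server binds to
    advertised_host: str  # address the aggregator should use to call back
    port: int

    def bind_address(self) -> Tuple[str, int]:
        # Used locally when opening the listening socket.
        return (self.listen_host, self.port)

    def callback_address(self) -> Tuple[str, int]:
        # Sent to the aggregator at registration time.
        return (self.advertised_host, self.port)


# Example values taken from earlier in this thread.
party = PartyEndpoint(listen_host="0.0.0.0",
                      advertised_host="78.99.213.115",
                      port=8085)
```

With a split like this, the party can listen on 0.0.0.0 while the aggregator is told to connect to the public address, which avoids both the bind error and the attempt to dial 0.0.0.0.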

ladi-pomsar commented 2 years ago

@chalianwar Can this be relabeled as a bug and maybe fixed? It has been open for a while and seems to be confirmed as a bug by several users. Thank you

chalianwar commented 2 years ago

We will add an option to provide two IPs for virtual setups in the next release.