Open Martin-Stevlik opened 3 years ago
Hi,
Thank you for trying IBM FL. Please specify the public IP for both the aggregator and the parties, and make sure the ip+port of each machine is accessible from the other. Depending on your setup, you may need to open the port on the machines. Otherwise, the party and the aggregator may not be able to establish a connection.
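The reachability check described above can be scripted instead of done by hand. The following is a minimal sketch using only Python's standard `socket` module; the helper name `can_connect` is hypothetical and not part of IBM FL. It only confirms that a TCP connection can be opened in one direction, so it would need to be run from the party towards the aggregator and from the aggregator towards the party.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, calling `can_connect("<aggregator public IP>", 8085)` on the party machine, and the equivalent call with the party's IP and port on the aggregator machine, shows whether both directions are open.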
Thank you for your fast reply. I made sure that all ports are available. If I put public IP into aggregator file this error comes out:
All parties' ip+port should be accessible to the aggregator. For the aggregator, you can try changing the network settings of your VM to allow incoming/outgoing connections to the public ip+port. If your parties are not on the same network, how are they able to connect to the private IP of the Azure VM?
I have no idea :D But I did some experiments. I created the aggregator on the VM with IP 0.0.0.0, then tried to connect to it from my PC, which has the public IP 78.99.213.115. If I put that public IP in the party configuration file, it fails. So I did this: As you can see, I connected to the aggregator and it recognized my PC's public IP (red line), but when I use the training method it is still trying to establish a connection with host 0.0.0.0 (blue line).
@chalianwar any idea?
Using 0.0.0.0 is not the issue here. The problem is that the aggregator is not able to reach the party. In your case, the party is able to reach the aggregator using the ip+port when it registers. However, the reverse direction is not working: when the aggregator tries to reach the party to send a training request, it cannot establish a connection. Where is your party running? Is it also on Azure? Is it on the same network? How did you make your party IP public?
Please note, if your party does not have a static IP, the aggregator will not be able to connect to the party.
Well, my aggregator runs on a VM and the party on my PC. I just checked the public IP of my PC on this website: https://whatismyipaddress.com/
Okay, so I have tried different VMs from Microsoft Azure and Amazon Web Services. I even set up port forwarding rules on my router so that ports 8085 and 8086 are forwarded to my PC. Still the same issue. Then I tried running the aggregator on one VM and connecting to it from another VM as a party, but the same thing happened during training ...
@chalianwar Unfortunately, I have faced a similar issue. I tried to run the aggregator and a party on different virtual machines on Google Cloud. However, after specifying the appropriate external (public) IPs in the configurations, Python throws the error 'OSError: [Errno 99] Cannot assign requested address'. Then I tried using 0.0.0.0 instead of the public IP of the current machine, and it worked. But during training the aggregator cannot properly parse the party's response, concludes that no response was received, and then stops on timeout. The problem is with using 0.0.0.0 as the IP, because with two locally connected laptops everything works fine. Please help with this issue. Thank you.
P.S. I checked public IPs with telnet and they are accessible.
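A possible explanation for the 'Cannot assign requested address' error: the operating system refuses to bind a listening socket to an address that is not assigned to one of the machine's own network interfaces, and on cloud VMs the external IP is commonly mapped to the VM by the provider's network rather than configured on the VM itself. Whether that applies to a given VM is an assumption to verify. The sketch below uses only the standard `socket` module (the helper name `can_bind` is hypothetical, not an IBM FL function) to test which addresses the local machine can bind to.

```python
import socket


def can_bind(ip: str, port: int = 0) -> bool:
    """Return True if a TCP socket can be bound to ip:port on this machine.

    port=0 asks the operating system to pick any free port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((ip, port))
            return True
        except OSError:
            # Raised, for example, when the address is not assigned
            # to any local interface (Errno 99 on Linux).
            return False
```

If `can_bind("0.0.0.0")` succeeds while `can_bind("<external IP of the VM>")` fails, the external IP is not a local address, which would be consistent with the behaviour reported above.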
@olegov99 Thanks for trying IBMFL. We will investigate the issue and get back to you soon.
I can confirm what @olegov99 has reported.
In the config file for the aggregator, set the IP to 0.0.0.0, and then run the aggregator script on a machine with a public IP. Use this public IP address in the config file for the party as the IP of the aggregator, while keeping 0.0.0.0 as the IP of the party. Everything works up until training begins, after which the server stops on timeout.
I wonder if there is an update on this?
@XinyiYS Hi Michael, to get it working I did a quick hack:
For every party, specify its external IP after registration.
After all parties are registered, do the following for the aggregator: where PARTIES_IP_LIST contains all parties' external IPs, and the last line contains the aggregator's external IP.
Hope it helps you.
@olegov99 Thank you, I will give it a look!
By the way, did you manage to successfully run the entire operation, including training and evaluation? I ran into some trouble with copying the stdout{}.txt files from remote to local: the remote did not seem to create the stdout{}.txt files at all, and it raised an error. But this is a separate issue.
@XinyiYS @olegov99 Any news on this issue? Did you manage to run a truly distributed setup, or did you give up eventually?
@ladi-pomsar Unfortunately, I didn't manage to completely resolve the issue and went with an alternative framework as the solution. Nevertheless, many thanks to @olegov99 for looking into this and providing helpful guides.
@ladi-pomsar Yes, I ran a completely distributed setup. Try the following hack https://github.com/IBM/federated-learning-lib/issues/53#issuecomment-886659388
@XinyiYS Thank you for your answer. @olegov99 Thank you, I will look at it.
The problem, I think, is that the party IP is used for two distinct purposes:
So when running the party on a separate network, the values for these two roles cannot be the same, but there is only one parameter, so one of the two uses will fail.
@chalianwar Can this be changed to a bug and maybe fixed? It has been open for a while and seems to have been confirmed as a bug by several users. Thank you.
We will add an option to provide two IPs for virtual setups in the next release.
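To illustrate the point, here is a minimal sketch of what keeping the two roles separate could look like. It assumes the two roles are the address the party listens on and the address the aggregator uses to call the party back. All names (`PartyEndpoint`, `bind_ip`, `advertised_ip`) are hypothetical and are not IBM FL configuration keys; the IP and port values are the ones mentioned earlier in this thread.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PartyEndpoint:
    """Hypothetical party address settings with the two roles kept apart."""

    bind_ip: str        # address the party's server socket listens on
    advertised_ip: str  # address reported to the aggregator for callbacks
    port: int

    def listen_address(self) -> Tuple[str, int]:
        """Address to pass to the local server when it starts listening."""
        return (self.bind_ip, self.port)

    def callback_address(self) -> Tuple[str, int]:
        """Address the aggregator should use to reach this party."""
        return (self.advertised_ip, self.port)


# Listen on all local interfaces, but advertise the externally visible IP.
party = PartyEndpoint(bind_ip="0.0.0.0", advertised_ip="78.99.213.115", port=8086)
```

With a single parameter, both `listen_address()` and `callback_address()` would have to return the same IP, which is the conflict described above.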
Hello everyone,
I have created a virtual machine with a public IP (40.85.164.96) and a private IP (10.1.1.4). I am trying to run the aggregator on this VM. The problem appears when I connect to it from my party PC. When I put the private IP into the party configuration file, it gives me an error.
When I switch to the public IP it runs okay, but the IP address of the party machine is not the same, which leads to a problem in the training stage. Any idea what I'm doing wrong?
PS: I am using federated-learning-lib-1.0.2