Closed: DavdGao closed this issue 2 years ago
Thanks for letting us know. @dywsjtu Can you please check it for now? I will double check it later today if needed.
Also, I checked the log file and found that one client (index 403) started to train but never finished. I can only find the following log entry:
(05-26) 14:13:06 INFO [client.py:17] Start to train (CLIENT: 403) ...
but client 403 never reported completion or failure.
You can first add an if-condition before that line, checking that len(client_data) > 0, and otherwise break out of the loop. Let us know whether it solves the problem.
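A minimal sketch of that guard, assuming the loop lives in the client-side training routine; the function and variable names here are illustrative, not the actual FedScale code:

```python
def run_client_training(client_data, train_step, local_steps):
    """Hypothetical client training loop; names are illustrative only."""
    if len(client_data) == 0:
        # No usable training data for this client: skip training
        # instead of entering a loop that can never complete its steps.
        return 0

    completed = 0
    for batch in client_data:
        train_step(batch)
        completed += 1
        if completed >= local_steps:
            break
    return completed
```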
We will track down the bug and fix it later.
Hello. We tried to reproduce these issues, but our experiment ran fine. Please pull the latest version and try again following this conf.yml. :)
@fanlai0990 Thanks, I figured out what was wrong. Since I set filter_less
to 0 and FedScale drops the last incomplete batch by default during training, some clients may have fewer than 20 training samples (less than the batch size), so they never produce a batch and never exit the training loop. Therefore the training process is blocked.
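For anyone hitting the same thing, a small self-contained example of the underlying PyTorch behaviour (not FedScale code): with drop_last=True, a dataset smaller than the batch size yields zero batches, so a step-counting loop never advances.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# A tiny client dataset: 5 samples, while the batch size is 20.
dataset = TensorDataset(torch.randn(5, 3), torch.randint(0, 10, (5,)))
loader = DataLoader(dataset, batch_size=20, drop_last=True)

print(len(loader))   # 0 -- the single incomplete batch is dropped
steps = 0
for _ in loader:     # loop body never runs
    steps += 1
print(steps)         # 0, so "train for N local steps" can never be reached
```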
Maybe add some hints about the relation between batch_size and filter_less? ~
I'm trying to run the FEMNIST example with a two-layer Conv network on an Ubuntu server with 8 GPUs. The following is my yaml:
FedScale gets blocked during training, so I added some logging in aggregator.py to monitor the running status, as follows:
and I obtained the following logs:
It seems like the server only gets 99 models from the clients and keeps waiting for the missing one. I guess it may be related to the reported BrokenPipeError? So what can I do to deal with it?
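To illustrate my guess (a simplified, hypothetical sketch, not the actual aggregator.py logic): if round completion is gated on receiving updates from every selected client, one client that crashes without reporting back stalls the whole round.

```python
def aggregate_round(expected_clients, receive_update):
    """Hypothetical barrier-style aggregation; illustrative only."""
    updates = []
    while len(updates) < expected_clients:
        # Blocks here forever if one client (e.g. client 403) dies with a
        # BrokenPipeError and never sends its model update back.
        updates.append(receive_update())
    return updates
```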