SymbioticLab / FedScale

FedScale is a scalable and extensible open-source federated learning (FL) platform.
https://fedscale.ai
Apache License 2.0

[<FedScale component: Core|Dataloader|etc...>] #160

Closed: Yunzhen-Liu closed this issue 2 years ago

Yunzhen-Liu commented 2 years ago

What happened + What you expected to happen

The training process sometimes crashes unexpectedly after the model evaluation (testing on the test set). [screenshot of the crash attached]

Versions / Dependencies

OS: Linux (CloudLab FedScale 240-g5; 1 node)
FedScale, Python, CUDA, etc.: installed by the "install.sh --cuda" script provided by FedScale.

Reproduction script

The conf.yml file I used is attached: conf.yml.zip

Issue Severity

Low: It annoys or frustrates me.

fanlai0990 commented 2 years ago

Can you please help when you are free? @dywsjtu Thanks.

dywsjtu commented 2 years ago

@Yunzhen-Liu Which version of the codebase are you using right now: your forked repo, the main repo, or can you tell me the specific commit? I can't reproduce your bug on my side with the current up-to-date master branch.

Yunzhen-Liu commented 2 years ago

I think the codebase is just the current one, since I ran the experiment on CloudLab and downloaded the code with "git clone https://github.com/SymbioticLab/FedScale.git".

Yunzhen-Liu commented 2 years ago

Could you maybe run the experiment again on "CloudLab FedScale 240-g5; 1 node" with the conf.yml I provided? Maybe there are some environment issues there.

dywsjtu commented 2 years ago

I assume it's a data loader issue; try setting num_loaders to 0 and see whether that works. BTW, are you currently running experiments on CloudLab?
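
(For anyone hitting the same crash, a minimal sketch of that change, assuming the list-style job_conf layout used in FedScale's example configs and assuming num_loaders is passed through to the PyTorch DataLoader's num_workers. With 0 workers, data loading runs in the main process, which often sidesteps crashes in worker subprocesses.)

```yaml
# Sketch only: all other entries from the original conf.yml stay unchanged.
job_conf:
  - job_name: femnist        # illustrative job name, not from the issue
  - num_loaders: 0           # 0 = no DataLoader worker subprocesses;
                             # data is loaded in the main process instead
```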

Yunzhen-Liu commented 2 years ago

Yes, I'm running on CloudLab to confirm the bug again, haha.

Yunzhen-Liu commented 2 years ago

> I assume it's a data loader issue; try setting num_loaders to 0 and see whether that works. BTW, are you currently running experiments on CloudLab?

OK, I'll try.

dywsjtu commented 2 years ago

> Yes, I'm running on CloudLab to confirm the bug again, haha.

Tell me when you finish, and I can take over the resources. I just want to reproduce it with the same amount of system resources.

Yunzhen-Liu commented 2 years ago

Oh, ok. I can terminate it now.

Yunzhen-Liu commented 2 years ago

> Oh, ok. I can terminate it now.

Terminated.

dywsjtu commented 2 years ago

LOL, you just need to kill your FedScale jobs (i.e., the running jobs); no need to terminate the whole experiment 😅. I can't access your server now.

Yunzhen-Liu commented 2 years ago

Ah, OK. I misunderstood.

Yunzhen-Liu commented 2 years ago

> LOL, you just need to kill your FedScale jobs (i.e., the running jobs); no need to terminate the whole experiment 😅. I can't access your server now.

Emmm, I think the bug appears every time I try the "CloudLab FedScale 240-g5; 1 node" environment, so you can start an experiment and the bug should show up.

dywsjtu commented 2 years ago

No worries, I will create a new one.

Yunzhen-Liu commented 2 years ago

> Emmm, I think the bug appears every time I try the "CloudLab FedScale 240-g5; 1 node" environment, so you can start an experiment and the bug should show up.

Or I can start an experiment again if you want, haha.

dywsjtu commented 2 years ago

Are you on Slack? You can DM me, BTW.

Yunzhen-Liu commented 2 years ago

> Are you on Slack? You can DM me, BTW.

Sure.

dywsjtu commented 2 years ago

1. Decrease num_loaders
2. Increase timeout
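
For concreteness, a hedged sketch of how those two knobs might look in the conf.yml's job_conf section. The list-style layout follows FedScale's example configs; "timeout" is a stand-in name here, since the exact timeout key can differ across FedScale versions:

```yaml
job_conf:
  - num_loaders: 2    # suggestion 1: decrease (e.g., to 2, or 0 to disable workers)
  - timeout: 300      # suggestion 2: increase; "timeout" is a placeholder for
                      # whichever timeout parameter your FedScale version exposes
```
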
mosharaf commented 2 years ago

These should be part of a documentation page where we discuss FedScale config parameters and how to set them.