Closed Yunzhen-Liu closed 2 years ago
Can you please help when you are free? @dywsjtu Thanks.
@Yunzhen-Liu Which version of the code base are you using right now? Your forked repo or the main repo? Can you tell me the specific commit? I can't reproduce your bug on my side with the current up-to-date master branch.
I think the code base is just the current one, since I ran the experiment on CloudLab and downloaded the code with "git clone https://github.com/SymbioticLab/FedScale.git".
Could you maybe run the experiment again on "CloudLab FedScale 240-g5; 1 node" with the conf.yml I provided? Maybe there are some environment issues there.
I assume it's a data loader issue. Try setting num_loaders to 0 and see whether it works. BTW, are you currently running experiments on CloudLab?
Yes. I'm running on CloudLab to confirm the existence of the bug again, haha.
OK I'll try
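The suggestion above could look roughly like this in conf.yml. This is only a sketch: it assumes the job_conf list layout used in the FedScale example configs, and that num_loaders is passed through as the PyTorch DataLoader num_workers value, so 0 would load data in the main process instead of worker subprocesses. Other keys in the file stay as they are.

```yaml
# Sketch only: the surrounding layout is an assumption based on the
# FedScale example configs; keep the rest of your conf.yml unchanged.
job_conf:
    # 0 = no separate data loader worker processes
    # (assumed to map to DataLoader num_workers)
    - num_loaders: 0
```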
Tell me when you finish and I can take over the resource. Just want to reproduce it with the same amount of system resources.
Oh, ok. I can terminate it now.
Terminated
LOL, you just need to kill your fedscale jobs (i.e. the running jobs) no need to terminate the whole experiment 😅. I can't access your server now.
...... AH OK. I misunderstood.
Hmm, I think the bug shows up every time I use the "CloudLab FedScale 240-g5; 1 node" environment, so you can start an experiment and the bug should appear.
No worries, I will create a new one.
Or I can start an experiment again if you want, haha.
Are you on the Slack? You can DM me, btw.
Sure
These should be part of a documentation page where we discuss FedScale config parameters and how to set them.
What happened + What you expected to happen
The training process sometimes crashes unexpectedly after the model evaluation (testing on the testing set).
Versions / Dependencies
OS: Linux (CloudLab FedScale 240-g5; 1 node)
FedScale, Python, CUDA, etc.: installed by "install.sh --cuda" provided by FedScale.
Reproduction script
The conf.yml file I used: conf.yml.zip
Issue Severity
Low: It annoys or frustrates me.