Open EricDinging opened 1 year ago
@AmberLJC I have started running the experiment. The config is listed here. Several issues appear:
- The `score` field query seems not to work after running the system for 30 minutes. For example, if I type `@score: [0, 0]`, it should return all the jobs with score 0. However, even when there are such jobs, I do not get any results. I tried `redis-cli` to query the database directly without Python, and it was the same issue. Other fields, such as `demand`, work fine. I had to use a workaround to achieve the FIFO algorithm. I have not experienced such an issue before, but it seems to happen almost every time in the simulation. (See the query sketch below.)
- I will continue to improve the fault tolerance of the system and try to get a complete run.
- I'm also implementing multiple GPU processes to speed up training and cut simulation time for later runs. I will implement this myself instead of using the FedScale executor, since the latter supports only one job.
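On the `score` query issue above, this is roughly the kind of query that fails; a minimal sketch with redis-py, assuming a RediSearch index named `job` with `score` as a NUMERIC field (the index name here is a hypothetical placeholder):

```python
import redis
from redis.commands.search.query import Query

r = redis.Redis(host="localhost", port=6379)

# Equality match on the numeric `score` field: RediSearch range syntax is
# "@score:[min max]", so [0 0] should return every job whose score is 0.
res = r.ft("job").search(Query("@score:[0 0]"))
print(res.total, [doc.id for doc in res.docs])
```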
Regarding https://github.com/EricDinging/Propius/issues/8#issuecomment-1681442824.
Weirdly, the PyTorch training backend got stuck after running for a while. It has happened several times. Some Google searching tells me that it might be an issue with `num_loaders > 1` causing deadlocks. At the same time, I believe it is not a deadlock in my own code. I'll set `num_loaders = 1` in the future.
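In PyTorch terms, the workaround amounts to falling back to single-process (or single-worker) data loading; a minimal sketch with a dummy dataset standing in for the client partition (all names here are hypothetical, not the actual executor code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in dataset; in the executor this would be the client's partition.
train_dataset = TensorDataset(torch.randn(200, 1, 28, 28), torch.randint(0, 62, (200,)))

# num_workers=0 loads batches in the main process, which rules out
# worker-process deadlocks entirely; num_workers=1 keeps a single worker.
train_loader = DataLoader(
    train_dataset,
    batch_size=20,
    shuffle=True,
    num_workers=0,  # was > 1 when the backend got stuck
)
```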
The time-accuracy plot is great progress. Good job!
Just came back; what is the status now?
Regarding #8 (comment).
- I think it might be caused by the small memory size of the node.
- The system works fine if I connect and disconnect the connection between the job server and the job manager on a per-round basis.
- I will work out a client trace that is limited in size to increase contention between jobs. The current FedScale trace is too large, I think. BTW, I'm using a trace that dispatches 10 clients every second. The new one would be similar to FedScale's, but smaller (a sketch of what I mean follows this list).
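A hypothetical sketch of such a trace generator, keeping the 10-clients-per-second arrival rate; the duration, dataset-size range, field names, and file name are assumptions, not the final trace format:

```python
import json
import random

def make_trace(rate: int = 10, duration: int = 600, seed: int = 0):
    """Generate a small synthetic client trace: `rate` clients check in every
    second for `duration` seconds, each with a random dataset size, so that
    jobs have to contend for a limited client pool."""
    random.seed(seed)
    trace, client_id = [], 0
    for t in range(duration):
        for _ in range(rate):
            trace.append({
                "client_id": client_id,
                "check_in_time": t,
                "dataset_size": random.randint(150, 500),
            })
            client_id += 1
    return trace

with open("small_trace.json", "w") as f:
    json.dump(make_trace(), f)
```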
Why would your process need so much memory (or CPU memory?) that it caused the disconnection?
I'm trying to run two experiments at once on the v100 node, one FIFO and one IRS. I experienced several issues:
- The `score` field issue again: I set the score to `- (check in time - system start time)`, but it doesn't work; maybe it's because of the negative value. Right now I just use the time field instead of the score field. (A possible workaround is sketched after this list.)
- I plan to log connections and data in the process memory, and maybe use open telemetry to do this kind of stuff.
- The `num_loaders` issue. I actually implemented the multi-worker code, offloading training tasks from different jobs, different rounds, and different clients to multiple GPU processes, and I've been experiencing the issue. At first I thought it might be some deadlock, so I changed back to the previously working single-worker code and tried to run it, but I still hit the same issue. Given that I changed the environment a little bit (different pytorch+cuda after reinstalling anaconda to /data, running in /data), I think it might be because of the environment? BTW, have you faced this issue when running FedScale @AmberLJC? So currently the issue is mainly the training part. I have thought about building a Propius simulator to run everything in an event queue, but I haven't solved the training issue, so I don't want to move forward yet; the next step for me, I think, would be to sort out the training problem first.
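On the first point, one possible workaround (a sketch, not tested against this setup): store the non-negative elapsed check-in time as the `score` and sort ascending, again assuming a hypothetical RediSearch index named `job` with a SORTABLE NUMERIC `score` field:

```python
import time
import redis
from redis.commands.search.query import Query

r = redis.Redis(host="localhost", port=6379)
system_start_time = time.time()  # hypothetical; would come from the job manager

# Store a non-negative score (elapsed seconds at check-in) under a hypothetical key...
elapsed = int(time.time() - system_start_time)
r.hset("job:60006", mapping={"score": elapsed, "demand": 50})

# ...and let the query sort ascending instead of negating the value,
# so the earliest check-in still comes first (FIFO).
q = Query("*").sort_by("score", asc=True).paging(0, 1)
oldest = r.ft("job").search(q)
```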
I'm preparing for the GRE tomorrow, and I'm going to help pick up new DDers at DTW after that. I'll probably start working on Tuesday.
Did `model.train()` cause the problem? It's better to print out the error to figure it out. It could be a problem like 'drop data in last batch' or 'not enough training data' (if you do some filtering by data size). I didn't encounter your problem in FedScale. The plan sounds good. Please take your time!
"log connections and data in the process memory": Counting the number of connections per second, tracking the number of online jobs / clients etc
The trouble is that the program doesn't print out anything lol (I have a try/except block and will record any error caught); it simply gets stuck there.
I have now moved the program to `~/`, and it seems to work fine so far (3h have passed)? Maybe there are some system restrictions when running in `/data` where you have to become root. Anyway, I think this time the space is enough, as I have moved the anaconda packages to `/data`.
Finishing this experiment is so hard... Anyway, it does give me some opportunities to observe the system running for a long time, and I have improved it here and there.
Yes, that's why building systems takes longer than expected. When is your GRE test? We can find a time to discuss all the issues this week.
@AmberLJC I'm facing issues with mobilenet testing. There is a testing phase when every job begins and then every 10 rounds. The first mobilenet test is fine. However, for the later tests, the testing loss is NaN:
2023-09-10 21:10:08,987 - INFO - Worker 0: Job 60006: testing complete, {'test_loss': 81.0982, 'acc': 0.0229, 'acc_5': 0.112, 'test_len': 393}===
2023-09-10 21:10:08,988 - INFO - Worker 0: executing job 60006 model_test, Client 336
2023-09-10 21:10:09,147 - INFO - Worker 0: Job 60006: testing complete, {'test_loss': 77.7806, 'acc': 0.0122, 'acc_5': 0.1102, 'test_len': 245}===
2023-09-10 21:10:09,147 - INFO - Worker 0: executing job 60006 model_test, Client 340
2023-09-10 21:10:09,246 - INFO - Worker 0: Job 60006: testing complete, {'test_loss': 77.3838, 'acc': 0.0, 'acc_5': 0.02, 'test_len': 150}===
2023-09-10 21:10:09,246 - INFO - Worker 0: executing job 60006 model_test, Client 344
2023-09-10 21:10:09,359 - INFO - Worker 0: Job 60006: testing complete, {'test_loss': 80.2498, 'acc': 0.0, 'acc_5': 0.0514, 'test_len': 175}===
2023-09-10 21:10:09,359 - INFO - Worker 0: executing job 60006 model_test, Client 348
2023-09-10 21:10:09,620 - INFO - Worker 0: Job 60006: testing complete, {'test_loss': 79.5947, 'acc': 0.0099, 'acc_5': 0.0765, 'test_len': 405}===
2023-09-10 21:10:09,620 - INFO - Worker 0: executing job 60006 model_test, Client 352
2023-09-10 21:10:09,728 - INFO - Worker 0: Job 60006: testing complete, {'test_loss': 77.4984, 'acc': 0.0059, 'acc_5': 0.0651, 'test_len': 169}===
The above is the first test phase. Here I use the port number to uniquely identify jobs.
2023-09-10 21:37:44,443 - INFO - Worker 0: executing job 60006 model_test, Client 1
2023-09-10 21:37:44,542 - INFO - Worker 0: Job 60006: testing complete, {'test_loss': nan, 'acc': 0.0066, 'acc_5': 0.1788, 'test_len': 151}===
2023-09-10 21:37:44,543 - INFO - Worker 0: executing job 60006 model_test, Client 5
2023-09-10 21:37:44,764 - INFO - Worker 0: Job 60006: testing complete, {'test_loss': nan, 'acc': 0.0141, 'acc_5': 0.1469, 'test_len': 354}===
2023-09-10 21:37:44,764 - INFO - Worker 0: executing job 60006 model_test, Client 9
2023-09-10 21:37:44,870 - INFO - Worker 0: Job 60006: testing complete, {'test_loss': nan, 'acc': 0.0062, 'acc_5': 0.1429, 'test_len': 161}===
2023-09-10 21:37:44,870 - INFO - Worker 0: executing job 60006 model_test, Client 13
2023-09-10 21:37:44,979 - INFO - Worker 0: Job 60006: testing complete, {'test_loss': nan, 'acc': 0.0059, 'acc_5': 0.1657, 'test_len': 169}===
2023-09-10 21:37:44,980 - INFO - Worker 0: executing job 60006 model_test, Client 17
And here is what's going on after dozens of rounds.
This is where I think the problem is. I'm confused because the accuracy calculation seems to be fine.
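To narrow down where the NaN comes from, a debugging sketch along these lines might help; the function and loader names are hypothetical, not the actual executor code:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def debug_test(model, test_loader, device="cuda"):
    """Report the first test batch whose loss becomes NaN/Inf, and whether the
    model outputs themselves already contain NaN (e.g. exploded weights) or
    the loss computation is at fault."""
    model.eval()
    for i, (data, target) in enumerate(test_loader):
        data, target = data.to(device), target.to(device)
        output = model(data)
        loss = F.cross_entropy(output, target, reduction="sum")
        if torch.isnan(loss) or torch.isinf(loss):
            print(
                f"batch {i}: loss={loss.item()}, "
                f"output range=({output.min().item():.3g}, {output.max().item():.3g}), "
                f"NaN in output={torch.isnan(output).any().item()}"
            )
            break
```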
Can you check whether the training loss is decreasing?
The testing accuracy is not increasing after 30 rounds. These are the testing results. I cannot get the aggregated results, as there is an error caused by the NaN values:
2023-09-11 00:26:16,064 - INFO - Worker 0: executing job 60007 model_test, Client 323
2023-09-11 00:26:16,188 - INFO - Worker 0: Job 60007: testing complete, {'test_loss': nan, 'acc': 0.0055, 'acc_5': 0.1585, 'test_len': 183}===
2023-09-11 00:26:16,188 - INFO - Worker 0: executing job 60007 model_test, Client 327
2023-09-11 00:26:16,426 - INFO - Worker 0: Job 60007: testing complete, {'test_loss': nan, 'acc': 0.0322, 'acc_5': 0.1394, 'test_len': 373}===
2023-09-11 00:26:16,426 - INFO - Worker 0: executing job 60007 model_test, Client 331
2023-09-11 00:26:16,529 - INFO - Worker 0: Job 60007: testing complete, {'test_loss': nan, 'acc': 0.0126, 'acc_5': 0.1509, 'test_len': 159}===
2023-09-11 00:26:16,530 - INFO - Worker 0: executing job 60007 model_test, Client 335
2023-09-11 00:26:16,644 - INFO - Worker 0: Job 60007: testing complete, {'test_loss': nan, 'acc': 0.0112, 'acc_5': 0.1685, 'test_len': 178}===
2023-09-11 00:26:16,644 - INFO - Worker 0: executing job 60007 model_test, Client 339
2023-09-11 00:26:16,755 - INFO - Worker 0: Job 60007: testing complete, {'test_loss': nan, 'acc': 0.0057, 'acc_5': 0.1667, 'test_len': 174}===
2023-09-11 00:26:16,756 - INFO - Worker 0: executing job 60007 model_test, Client 343
2023-09-11 00:26:16,977 - INFO - Worker 0: Job 60007: testing complete, {'test_loss': nan, 'acc': 0.0171, 'acc_5': 0.1538, 'test_len': 351}===
2023-09-11 00:26:16,977 - INFO - Worker 0: executing job 60007 model_test, Client 347
2023-09-11 00:26:17,080 - INFO - Worker 0: Job 60007: testing complete, {'test_loss': nan, 'acc': 0.0063, 'acc_5': 0.1375, 'test_len': 160}===
2023-09-11 00:26:17,080 - INFO - Worker 0: executing job 60007 model_test, Client 351
2023-09-11 00:26:17,226 - INFO - Worker 0: Job 60007: testing complete, {'test_loss': nan, 'acc': 0.0178, 'acc_5': 0.1689, 'test_len': 225}===
2023-09-11 00:26:17,226 - INFO - Worker 0: executing job 60007 model_test, Client 355
2023-09-11 00:26:17,326 - INFO - Worker 0: Job 60007: testing complete, {'test_loss': nan, 'acc': 0.0067, 'acc_5': 0.1467, 'test_len': 150}===
Can you get the average training loss at the end of each training round? And is it decreasing?
2023-09-10 20:27:55,361 - INFO - Job 60001 round 12 {'client_train24881': {'moving_loss': 1.524259606000738, 'trained_size': 200}}
2023-09-10 21:17:16,979 - INFO - Job 60007 round 1 {'client_train9837': {'moving_loss': 5.950031767558956, 'trained_size': 200}}
2023-09-10 21:17:16,979 - INFO - Job 60007 round 1 {'client_train44171': {'moving_loss': 3.120353504942049, 'trained_size': 200}}
2023-09-10 21:17:16,979 - INFO - Job 60007 round 1 {'client_train27294': {'moving_loss': 4.439653188450645, 'trained_size': 200}}
2023-09-10 21:17:16,979 - INFO - Job 60007 round 1 {'client_train86099': {'moving_loss': 5.798359791438423, 'trained_size': 200}}
2023-09-10 21:17:16,980 - INFO - Job 60007 round 1 {'client_train9837': {'moving_loss': 3.7063413617744665, 'trained_size': 200}}
2023-09-10 21:17:16,980 - INFO - Job 60007 round 1 {'client_train14671': {'moving_loss': 5.872559078056919, 'trained_size': 200}}
2023-09-10 21:17:16,980 - INFO - Job 60007 round 1 {'client_train44430': {'moving_loss': 4.022340729383544, 'trained_size': 200}}
2023-09-10 21:17:16,980 - INFO - Job 60007 round 1 {'client_train52709': {'moving_loss': 3.698265441245853, 'trained_size': 200}}
2023-09-10 21:17:16,980 - INFO - Job 60007 round 1 {'client_train48909': {'moving_loss': 3.3655428537235963, 'trained_size': 200}}
2023-09-10 21:17:16,980 - INFO - Job 60007 round 1 {'client_train62221': {'moving_loss': 3.9616769535789396, 'trained_size': 200}}
2023-09-10 21:17:16,980 - INFO - Job 60007 round 1 {'client_train37247': {'moving_loss': 5.873217484589201, 'trained_size': 200}}
2023-09-10 21:17:16,980 - INFO - Job 60007 round 1 {'client_train63369': {'moving_loss': 6.577374515656859, 'trained_size': 200}}
2023-09-10 21:17:16,980 - INFO - Job 60007 round 1 {'client_train3238': {'moving_loss': 6.7519126976683514, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train15328': {'moving_loss': 2.009097845049702, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train64210': {'moving_loss': 0.5159911560113171, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train55455': {'moving_loss': 1.6376401827038753, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train38088': {'moving_loss': 1.5693888594544652, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train64947': {'moving_loss': 1.1833783856620863, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train53781': {'moving_loss': 1.6690839965491728, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train19048': {'moving_loss': 1.069482520588956, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train19318': {'moving_loss': 1.660914343551939, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train14515': {'moving_loss': 0.5842370117612977, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train101200': {'moving_loss': 1.4200070821559012, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train23230': {'moving_loss': 1.3875898191328593, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train23230': {'moving_loss': 2.0840670098925993, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train20027': {'moving_loss': 1.2784322341827243, 'trained_size': 200}}
The training loss is decreasing. Sorry, I do not have the aggregated one.
The NaN only happens in the mobilenet cases. Resnet18 is all good.
BTW, do you think the accuracy is increasing fast enough for the resnet18 cases? This is job 0.
round,test_loss,acc,acc_5,test_len
0,0.3579431726275213,7.033724284760453e-05,0.0004562340165534967,79379
10,0.13006014815001452,0.002547527683644287,0.003919217929175221,79379
20,0.09549642726665747,0.0029992038196500355,0.0041731339523047705,79379
30,0.08133092379596615,0.00319560211138966,0.004259828166139662,79379
40,0.07295411506821711,0.003310587183008102,0.004299620806510539,79379
50,0.06718411418637174,0.0033631035916300307,0.004327121782839296,79379
Its setting:
# Training and testing aggregator setting
demand: 50
total_round: 500
over_selection: 1.3
# Training and testing client setting
engine: pytorch
model: resnet18
dataset: femnist
learning_rate: 0.05
num_loaders: 4
local_steps: 10
loss_decay: 0.95
batch_size: 20
gradient_policy: fed-avg
# Client constraints
public_constraint:
  cpu_f: 8
  ram: 6
  fp16_mem: 800
  android_os: 8
private_constraint:
  dataset_size: 150
Thanks for the additional info. I doubt it's a model-specific bug; I will take a deeper look. The accuracy is increasing too slowly, but the training loss seems fine.
Can you double-check that your config is the same as in FedScale? For example, loss_decay is 0.2? (Minor: local_steps = 5 is enough: https://github.com/SymbioticLab/FedScale/blob/faab2832de4d8e32d39c379cc3cd7999992f8dd3/fedscale/cloud/config_parser.py#L79)
Wait, your config is resnet, and the training for resnet is fine.
I suddenly realized something after finding that the individual testing accuracy is actually good:
2023-09-11 00:50:47,110 - INFO - Worker 0: executing job 60000 model_test, Client 330
2023-09-11 00:50:47,198 - INFO - Worker 0: Job 60000: testing complete, {'test_loss': 12.1196, 'acc': 0.8198, 'acc_5': 0.9767, 'test_len': 172}===
2023-09-11 00:50:47,199 - INFO - Worker 0: executing job 60000 model_test, Client 334
2023-09-11 00:50:47,295 - INFO - Worker 0: Job 60000: testing complete, {'test_loss': 8.5938, 'acc': 0.8182, 'acc_5': 1.0, 'test_len': 187}===
2023-09-11 00:50:47,296 - INFO - Worker 0: executing job 60000 model_test, Client 338
2023-09-11 00:50:47,382 - INFO - Worker 0: Job 60000: testing complete, {'test_loss': 10.968, 'acc': 0.7831, 'acc_5': 0.9819, 'test_len': 166}===
2023-09-11 00:50:47,382 - INFO - Worker 0: executing job 60000 model_test, Client 342
2023-09-11 00:50:47,471 - INFO - Worker 0: Job 60000: testing complete, {'test_loss': 8.7946, 'acc': 0.8382, 'acc_5': 0.9769, 'test_len': 173}===
2023-09-11 00:50:47,471 - INFO - Worker 0: executing job 60000 model_test, Client 346
2023-09-11 00:50:47,549 - INFO - Worker 0: Job 60000: testing complete, {'test_loss': 26.9884, 'acc': 0.5906, 'acc_5': 0.8792, 'test_len': 149}===
2023-09-11 00:50:47,549 - INFO - Worker 0: executing job 60000 model_test, Client 350
2023-09-11 00:50:47,630 - INFO - Worker 0: Job 60000: testing complete, {'test_loss': 4.5499, 'acc': 0.925, 'acc_5': 0.9875, 'test_len': 160}===
2023-09-11 00:50:47,630 - INFO - Worker 0: executing job 60000 model_test, Client 354
2023-09-11 00:50:47,774 - INFO - Worker 0: Job 60000: testing complete, {'test_loss': 11.4617, 'acc': 0.7936, 'acc_5': 0.9929, 'test_len': 281}===
I made a mistake in the testing result aggregator, dividing the sum of per-client accuracies by `test_len` instead of by the number of clients.
But I think FedScale has the same problem (I took the code from it literally):
Here the accuracy is divided by `test_len`. I guess that's why the testing result seems not to be increasing in my PR? Because there is a large denominator?
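A minimal sketch of a correct aggregation, assuming each client reports a pair `(acc, test_len)` (the names are hypothetical, not the actual aggregator code): weighting each client's accuracy by its number of test samples gives an overall accuracy in the expected 0-1 range.

```python
def aggregate_accuracy(results):
    """Sample-weighted average of per-client accuracies.

    `results` is a list of (acc, test_len) pairs. Multiplying each accuracy by
    its test_len recovers the client's number of correct predictions, so
    dividing the total correct count by the total number of test samples gives
    the overall accuracy (instead of dividing a sum of fractions by test_len).
    """
    total_correct = sum(acc * n for acc, n in results)
    total_samples = sum(n for _, n in results)
    return total_correct / total_samples

# Using a few of the per-client numbers from the resnet log above:
print(aggregate_accuracy([(0.8198, 172), (0.8182, 187), (0.7831, 166)]))  # ~0.81
```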
> wait, your config is resnet, and the training for resnet is fine
Yes, the config is resnet. The testing is fine for resnet but not for mobilenet. The training of both resnet and mobilenet is fine.
I think the only issue now is the NaN testing loss for mobilenet.
Do you think we should ditch mobilenet for now and only run resnet? It might take some time to find the bugs in mobilenet. And should we also lower the demand a little bit, like 10, 50, 100?
Though running only one type of job does not look great for our multi-model system, I do want to get the plots as early as possible and make improvements later.
Why does the bug (dividing by test_len) only apply to mobilenet?
And the test acc was right in FedScale, so I took some time to look into it.
There are two separate issues going on. One is the NaN in mobilenet testing. The other is the low reported accuracy for the working resnet testing (and for other models as well); I think this is due to dividing by test_len.
But test_acc can still be used, right? I can see acc_5 is increasing.
Yep, we just need to do some math afterwards (see the sketch below).
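Concretely, given the aggregator bug above, the logged accuracy is sum(acc_i) / test_len, so a post-hoc correction just rescales it. A sketch; the number of reporting clients here is a hypothetical placeholder that would come from the logs:

```python
def corrected_acc(logged_acc: float, total_test_len: int, num_clients: int) -> float:
    """Undo the wrong denominator: recover sum(acc_i), then divide by the
    number of clients to get the plain mean of per-client accuracies."""
    return logged_acc * total_test_len / num_clients

# e.g. job 0, round 50: logged acc 0.0033631 over 79379 test samples,
# assuming (hypothetically) ~500 reporting test clients
print(corrected_acc(0.0033631, 79379, 500))  # ~0.53
```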
So many things are happening at the same time... Fix one bug, and another comes to mind...