Implement large simulation

EricDinging commented 1 year ago

[x] Improve interface
[x] Support multiple tasks, by implementing task queue
[x] Edit client monitor
[x] Segregate simulation time and actual training
[x] Logging
[x] Define trace
[x] Optimize training
[x] Multiple executor
[x] Refine constraints, define fields, remove metric
[x] Run experiments
[x] add accuracy plot
[x] Client id assignment, fixed after first check in
[x] Containerize
[x] Redis interface wrapper
[x] Testing on multiple clients and aggregate
[x] Finished job not deleted
[x] Containerize evaluation
[x] Edit logging
[x] Fix multi executor bug, training freeze
[x] Fix communication break
[x] Fix FIFO score
[ ] add multi alg plot
[ ] Implement tier based client selection
[ ] Add process clean
[ ] Tier based selection

EricDinging commented 1 year ago

@AmberLJC I have started running the experiment. The config is listed here. Several issues appear:

Issues with Redis. In the code listed below, the score field query seems not working after running the system for 30 minutes. For example, if I typed @score: [0, 0], it should return all the job with score 0. However, even there are such jobs, I did not get any results. I tried redis-cli, to directly query the database without Python, and it was the same issue. The other field, such as demand, works fine. I have to use a workaround to achieve the FIFO algorithm. I have not experienced such issue before, but it seems to happen almost every time in the simulation.

https://github.com/EricDinging/Propius/blob/99f0079d4537c59518da1e18545586b3eb644a6a/propius/scheduler/sc_db_portal.py#L115

Job parameter server to job manager connection is not very stable. Parameter server should communicate with job manager at every round, which means the connection time is very long (days). I have two runs that resulted in broken connections between parameter server and job manager after running for several hours, and basically the whole simulation had to stop. Later I included the logic that tries to reconnect after a broken connection, but it didn't work very well. I think I might need to connect and disconnect on a round basis instead of during the whole job lifetime.
The training time is long compared to the scheduling time (~10 times). And the overlap of two job's scheduling period is very rare. I somehow think that scheduling algorithm's effect might not be significant in terms of improving JCT. The influence of scheduling algorithm would increase if we have more jobs in parallel.

EricDinging commented 1 year ago

In the last run, I managed to get some data before the connection failed. Job 0 finished 219 rounds, job 1 finished 77, job 2 finished 150, job 3 finished 54, job 3 finished 133. tta0 tta1 tta2 tta3 tta4

Job round time data: 0.csv 1.csv 2.csv 3.csv 4.csv

EricDinging commented 1 year ago

I will continue to improve the fault tolerance of the system, and tried to get a complete run.

And I'm also implementing multiple GPU processes to boost training and cut simulation time for later run. Will implement myself instead of using FedScale executor since the latter supports only one job.

EricDinging commented 1 year ago

Regarding https://github.com/EricDinging/Propius/issues/8#issuecomment-1681442824.

I think it might be caused by the small memory size of the node
The system works fine if I connect and disconnect the connection between job server and job manager on a round basis.
I will work out a client trace that is limited in size to increase contention between jobs. Current FedScale trace is too large I think. BTW I'm using a trace that dispatch 10 client every second. The new one would be similar to FedScale, but smaller I think.

EricDinging commented 1 year ago

Wierdly, the pytorch training backend got stuck after running for a while. It has happend several times. Some goole search tells me that it might be the issue with num_loaders > 1, causing deadlocks. At the same time I believe it is not a my code's deadlock. I'll set num_loaders = 1 in the future.

AmberLJC commented 1 year ago

The time-acc plot is a great progress, Good job!

Just come back, what is the status now?

As for losing connection, we can either do some heartbeat or reconnect every time as you said. Is the connection used for asking for resources, btw?
As for the relationship between training and scheduling time, it should be fine as it's a problem with the settings. For the situation where resource is not enough, scheduling time would matter. So we just need a sensitivity analysis for resource demand and supply.
Redis problem seems strange. have you fix that? Any new problems?

AmberLJC commented 1 year ago

Regarding #8 (comment).

I think it might be caused by the small memory size of the node

The system works fine if I connect and disconnect the connection between job server and job manager on a round basis.

I will work out a client trace that is limited in size to increase contention between jobs. Current FedScale trace is too large I think. BTW I'm using a trace that dispatch 10 client every second. The new one would be similar to FedScale, but smaller I think.

Why would your process need a lot of memory (or CPU memory?) that caused the disconnection?

EricDinging commented 1 year ago

I'm trying to run two experiments at once on v100 node, one FIFO, and one IRS. Experienced several issues.

Connetion loss I think I have solved it, basically making connections on a round basis between jobs, which is asking for resources, and job manager. Heartbeat is implemented within Propius system (between job manager and scheduler, load balancer and client manager). After several long run, I think on the system-side everything is working fine.
I haven't solve the Redis problem. It only happens when I use check in time as score (FIFO). In IRS, the score field works fine. In FIFO, I tried to adjust the score field value to be smaller - (check in time - system start time), but it doesn't work. Maybe it's because of the negative value. Right now I just use the time field instead of the score field.
Previously, I log connections and data in the process memory for plotting after the run in the system, which indeed creates a lot of overhead, so I just switch off. The system works better (not having disconnections). I use logging instead to offload the data to disk, and I can use that to do analysis afterwards. I think for later I can use open telemetry to do this kind of stuff.
The biggest problem I've been facing is the process stuck during training. I'm using basically the same FedScale backend code. The worker process simply stuck after this line https://github.com/EricDinging/Propius/blob/948bfa82092072db33dcb2a1fea46fe7c5c9b839/evaluation/single_executor/worker.py#L112 after running several clients (100-1000). It seems it isn't the num_loaders issue. I actually implemented the multi-worker code, offloading training tasks from different jobs different rounds different clients to multiple GPU processes, and I've been experiencing the issue. At first I thought it might be some deadlocks, so I change to the previous working single-worker code and tries to run it, however still having the same issue. Given that I changed the environment a little bit (different pytorch+cuda after reinstall anaconda to /data, running in /data), I think it might be because of the environment? BTW have you faced the issue when running FedScale @AmberLJC?

EricDinging commented 1 year ago

So currently the issue is mainly the training part. I do think about building a Propius simulator to run everything in a event queue, but I haven't solve the training issue so I don't want to move forward. The next step for me I think would be

Figure out the bug
Containerize everything so that multiple experiments can run on same node better

I'm preparing for GRE for tommorrow, and am going to help pickup new DDers at DTW after that. I'll probably start working on Tuesday.

AmberLJC commented 1 year ago

This is good, but please explain the current hearbeat later when we meet in person.
Though it's unclear why there is problem with sorting by time, you can just maintain a counter, and increment it once a job check in.
you can use async logging to minimize the interruption of logging. btw, what does it mean: 'log connections and data in the process memory'.
After this line means model.train() caused the problem? It's better to print out the error to figure it out. It could be problems like 'drop data in last batch' or 'not enough training data' (if you do some filtering by data size.) I didn't encounter your problem in fedscale.

The plan sounds good. Please take your time!

EricDinging commented 1 year ago

"log connections and data in the process memory": Counting the number of connections per second, tracking the number of online jobs / clients etc

The trouble is the program doesn't print out anything lol (I have a try except block and will record the error caught), simply stuck there.

I now move the program to ~/ and it seems to work fine so far (3h passed)? Maybe there is some system restrictions when in /data and have to become root. Anyway I think this time the space is enough as I have moved anaconda package to /data.

Finishing this experiment is soo hard.. Anyway it does give me some opportunities to observe the system running for a long time, and I improved it here and there.

AmberLJC commented 1 year ago

Yes, that's why building systems take time longer than expected. And when is your GRE test? we can find a time to discuss all issues this week.

I think for the pytorch problem, it could be pytorch version problem if you don't use the latest release.
When does the code stucking happen usually. You can probably print something inside each batch here?
what changes did you made in dataloader?
I find you migrate a lot of fedscale code, it's hard to make sure every operation is correct.
I will try to run on A40 to test it as well, to make sure it's not the problem of gpu node. Can you send me a updated script to run the system and update the code when you have time?

EricDinging commented 1 year ago

@AmberLJC I'm facing issues with mobilenet testing. There is a testing phase when every job begins, every 10 rounds. The first mobilenet testing is fine. However, for the later tests, the testing loss is NaN

023-09-10 21:10:08,987 - INFO - Worker 0: Job 60006: testing complete, {'test_loss': 81.0982, 'acc': 0.0229, 'acc_5': 0.112, 'test_len': 393}===
2023-09-10 21:10:08,988 - INFO - Worker 0: executing job 60006 model_test, Client 336
2023-09-10 21:10:09,147 - INFO - Worker 0: Job 60006: testing complete, {'test_loss': 77.7806, 'acc': 0.0122, 'acc_5': 0.1102, 'test_len': 245}===
2023-09-10 21:10:09,147 - INFO - Worker 0: executing job 60006 model_test, Client 340
2023-09-10 21:10:09,246 - INFO - Worker 0: Job 60006: testing complete, {'test_loss': 77.3838, 'acc': 0.0, 'acc_5': 0.02, 'test_len': 150}===
2023-09-10 21:10:09,246 - INFO - Worker 0: executing job 60006 model_test, Client 344
2023-09-10 21:10:09,359 - INFO - Worker 0: Job 60006: testing complete, {'test_loss': 80.2498, 'acc': 0.0, 'acc_5': 0.0514, 'test_len': 175}===
2023-09-10 21:10:09,359 - INFO - Worker 0: executing job 60006 model_test, Client 348
2023-09-10 21:10:09,620 - INFO - Worker 0: Job 60006: testing complete, {'test_loss': 79.5947, 'acc': 0.0099, 'acc_5': 0.0765, 'test_len': 405}===
2023-09-10 21:10:09,620 - INFO - Worker 0: executing job 60006 model_test, Client 352
2023-09-10 21:10:09,728 - INFO - Worker 0: Job 60006: testing complete, {'test_loss': 77.4984, 'acc': 0.0059, 'acc_5': 0.0651, 'test_len': 169}===

The above is the first test phase. Here I use port number to uniquely identify jobs.

2023-09-10 21:37:44,443 - INFO - Worker 0: executing job 60006 model_test, Client 1
2023-09-10 21:37:44,542 - INFO - Worker 0: Job 60006: testing complete, {'test_loss': nan, 'acc': 0.0066, 'acc_5': 0.1788, 'test_len': 151}===
2023-09-10 21:37:44,543 - INFO - Worker 0: executing job 60006 model_test, Client 5
2023-09-10 21:37:44,764 - INFO - Worker 0: Job 60006: testing complete, {'test_loss': nan, 'acc': 0.0141, 'acc_5': 0.1469, 'test_len': 354}===
2023-09-10 21:37:44,764 - INFO - Worker 0: executing job 60006 model_test, Client 9
2023-09-10 21:37:44,870 - INFO - Worker 0: Job 60006: testing complete, {'test_loss': nan, 'acc': 0.0062, 'acc_5': 0.1429, 'test_len': 161}===
2023-09-10 21:37:44,870 - INFO - Worker 0: executing job 60006 model_test, Client 13
2023-09-10 21:37:44,979 - INFO - Worker 0: Job 60006: testing complete, {'test_loss': nan, 'acc': 0.0059, 'acc_5': 0.1657, 'test_len': 169}===
2023-09-10 21:37:44,980 - INFO - Worker 0: executing job 60006 model_test, Client 17

And here is what's going on after dozens of rounds.

https://github.com/EricDinging/Propius/blob/5921fa87922d330571b41c45fff700910a6235bd/evaluation/executor/worker.py#L289

This is where I think the problem is. I'm confused that the accuracy calculation seems to be fine,

AmberLJC commented 1 year ago

can you check the training loss is decreasing?

EricDinging commented 1 year ago

The testing accuracy is not increasing after 30 rounds. This is the testing results. I cannot get the aggregated results as there is an error with NaN

2023-09-11 00:26:16,064 - INFO - Worker 0: executing job 60007 model_test, Client 323
2023-09-11 00:26:16,188 - INFO - Worker 0: Job 60007: testing complete, {'test_loss': nan, 'acc': 0.0055, 'acc_5': 0.1585, 'test_len': 183}===
2023-09-11 00:26:16,188 - INFO - Worker 0: executing job 60007 model_test, Client 327
2023-09-11 00:26:16,426 - INFO - Worker 0: Job 60007: testing complete, {'test_loss': nan, 'acc': 0.0322, 'acc_5': 0.1394, 'test_len': 373}===
2023-09-11 00:26:16,426 - INFO - Worker 0: executing job 60007 model_test, Client 331
2023-09-11 00:26:16,529 - INFO - Worker 0: Job 60007: testing complete, {'test_loss': nan, 'acc': 0.0126, 'acc_5': 0.1509, 'test_len': 159}===
2023-09-11 00:26:16,530 - INFO - Worker 0: executing job 60007 model_test, Client 335
2023-09-11 00:26:16,644 - INFO - Worker 0: Job 60007: testing complete, {'test_loss': nan, 'acc': 0.0112, 'acc_5': 0.1685, 'test_len': 178}===
2023-09-11 00:26:16,644 - INFO - Worker 0: executing job 60007 model_test, Client 339
2023-09-11 00:26:16,755 - INFO - Worker 0: Job 60007: testing complete, {'test_loss': nan, 'acc': 0.0057, 'acc_5': 0.1667, 'test_len': 174}===
2023-09-11 00:26:16,756 - INFO - Worker 0: executing job 60007 model_test, Client 343
2023-09-11 00:26:16,977 - INFO - Worker 0: Job 60007: testing complete, {'test_loss': nan, 'acc': 0.0171, 'acc_5': 0.1538, 'test_len': 351}===
2023-09-11 00:26:16,977 - INFO - Worker 0: executing job 60007 model_test, Client 347
2023-09-11 00:26:17,080 - INFO - Worker 0: Job 60007: testing complete, {'test_loss': nan, 'acc': 0.0063, 'acc_5': 0.1375, 'test_len': 160}===
2023-09-11 00:26:17,080 - INFO - Worker 0: executing job 60007 model_test, Client 351
2023-09-11 00:26:17,226 - INFO - Worker 0: Job 60007: testing complete, {'test_loss': nan, 'acc': 0.0178, 'acc_5': 0.1689, 'test_len': 225}===
2023-09-11 00:26:17,226 - INFO - Worker 0: executing job 60007 model_test, Client 355
2023-09-11 00:26:17,326 - INFO - Worker 0: Job 60007: testing complete, {'test_loss': nan, 'acc': 0.0067, 'acc_5': 0.1467, 'test_len': 150}===

AmberLJC commented 1 year ago

Can you get the average training loss at the end of each training round? And is it decreasing?

EricDinging commented 1 year ago

023-09-10 20:27:55,361 - INFO - Job 60001 round 12 {'client_train24881': {'moving_loss': 1.524259606000738, 'trained_size': 200}}
2023-09-10 21:17:16,979 - INFO - Job 60007 round 1 {'client_train9837': {'moving_loss': 5.950031767558956, 'trained_size': 200}}
2023-09-10 21:17:16,979 - INFO - Job 60007 round 1 {'client_train44171': {'moving_loss': 3.120353504942049, 'trained_size': 200}}
2023-09-10 21:17:16,979 - INFO - Job 60007 round 1 {'client_train27294': {'moving_loss': 4.439653188450645, 'trained_size': 200}}
2023-09-10 21:17:16,979 - INFO - Job 60007 round 1 {'client_train86099': {'moving_loss': 5.798359791438423, 'trained_size': 200}}
2023-09-10 21:17:16,980 - INFO - Job 60007 round 1 {'client_train9837': {'moving_loss': 3.7063413617744665, 'trained_size': 200}}
2023-09-10 21:17:16,980 - INFO - Job 60007 round 1 {'client_train14671': {'moving_loss': 5.872559078056919, 'trained_size': 200}}
2023-09-10 21:17:16,980 - INFO - Job 60007 round 1 {'client_train44430': {'moving_loss': 4.022340729383544, 'trained_size': 200}}
2023-09-10 21:17:16,980 - INFO - Job 60007 round 1 {'client_train52709': {'moving_loss': 3.698265441245853, 'trained_size': 200}}
2023-09-10 21:17:16,980 - INFO - Job 60007 round 1 {'client_train48909': {'moving_loss': 3.3655428537235963, 'trained_size': 200}}
2023-09-10 21:17:16,980 - INFO - Job 60007 round 1 {'client_train62221': {'moving_loss': 3.9616769535789396, 'trained_size': 200}}
2023-09-10 21:17:16,980 - INFO - Job 60007 round 1 {'client_train37247': {'moving_loss': 5.873217484589201, 'trained_size': 200}}
2023-09-10 21:17:16,980 - INFO - Job 60007 round 1 {'client_train63369': {'moving_loss': 6.577374515656859, 'trained_size': 200}}
2023-09-10 21:17:16,980 - INFO - Job 60007 round 1 {'client_train3238': {'moving_loss': 6.7519126976683514, 'trained_size': 200}}

2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train15328': {'moving_loss': 2.009097845049702, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train64210': {'moving_loss': 0.5159911560113171, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train55455': {'moving_loss': 1.6376401827038753, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train38088': {'moving_loss': 1.5693888594544652, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train64947': {'moving_loss': 1.1833783856620863, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train53781': {'moving_loss': 1.6690839965491728, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train19048': {'moving_loss': 1.069482520588956, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train19318': {'moving_loss': 1.660914343551939, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train14515': {'moving_loss': 0.5842370117612977, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train101200': {'moving_loss': 1.4200070821559012, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train23230': {'moving_loss': 1.3875898191328593, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train23230': {'moving_loss': 2.0840670098925993, 'trained_size': 200}}
2023-09-11 00:34:56,709 - INFO - Job 60007 round 32 {'client_train20027': {'moving_loss': 1.2784322341827243, 'trained_size': 200}}

EricDinging commented 1 year ago

The training loss is decreasing. Sry I do not have the aggregated one.

The NaN only happens in the mobilenet cases. Resnet18 is all good.

EricDinging commented 1 year ago

BTW do you think the accuracy increase is fast enough for the resnet18 cases? This is job 0.

round,test_loss,acc,acc_5,test_len
0,0.3579431726275213,7.033724284760453e-05,0.0004562340165534967,79379
10,0.13006014815001452,0.002547527683644287,0.003919217929175221,79379
20,0.09549642726665747,0.0029992038196500355,0.0041731339523047705,79379
30,0.08133092379596615,0.00319560211138966,0.004259828166139662,79379
40,0.07295411506821711,0.003310587183008102,0.004299620806510539,79379
50,0.06718411418637174,0.0033631035916300307,0.004327121782839296,79379

Its setting

# Training and testing aggregator setting
demand: 50
total_round: 500
over_selection: 1.3

# Training and testing client setting
engine: pytorch
model: resnet18
dataset: femnist
learning_rate: 0.05
num_loaders: 4
local_steps: 10
loss_decay: 0.95
batch_size: 20

gradient_policy: fed-avg

# Client constraints
public_constraint:
    cpu_f: 8
    ram: 6
    fp16_mem: 800
    android_os: 8
private_constraint:
    dataset_size: 150

AmberLJC commented 1 year ago

Thanks for the additional info. I doubt it's a model-specific bug, I will take a deeper look the accuracy is increasing too slow, but the training loss seems fine

can you double check your config is the same in fedscale? For example: loss_decay is 0.2? (minor: local step=5 is enough https://github.com/SymbioticLab/FedScale/blob/faab2832de4d8e32d39c379cc3cd7999992f8dd3/fedscale/cloud/config_parser.py#L79

AmberLJC commented 1 year ago

wait, your config is resnet, and the training for resnet is fine

EricDinging commented 1 year ago

I suddenly realize something after finding that the individual testing accuracy is actually good:

2023-09-11 00:50:47,110 - INFO - Worker 0: executing job 60000 model_test, Client 330
2023-09-11 00:50:47,198 - INFO - Worker 0: Job 60000: testing complete, {'test_loss': 12.1196, 'acc': 0.8198, 'acc_5': 0.9767, 'test_len': 172}===
2023-09-11 00:50:47,199 - INFO - Worker 0: executing job 60000 model_test, Client 334
2023-09-11 00:50:47,295 - INFO - Worker 0: Job 60000: testing complete, {'test_loss': 8.5938, 'acc': 0.8182, 'acc_5': 1.0, 'test_len': 187}===
2023-09-11 00:50:47,296 - INFO - Worker 0: executing job 60000 model_test, Client 338
2023-09-11 00:50:47,382 - INFO - Worker 0: Job 60000: testing complete, {'test_loss': 10.968, 'acc': 0.7831, 'acc_5': 0.9819, 'test_len': 166}===
2023-09-11 00:50:47,382 - INFO - Worker 0: executing job 60000 model_test, Client 342
2023-09-11 00:50:47,471 - INFO - Worker 0: Job 60000: testing complete, {'test_loss': 8.7946, 'acc': 0.8382, 'acc_5': 0.9769, 'test_len': 173}===
2023-09-11 00:50:47,471 - INFO - Worker 0: executing job 60000 model_test, Client 346
2023-09-11 00:50:47,549 - INFO - Worker 0: Job 60000: testing complete, {'test_loss': 26.9884, 'acc': 0.5906, 'acc_5': 0.8792, 'test_len': 149}===
2023-09-11 00:50:47,549 - INFO - Worker 0: executing job 60000 model_test, Client 350
2023-09-11 00:50:47,630 - INFO - Worker 0: Job 60000: testing complete, {'test_loss': 4.5499, 'acc': 0.925, 'acc_5': 0.9875, 'test_len': 160}===
2023-09-11 00:50:47,630 - INFO - Worker 0: executing job 60000 model_test, Client 354
2023-09-11 00:50:47,774 - INFO - Worker 0: Job 60000: testing complete, {'test_loss': 11.4617, 'acc': 0.7936, 'acc_5': 0.9929, 'test_len': 281}===

I made a mistake in the testing result aggregator, dividing the sum of accuracy by test_len instead of number of clients.

But I think FedScale has the problem as well (took from it literally):

https://github.com/SymbioticLab/FedScale/blob/faab2832de4d8e32d39c379cc3cd7999992f8dd3/fedscale/cloud/aggregation/aggregator.py#L486C5-L486C5

Here the accuracy is divided by test_len. I guess that's why the testing result seems to be not increasing in my PR? Because there is a large denominator?

https://github.com/SymbioticLab/FedScale/pull/236

EricDinging commented 1 year ago

wait, your config is resnet, and the training for resnet is fine

Yes the config is resnet. The testing is fine for resnet, but not for mobilenet. The training of both resnet and mobilenet are fine.

I think the only issue now is the testing loss NaN for mobilenet.

EricDinging commented 1 year ago

Do you think we should ditch mobilenet for now and only run resnet? It might take some time to find the bugs in mobilenet. And also lower the demand a little bit? Like 10, 50, 100?

Though it does not look good for our multi-model system by running one type of job, I do want to get the plots as early as possible and make improvements later.

AmberLJC commented 1 year ago

why is the bug (devided by test_len) only apply to mobilenet?

And the test acc was right in fedscale, so i took some time look into it

EricDinging commented 1 year ago

There are two separate issues going on. One is NaN in mobilenet testing. The other is the low accuracy for the working resnet testing (and for other models as well). I think this is due to the test_len.

AmberLJC commented 1 year ago

but test_acc can still be used, right? I can see acc_5 is increasing

EricDinging commented 1 year ago

Yep, need to do some math afterwards.

So many things are happening at the same time... One bug fixed another came to mind...

EricDinging / Propius

Implement large simulation #8