Open atkinsonm opened 4 years ago
We can expose some metric through CloudWatch to measure this and terminate instances when they are idle. The key would be determining what that metric is. An oversimplified approach would just be to look at low CPU/GPU utilization.
There's no built-in termination policy for low CPU, so what we would need to do is:
I think the easier way might be to piggy-back off https://github.com/MeanPug/folding-together/issues/20 to define a metric based on captured logs, then alert on that metric.
Absolutely. If the CW logs can tell us whether or not the instance is processing a work unit, that will be a definitive answer.
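As a rough sketch of the log-based metric idea, the parameters for a CloudWatch Logs metric filter could look something like the following. The log group name comes from this thread; the filter name, metric name, and namespace are placeholders I made up, and the filter pattern assumes progress lines always contain both "Completed" and "steps". `boto3` is only imported when the filter is actually installed.

```python
def build_metric_filter(log_group="fah-node.txt"):
    """Return keyword arguments for the CloudWatch Logs put_metric_filter call."""
    return {
        "logGroupName": log_group,
        "filterName": "fah-work-unit-progress",  # hypothetical name
        # Two quoted terms separated by a space: both must appear in the event.
        "filterPattern": '"Completed" "steps"',
        "metricTransformations": [
            {
                "metricName": "WorkUnitProgressLines",  # hypothetical name
                "metricNamespace": "FoldingTogether",   # hypothetical namespace
                "metricValue": "1",  # emit 1 for every matching log line
            }
        ],
    }


def install_metric_filter():
    """Create the filter in the current AWS account (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder above has no dependencies

    logs = boto3.client("logs")
    logs.put_metric_filter(**build_metric_filter())
```

An alarm on this metric would probably need to treat missing data as breaching, since an idle instance emits no matching lines at all rather than a zero.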
If an instance is running and not picking up work units, either the F@H client is busted or the upstream is out and/or saturated. We probably want to have some type of cooldown such that if this situation is reached, we will back off for some amount of time and try again later.
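One possible shape for that cooldown is a capped exponential backoff. This is only a sketch: the doubling strategy, the 5-minute base, and the 6-hour cap are all assumptions for illustration, not anything decided in this thread.

```python
def next_retry_delay(failed_attempts, base_seconds=300, cap_seconds=6 * 3600):
    """Seconds to wait before starting instances again.

    failed_attempts counts consecutive launches that ended without picking
    up a work unit. The delay doubles with each failure up to cap_seconds.
    """
    if failed_attempts < 1:
        return 0  # no failures yet, so no cooldown
    return min(cap_seconds, base_seconds * 2 ** (failed_attempts - 1))
```

With these defaults the first failure waits 5 minutes, the second 10, and so on until the cap is reached. The counter would reset to zero once an instance reports progress on a work unit.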
I don't think it will be difficult. Here's a tail of some of the recent logs as pushed to the fah-node.txt log group:
Processing
15:38:16
15:38:10:WU01:FS01:0x22:Completed 900000 out of 2000000 steps (45%)
15:39:38
15:39:32:WU01:FS01:0x22:Completed 920000 out of 2000000 steps (46%)
15:40:28
15:40:22:WU00:FS00:0xa7:Completed 80000 out of 500000 steps (16%)
15:40:57
15:40:51:WU01:FS01:0x22:Completed 940000 out of 2000000 steps (47%)
15:42:19
15:42:13:WU01:FS01:0x22:Completed 960000 out of 2000000 steps (48%)
15:43:38
15:43:32:WU01:FS01:0x22:Completed 980000 out of 2000000 steps (49%)
No Work
16:55:48
16:55:48:WU02:FS01:Connecting to 65.254.110.245:8080
16:55:48
[93m16:55:48:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration[0m
16:55:48
16:55:48:WU02:FS01:Connecting to 18.218.241.186:80
16:55:48
[93m16:55:48:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration[0m
16:55:50
[91m16:55:48:ERROR:WU02:FS01:Exception: Could not get an assignment[0m
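To illustrate how the two cases above could be told apart, here is a minimal classifier based only on the strings visible in these excerpts. The function names and the "processing" / "no_work" labels are made up for the example, and it assumes other F@H client versions log the same phrases, which I have not verified.

```python
import re

# Progress lines look like "Completed 900000 out of 2000000 steps (45%)".
PROGRESS = re.compile(r"Completed (\d+) out of (\d+) steps")

# Phrases seen in the "No Work" excerpt above.
NO_WORK_MARKERS = ("No WUs available", "Could not get an assignment")


def classify_line(line):
    """Label one log line as 'processing', 'no_work', or 'other'."""
    # Substring matching, so leading color-code residue does not matter.
    if PROGRESS.search(line):
        return "processing"
    if any(marker in line for marker in NO_WORK_MARKERS):
        return "no_work"
    return "other"


def instance_state(lines):
    """Return the state implied by the most recent informative line."""
    for line in reversed(lines):
        state = classify_line(line)
        if state != "other":
            return state
    return "unknown"
```

Looking at the most recent informative line rather than counting matches means an instance that failed to get an assignment earlier but is folding now still reads as "processing".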
I would like to have a way to measure whether an instance is actually running Folding@home jobs. If an instance (or container) is running but sitting idle (because it hasn't received an assignment, potentially due to upstream saturation), it's not useful to anyone and should be hibernated or terminated to save resources.