MeanPug / folding-together

Democratizing folding@home (and potentially other networks like rosetta@home)
MIT License

Measure instance efficiency #6

Open atkinsonm opened 4 years ago

atkinsonm commented 4 years ago

I would like to have a way to measure whether an instance is actually running folding@home jobs. If an instance (or container) is running but sitting idle (because it hasn't received an assignment, potentially due to upstream saturation), it's not useful to anyone and should be hibernated or terminated to save resources.

atkinsonm commented 4 years ago

We can expose some metric through CloudWatch to measure this and terminate instances when they are idle. The key would be determining what that metric is. An oversimplified approach would just be to look at low CPU/GPU utilization.
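A minimal sketch of that oversimplified utilization check, written as plain Python with no AWS calls. The function name, the 10% threshold, and the three-sample window are all hypothetical values chosen for illustration, and the samples are assumed to be utilization percentages ordered oldest to newest.

```python
def looks_idle(samples, threshold_percent=10.0, required_consecutive=3):
    """Return True if the most recent samples are all below the threshold.

    samples: utilization percentages, oldest first (assumed input shape).
    threshold_percent and required_consecutive are illustrative defaults,
    not values taken from this project.
    """
    if required_consecutive < 1:
        raise ValueError("required_consecutive must be at least 1")
    if len(samples) < required_consecutive:
        # Not enough data yet, so do not declare the instance idle.
        return False
    recent = samples[-required_consecutive:]
    return all(sample < threshold_percent for sample in recent)
```

Requiring several consecutive low samples avoids reacting to a single quiet reading, but low utilization alone still cannot distinguish "no work assigned" from other causes.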

atkinsonm commented 4 years ago

There's no built-in termination policy for low CPU, so what we would need to do is:

steinbachr commented 4 years ago

I think the easier way might be to piggyback on https://github.com/MeanPug/folding-together/issues/20: define a metric based on the captured logs, then alert on that metric.

atkinsonm commented 4 years ago

Absolutely. If the CW logs can tell us whether or not the instance is processing a work unit, that will be a definitive answer.

If an instance is running and not picking up work units, either the F@H client is busted or the upstream is out and/or saturated. We probably want to have some type of cooldown such that if this situation is reached, we will back off for some amount of time and try again later.
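One way the cooldown could work is a capped exponential backoff. The sketch below only computes the wait times; the function name and every default (5-minute base, doubling factor, 1-hour cap, 5 attempts) are assumptions for illustration, not settings from this project.

```python
def backoff_delays(base_seconds=300, factor=2, max_seconds=3600, attempts=5):
    """Return the wait time in seconds before each retry.

    Each delay is `factor` times the previous one, capped at max_seconds.
    All defaults are illustrative assumptions.
    """
    delays = []
    delay = base_seconds
    for _ in range(attempts):
        delays.append(min(delay, max_seconds))
        delay *= factor
    return delays


print(backoff_delays())  # [300, 600, 1200, 2400, 3600]
```

The cap keeps a long upstream outage from pushing the retry interval out indefinitely.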

steinbachr commented 4 years ago

I don't think it will be difficult. Here's a tail of some recent logs as pushed to the fah-node.txt log group:

Processing


15:38:16
15:38:10:WU01:FS01:0x22:Completed 900000 out of 2000000 steps (45%)

15:39:38
15:39:32:WU01:FS01:0x22:Completed 920000 out of 2000000 steps (46%)

15:40:28
15:40:22:WU00:FS00:0xa7:Completed 80000 out of 500000 steps (16%)

15:40:57
15:40:51:WU01:FS01:0x22:Completed 940000 out of 2000000 steps (47%)

15:42:19
15:42:13:WU01:FS01:0x22:Completed 960000 out of 2000000 steps (48%)

15:43:38
15:43:32:WU01:FS01:0x22:Completed 980000 out of 2000000 steps (49%)

No Work


16:55:48
16:55:48:WU02:FS01:Connecting to 65.254.110.245:8080

16:55:48
16:55:48:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration

16:55:48
16:55:48:WU02:FS01:Connecting to 18.218.241.186:80

16:55:48
16:55:48:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration

16:55:50
16:55:48:ERROR:WU02:FS01:Exception: Could not get an assignment
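
As a rough sketch of how the two excerpts above could be told apart, the following Python labels each log line and applies a simple idle rule to a window of lines. The regular expressions are inferred only from the sample lines shown here, not from any official log format, and the function names and the idle rule are hypothetical.

```python
import re

# Patterns inferred from the sample lines above (an assumption, not a log spec).
PROGRESS_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}:WU\d+:FS\d+:0x[0-9a-fA-F]+:"
    r"Completed (\d+) out of (\d+) steps"
)
NO_WORK_RE = re.compile(r"No WUs available|Could not get an assignment")


def classify_line(line):
    """Return 'processing', 'no_work', or 'other' for a single log line."""
    if PROGRESS_RE.search(line):
        return "processing"
    if NO_WORK_RE.search(line):
        return "no_work"
    return "other"


def window_is_idle(lines):
    """Hypothetical rule: idle if the window has a no-work line and no progress line."""
    labels = [classify_line(line) for line in lines]
    return "no_work" in labels and "processing" not in labels
```

A count of `processing` lines per time window could then serve as the metric to alert on, with a sustained zero indicating an idle instance.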