UCL / TLOmodel

Epidemiology modelling framework for the Thanzi la Onse project
https://www.tlomodel.org/
MIT License

Set up workflows to run on Kubernetes / autoscale runners #1356

Open matt-graham opened 1 month ago

matt-graham commented 1 month ago

To apply to

giordano commented 1 month ago

/run dummy-job

ucl-comment-bot[bot] commented 1 month ago

Dummy job for testing workflow succeeded ✅

🆔 25109039407 ⏲️ 0.00 minutes

#️⃣ a95e91689dd72cb1dc3dc365b23b0312cf83e9c4

giordano commented 1 month ago

/run dummy-job

giordano commented 1 month ago

/run dummy-job

ucl-comment-bot[bot] commented 1 month ago

Dummy job for testing workflow succeeded ✅

🆔 25111065781 ⏲️ 0.00 minutes

#️⃣ 67c80589a2e0f274ac12c87cd76876df3010d51d

ucl-comment-bot[bot] commented 1 month ago

Dummy job for testing workflow succeeded ✅

🆔 25111187558 ⏲️ 0.00 minutes

#️⃣ 67c80589a2e0f274ac12c87cd76876df3010d51d

giordano commented 1 month ago

Maybe let's wait for the scheduled profiling jobs to run tomorrow, but this should be working after #1358.

matt-graham commented 1 month ago

Unfortunately, the scheduled profiling workflow failed. Weirdly, it seems to have successfully completed the profiling run and saved the profiling output post-run, but the Run profiling in dev environment step is still shown with an 'in-progress' yellow/amber spinner (even though the overall job shows as failed), and the subsequent Save results as artifact step did not appear to start running 😕

[screenshot of the failed workflow run]

giordano commented 1 month ago

Since the overall build stopped after almost exactly 12 hours, I wonder if there's a (hopefully configurable) 12-hour timeout somewhere in the ARC configuration.

giordano commented 1 month ago

Quick comment: I can confirm the virtual machine is down. I still have the feeling there is some setting somewhere that takes down a VM after 12 hours, but I have no idea where it would come from; I can't see any relevant settings anywhere.

Side note, looking at the timings

[00:22:16:INFO] Starting profiling runs
[12:03:13:INFO] Profiling runs complete

this run seems to have taken much longer than previous jobs. Is this concerning?

giordano commented 1 month ago

Ok, I did some more investigation:

All in all, my understanding is that the failure we've seen is indeed due to an OOM, which isn't unlikely given the workload, according to both Matt and Will. Note that the new autoscaling runners are the aforementioned (dedicated) Standard_DS2_v2 machines, while previously the jobs were running on Standard_F16s_v2 machines, which have 32 GiB of memory (although those machines are shared with other workflows, scheduled jobs on Saturday are likely to run at a quiet time). Side note: the Standard_DS2_v2 vs Standard_F16s_v2 difference should also explain why the job took longer with the new setup. I'm not really sure what we can do: reduce the profiling workload to make it fit in a Standard_DS2_v2 box, or get a pool of beefier machines (maybe one of the memory-optimised machines, CC @tamuri)?

giordano commented 1 month ago

or get a pool of beefier machines (maybe one of the memory-optimised machines, CC @tamuri)

Actually, this may not be too bad an option: I'm comparing different machines with the Azure Pricing Calculator and, unless I'm reading it wrong, Standard_E2_v4 (16 GiB of memory for 2 fairly recent vCPUs) seems to be slightly cheaper than Standard_DS2_v2 (7 GiB of memory for 2 older-generation vCPUs) (see also the page about pricing of virtual machines).

tamuri commented 1 month ago

We use the Standard D11 v2 for the batch-submit --more-memory option. The E-series doesn't come with any disk storage.

tamuri commented 1 month ago

But we should check we're using the machine with the best value for money.

giordano commented 1 month ago

The E-series doesn't come with any disk storage.

Yeah, I noticed that the machines I suggested don't have temporary storage after I posted the message; the VM pricing page is much clearer about this. However, some E-series machines do have temporary storage, just not all of them.

Here's a comparison of D2 v2 with some of the memory-optimised machines (pricing refers to the UK South region); all of them have 2 vCPUs:

| Name | CPU | Memory | Temporary storage | Cost ($/month) |
| --- | --- | --- | --- | --- |
| D2 v2 | Intel® Xeon® Platinum 8272CL (second-generation Intel® Xeon® Scalable), Intel® Xeon® 8171M 2.1 GHz (Skylake), Intel® Xeon® E5-2673 v4 2.3 GHz (Broadwell), or Intel® Xeon® E5-2673 v3 2.4 GHz (Haswell) | 7 GiB | 100 GiB | 128.48 |
| E2ads v5 | AMD EPYC 7763v | 16 GiB | 75 GiB | 112.42 |
| E2a v4 | 2.35 GHz AMD EPYC 7452 | 16 GiB | 50 GiB | 108.04 |
| E2pds v5 | Ampere® Altra® Arm-based | 16 GiB | 75 GiB | 98.55 |

D2 v2 seems to be a bit expensive overall, maybe because it's in the general-purpose category and in high demand? E2a v4 isn't bad if 50 GiB of local storage is enough, and we could even consider E2pds v5 if using ARM CPUs is an option.
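As a rough illustration of the value-for-money question, here is a small sketch that computes cost per GiB of memory from the monthly prices in the table above (the figures are just the ones quoted there, pay-as-you-go in UK South; the script is illustrative only):

# Rough value-for-money check: cost per GiB of memory per month,
# using the monthly prices quoted in the table above (assumed figures).
machines = [
    (name = "D2 v2",    memory_gib = 7,  cost_per_month = 128.48),
    (name = "E2ads v5", memory_gib = 16, cost_per_month = 112.42),
    (name = "E2a v4",   memory_gib = 16, cost_per_month = 108.04),
    (name = "E2pds v5", memory_gib = 16, cost_per_month = 98.55),
]

for m in machines
    cost_per_gib = round(m.cost_per_month / m.memory_gib; digits = 2)
    println(m.name, ": \$", cost_per_gib, " per GiB of memory per month")
end

By this (memory-only) metric D2 v2 comes out at roughly \$18/GiB/month versus \$6-7/GiB/month for the E-series options, which is what makes the memory-optimised machines look attractive here.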

giordano commented 1 month ago

For the record, I set up a new autoscaling Kubernetes cluster with Standard_E2_v4 machines, which gives us better CPUs than Standard_DS2_v2 and more memory for less money, although with less storage (which shouldn't be a problem for our use case). I restarted the scheduled job that failed on Saturday, and this time it completed successfully in about 8 hours, a time close to previous runs. So I think this is a net improvement compared to the first setup I attempted.

The only thing is that, for a reason I still don't understand, we can't use more than 12 GiB of memory even though the machine nominally has 16 GiB (more likely something between 15 and 16, but definitely larger than 12): when I try to use values larger than 12 GiB (e.g. 14), no GitHub Actions runner is ever started at all because of insufficient memory. But in any case 12 GiB should be plenty of memory for the profiling jobs and similar workloads: I had a look at past runs of the profiling jobs, and the maximum total memory usage on the node was about 4 GiB. Edit: the ~12 GiB limit seems to come from AKS: the maximum allocatable memory on the node is about 12 GiB, even though the nodes we're requesting have 16 GiB:

% kubectl describe node aks-agentpool-...
[...]
Allocatable:
  cpu:                1900m
  ephemeral-storage:  119703055367
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             12881168Ki
  pods:               110

This would also explain why I can't put more than 12 GiB in the spec.resources.requests.memory property of the Kubernetes cluster.
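To make the mismatch concrete, here is a minimal back-of-the-envelope check (the allocatable figure is the one from the kubectl output above; the 14 GiB request is the value mentioned earlier as failing to schedule):

# Allocatable memory reported by `kubectl describe node` above, in KiB.
allocatable_kib = 12_881_168
allocatable_gib = allocatable_kib / 2^20    # ≈ 12.28 GiB

# A runner pod requesting 14 GiB can never be scheduled on this node,
# because the request exceeds the node's allocatable memory.
requested_gib = 14
println("14 GiB request fits? ", requested_gib <= allocatable_gib)   # false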

giordano commented 1 month ago

Alright, I finally figured out the problem with the missing memory: AKS forcibly restricts the total amount of allocatable memory, to reserve some space on the node for the Kubernetes system services. At the moment, the formula used to restrict the available memory depends on the version of Kubernetes, and in particular the formula used for Kubernetes v1.28 and earlier reserves much more memory than the one used with Kubernetes v1.29. We're currently using v1.28 (the latest "stable" version in AKS), while v1.29 should become "stable" around August-September according to the AKS docs about supported Kubernetes versions.

At the moment with Kubernetes v1.28 we have

 % kubectl describe node aks-agentpool-16076257-vmss000000
[...]
Capacity:
  cpu:                2
  ephemeral-storage:  129886128Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16375056Ki
  pods:               110
Allocatable:
  cpu:                1900m
  ephemeral-storage:  119703055367
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             12881168Ki
  pods:               110

Note the large difference between memory capacity and allocatable memory, over 20%! Let's do some maths:

julia> total_memory = 16375056
16375056

julia> hard_eviction = 750 * 2 ^ 10
768000

julia> kube_reserved = round(Int, 0.25 * 4 * 2 ^ 20 + 0.20 * 4 * 2 ^ 20 + 0.10 * (total_memory - 8 * 2 ^ 20))
2686082

julia> allocatable_memory = total_memory - (hard_eviction + kube_reserved)
12920974

Following the formula reported in the docs for Kubernetes v1.28 and earlier, I get an expected allocatable memory of 12920974 KiB; in reality it's 12881168 KiB, but that's close enough (the error is about 0.3%).

With Kubernetes v1.29 and a maximum of 50 pods we expect to have:

julia> total_memory = 16375056
16375056

julia> k129_reserved(max_pods) = (100 * 2 ^ 10 + max_pods * 20 * 2 ^ 10)
k129_reserved (generic function with 1 method)

julia> total_memory - k129_reserved(50)
15248656

Also in this case the actual allocatable memory reported by Kubernetes is off by ~0.3% (sorry, I don't have the precise number to share!), but the number above is good enough for a ballpark estimate. I tried this setup and ran a test CI job in which I allocated an array of about 13 GiB, which together with the baseline memory used on the node brought total usage to well over 14 GiB. That amount of memory usage would systematically OOM the machine in all my previous attempts with Kubernetes 1.28, so this is a significant improvement, especially in my understanding of Kubernetes and AKS :sweat_smile: In any case, at the moment I don't think we're concerned about running out of memory with 12 GiB of allocatable memory, so I think we can stay with Kubernetes 1.28.
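For reference, here is the whole calculation in one place, with both reservation formulas as used above (this is just a sketch of the maths in this thread, not the exact AKS implementation):

total_memory = 16_375_056   # Capacity.memory from `kubectl describe node`, in KiB

# Kubernetes v1.28 and earlier on AKS: 750 MiB hard-eviction threshold plus a
# tiered kube-reserved amount (25% of the first 4 GiB, 20% of the next 4 GiB,
# 10% of the remainder on a node of this size).
hard_eviction = 750 * 2^10
kube_reserved = round(Int, 0.25 * 4 * 2^20 + 0.20 * 4 * 2^20 + 0.10 * (total_memory - 8 * 2^20))
allocatable_v128 = total_memory - (hard_eviction + kube_reserved)   # 12920974 KiB ≈ 12.3 GiB

# Kubernetes v1.29 on AKS: 100 MiB plus 20 MiB per schedulable pod.
k129_reserved(max_pods) = 100 * 2^10 + max_pods * 20 * 2^10
allocatable_v129 = total_memory - k129_reserved(50)                 # 15248656 KiB ≈ 14.5 GiB

println("allocatable (v1.28): ", allocatable_v128, " KiB")
println("allocatable (v1.29): ", allocatable_v129, " KiB")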

To summarise, we now have an AKS deployment with the following properties:

tamuri commented 1 month ago

This is great, thanks for digging into it. I wonder whether we can move the Batch pool VMs over to this too. Everything is bundled into a Docker container; not sure how much storage we'll need.