CesiumGS / cesium-unreal

Bringing the 3D geospatial ecosystem to Unreal Engine
https://cesium.com/platform/cesium-for-unreal/
Apache License 2.0
916 stars 292 forks

Switch to GitHub hosted runners #1323

Closed kring closed 6 months ago

kring commented 8 months ago

~This is just a test, don't merge it.~ It's ready now.

We've been using self-hosted runners to do Unreal builds for a while now. Mostly it works fine, but it can be a hassle to maintain the system that manages the infrastructure, and we sometimes see truly awful performance from builds (for unknown reasons).

So this PR switches to using GitHub-hosted large runners instead. Only the UE 5.1 Windows build is hooked up for the moment. Running on a generic build image rather than our custom ones requires some extra steps to happen during the build:

  1. Install Visual Studio 2019, because the runner only includes 2022.
  2. Download and extract Unreal Engine 5.1

These add time to the build (about 15-20 minutes), and (2) also adds significant cost, because UE is huge and downloading it from S3 on each build is expensive. Hosting it on Azure instead should fix that, though (GitHub runners run on Azure, and same-cloud downloads are usually free).
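For reference, those two extra steps boil down to commands along these lines (the Chocolatey package names, the S3 bucket, and the paths are illustrative placeholders, not necessarily what the workflow actually uses):

```
# 1. Install the VS2019 toolchain (the hosted image only ships VS2022).
choco install -y visualstudio2019buildtools visualstudio2019-workload-vctools

# 2. Download and extract Unreal Engine 5.1 (bucket and target path are placeholders).
aws s3 cp s3://our-build-deps/UnrealEngine-5.1.zip .
7z x UnrealEngine-5.1.zip -oC:\UE_5.1
```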

Overall this works really well. Time to build start is much shorter (~2 minutes instead of 5-6). The build itself is somewhere between a little and a lot faster, which is pretty mind-boggling because the build in this PR does a lot more work, and our self-hosted instances are similar to or faster than the GitHub-hosted instance. But, as mentioned above, we truly have no idea why the performance of our EC2 instances is so astoundingly slow in the self-hosted case. Could be we're doing something wrong. Or perhaps AWS performance in this sort of use-case (spin up a new instance, run a single build, shut it down) is just really terrible compared to what GitHub gets with Azure? More on the build slowness here: CesiumGS/cesium-unreal#1192

So this looks pretty viable from a purely technical perspective. From a cost perspective, though, I'm worried.

Our self-hosted Windows instances cost 78 to 99.2 cents per hour on-demand (depending on exactly which instance we use). We use spot instances so the actual cost is lower (that 99.2 cents is currently 59.94 cents as a spot instance). They are all 8-core machines with 32 GB+ of memory and a local SSD.

The 8-core, 32GB GitHub-hosted Windows instances instead cost 6.4 cents per minute, or 384 cents per hour. This is almost 4 times the on-demand price of our most expensive runner type, and 6.5 times its current spot price. Even if it cuts build times in half, we'll still be paying a lot more. And - again - it's mind-boggling that the GitHub-hosted instances are faster! The self-hosted runners are more powerful machines, doing less work! And yet, they are.

To ballpark it a bit, each Unreal commit kicks off 15 builds (5 platforms times 3 UE versions), plus a test and package for each version (6 more total). That could easily be as much as 21 hours of compute time for each commit, or $80.64 in total, per commit! Yikes. If we could get similar performance with the self-hosted setup, the cost would only be $12.59 per commit (based on spot instances).
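That's 21 jobs at roughly an hour each. A quick back-of-the-envelope check of those numbers, using the per-hour rates above:

```
# 21 job-hours at $3.84/hour (6.4 cents/minute, large GitHub-hosted Windows runner)
echo "21 * 3.84" | bc      # 80.64
# 21 job-hours at the current 59.94 cents/hour spot price
echo "21 * 0.5994" | bc    # 12.5874, i.e. about $12.59
```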

CC @mramato

mramato commented 8 months ago

But, as mentioned above, we truly have no idea why the performance of our EC2 instances is so astoundingly slow in the self-hosted case

I know very little about the self-hosted setup you've been using, but have you considered disk i/o as the bottleneck on AWS? I find it is one of the most overlooked aspects of this type of thing, and by default AWS volumes are set to burst mode, which means you get fast performance for a short amount of time and then things slow to a crawl. Provisioned i/o ops are faster and guaranteed. I don't know how they play with spot instances, which are themselves not ideal for a CI system in my opinion.
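For example, if the runner's volume is created from a snapshot, something along these lines would give it provisioned (non-burst) performance; all of the IDs and numbers below are placeholders rather than recommendations:

```
# Create the runner's volume as io2 with provisioned IOPS instead of burstable gp2
aws ec2 create-volume \
  --availability-zone us-east-1a \
  --snapshot-id snap-0123456789abcdef0 \
  --size 500 \
  --volume-type io2 \
  --iops 8000
```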

Or perhaps AWS performance in this sort of use-case (spin up a new instance, run a single build, shut it down) is just really terrible compared to what GitHub gets with Azure?

I would be surprised if GitHub is spinning up a new instance every time; they are almost certainly using containers of some kind, and the time it takes to spin up a new container is way less than spinning up a new instance (the same is true on AWS).

That could easily be as much as 21 hours of compute time for each commit, or $80.64 in total, per commit! Yikes. If we could get similar performance with the self-hosted setup, the cost would only be $12.59 per commit (based on spot instances).

Both of these numbers seem crazy to me. Like, wouldn't it be way cheaper to just use a self-hosted runner on a machine we physically own at this point? (The biggest issue there is the maintenance cost, so probably not.)

kring commented 8 months ago

I know very little about the self-hosted setup you've been using, but have you considered disk i/o as the bottleneck on AWS?

Yes, and I agree that's almost certainly the problem. I wrote what I know about it, and requested help with it, back in August: https://github.com/CesiumGS/aws/issues/296

We're using GP2 volumes, which I think avoids the burst behavior. But maybe it also limits the peak performance we get. In any case, I'm pretty sure the main problem isn't the performance of the EBS volume itself (our working area is on a local SSD anyway, not on an EBS volume), but rather the time to "warm up" the EBS volume by streaming the system disk from the snapshot on S3 for a brand new VM.

"For volumes that were created from snapshots, the storage blocks must be pulled down from Amazon S3 and written to the volume before you can access them. This preliminary action takes time and can cause a significant increase in the latency of I/O operations the first time each block is accessed. Volume performance is achieved after all blocks have been downloaded and written to the volume." (from here: https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/ebs-initialize.html)

I would be surprised if GitHub is spinning up a new instance every time; they are almost certainly using containers of some kind, and the time it takes to spin up a new container is way less than spinning up a new instance (the same is true on AWS).

Well, I don't know, but this page seems to say otherwise: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#using-a-github-hosted-runner

"When the job begins, GitHub automatically provisions a new VM for that job. All steps in the job execute on the VM, allowing the steps in that job to share information using the runner's filesystem. You can run workflows directly on the VM or in a Docker container. When the job has finished, the VM is automatically decommissioned."

They might have sufficient scale to have instances already spun up and ready to run before they're requested. We could do that too, but it would be costly. But again, 2 minutes versus 5 minutes startup time isn't really what I'm worried about here.

Both of these numbers seem crazy to me. Like, wouldn't it be way cheaper to just use a self-hosted runner on a machine we physically own at this point? (The biggest issue there is the maintenance cost, so probably not.)

Yes, it's a lot of money, but it's not obvious what we can do about it. On-premise hardware is always cheaper (by a lot!) than the cloud if you don't count the cost of maintaining the on-premise hardware. Whether we would save money with that approach would mostly come down to how reliable the hardware is, probably.

mramato commented 8 months ago

Sorry, I missed that you were using snapshots. Yes, that is crazy slow in my experience and I think you are correct that it is likely the primary issue.

Yes, it's a lot of money, but it's not obvious what we can do about it. On-premise hardware is always cheaper (by a lot!) than the cloud if you don't count the cost of maintaining the on-premise hardware. Whether we would save money with that approach would mostly come down to how reliable the hardware is, probably.

This is a question I struggle with as well.

Re: Azure. Putting the zip on Azure for this should be straightforward, we can do that early January if you want to try it out and see.
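If we go that route, the download step on the runner could be a single azcopy call, something like this (storage account, container, and SAS token are placeholders):

```
# Pull the UE zip from Azure Blob Storage instead of S3
azcopy copy "https://<storage-account>.blob.core.windows.net/unreal/UnrealEngine-5.1.zip?<SAS-token>" "UnrealEngine-5.1.zip"
```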

kring commented 8 months ago

Sorry, I missed that you were using snapshots. Yes, that is crazy slow in my experience and I think you are correct that it is likely the primary issue.

I don't know of a way to avoid using snapshots, though. I'd love to hear about it if there is such a way.

Re: Azure. Putting the zip on Azure for this should be straightforward, we can do that early January if you want to try it out and see.

I'm out the first week, back on the 8th. Would love to set that up then.

kring commented 8 months ago

As another experiment, I added this line to the start of the self-hosted build:

dd if=\\.\PHYSICALDRIVE0 of=nul bs=1G --progress --size

This is mentioned on the AWS EBS initialization page linked above as a way to make sure the EBS volume is warmed up and gets its full performance.

It takes an hour to run that command, so this is totally unworkable, but I expected the build performance to be really good after it completed, at least. Strangely enough, though, the build performance is still much slower than the github-hosted runners. After the hour-long warmup, building cesium-native took 7m 46s. But it only took 4m 42s on the GitHub-hosted instance. Similarly, building the plugin itself took 41m 28s on self-hosted instead of 25m 28s on GitHub.

So unless I've somehow completely failed at using dd to warm up the EBS volume, I think this tells us that the "restore from snapshot" time isn't the main issue here. I don't know what the issue is, though. Some possibilities:

  1. This was running on an i3.2xlarge EC2 instance. That's a relatively old CPU, I guess (it came out in 2016). I couldn't find any information about what GitHub uses; maybe it's just that much faster?
  2. Is the EBS performance, even when warmed up, still super slow for some reason? Is GP2 a poor choice? Have I unwittingly created a really low-performance volume?
  3. We're running the build on the VM's local SSD. Is that really slow for some reason? (seems crazy, but who knows)
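If it's (2) or (3), a quick raw sequential-read check with the same dd tool ought to narrow it down; I'm assuming here that the local NVMe shows up as PHYSICALDRIVE1, which may not be right:

```
# Read 8 GiB from the EBS system disk and from the local SSD and compare throughput
dd if=\\.\PHYSICALDRIVE0 of=nul bs=1M count=8192 --progress
dd if=\\.\PHYSICALDRIVE1 of=nul bs=1M count=8192 --progress
```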
kring commented 7 months ago

GitHub just announced that the default, free runner for public repos has been upgraded to 4 vCPUs, 16 GiB of RAM, and 150GiB of storage. So it's probably possible to do Unreal builds for free now, and hopefully even with decent performance. 🤩

shehzan10 commented 7 months ago

@kring Great update. Would this require downloading and unpacking Unreal Engine during the build? I suppose we can store it as an image right?

kring commented 7 months ago

Would this require downloading and unpacking Unreal Engine during the build?

Yes, it would. We definitely need to get the Unreal images on Azure in order to avoid huge costs.

I suppose we can store it as an image right?

Not sure what you mean here?

shehzan10 commented 7 months ago

I suppose we can store it as an image right?

Not sure what you mean here?

Can't we create an AMI-type image with Unreal already unpacked, to reduce the time for the download+unpack step in GitHub Actions runners?

kring commented 7 months ago

There's no way to use a custom image on GitHub-hosted runners AFAIK. A (Windows) container could be a possibility, but that'll only be a win if the container image is smaller than the ZIP, because we'd still have to download it (again, AFAIK).

kring commented 7 months ago

Based on just one sample, the upgraded small runners seem much slower than the large ones (which isn't too surprising, of course). It's harder to compare to the self-hosted runners. When the self-hosted runners are at their best, they're significantly faster than this (under an hour rather than the 1 hour 24 minutes we see here). However, when they're slower (for unknown reasons!), they're significantly slower than this. Considering the upgraded small runners are free, require less maintenance than the self-hosted ones, and are likely to be at least pretty consistent in how long they take to do a build, that's probably a win overall.

| Step | Large Runner Time | Small Runner Time | Notes |
|------|-------------------|-------------------|-------|
| Install VS2022 | 471s | 874s | Can probably eliminate this step entirely by using windows-2019 instead of windows-latest |
| Download Unreal Engine | 106s | 251s | |
| Unzip Unreal Engine | 481s | 804s | |
| Build cesium-native | 282s | 360s | |
| Build CesiumForUnreal plugin | 1528s | 2610s | |
| Upload plugin artifact | 350s | 36s | Large runner used upload-artifact@v3, small used v4, so probably can't compare |
| Total | 55m 20s | 1h 24m 15s | The times do not add up to the total because there are some small steps I omitted |
kring commented 6 months ago

This is working well and ready for review. It's using all the standard runner types, which are free for open source, so this should save us a lot of money. @mramato, if you can hook me up with some Azure account credentials, I should be able to drive our CI cost to zero with minimal effort. Right now it might still be semi-high because of the massive amount of data we download from S3 on every build.

The Windows and Linux builds are pretty performant. I had to do some slightly crazy things (mostly uninstalling stuff we don't need) to make room for Unreal Engine on the Linux VMs, because they're very low on disk space, but it's working well enough.
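For the curious, the Linux cleanup amounts to removing big preinstalled toolchains, roughly like this (the exact paths the workflow removes may differ):

```
# Free up disk space on the hosted Ubuntu runner before extracting Unreal Engine
sudo rm -rf /usr/share/dotnet /usr/local/lib/android /opt/ghc
sudo apt-get clean
df -h /   # verify there's now room for the UE install
```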

The macOS builds are slow, especially because we can only run 5 at a time and a single Unreal commit needs 6. I tried at one point to use the new M1 runners. They had amazing performance for building cesium-native, but they're so stupidly constrained on memory (only 7 GB, versus 14 GB for the macOS Intel runners!) that our Unreal builds took approximately forever. A lot of the problem is that Clang (unlike Visual Studio) uses silly amounts of memory when compiling the templates in the metadata system. It'd be nice to do something about this, but it won't be quick or easy (I know because I've already spent a fair bit of time trying).

kring commented 6 months ago

I'm merging this because, as imperfect as it may be, it's better than what's in main. And lots of other branches are failing in dodgy ways that are very likely to be fixed by this one.