GitHub workflow action failures for large feeds

bdferris-v2 commented 1 year ago

I've noticed an increased incidence of GitHub workflow action failures (link) for our end_to_end and acceptance_test workflows. Digging deeper into these failures, the validator itself isn't failing. Instead, our workflow runners are being terminated with the following message:

The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

The process itself includes the following in its logs:

Error: Process completed with exit code 143.

This is consistent with the process being terminated externally via SIGTERM.

So what's actually going on? Other folks are reporting the same issue to GitHub and it's pointed out that most of these runners are using ubuntu-latest, which has been upgraded to 22.04 over the past week (link). That potentially changes the resource envelope of the running worker, which might lead to resource exhaustion at different points than the previous runner.

Indeed, looking at our action failures, it seems to occur most often for large feeds such as no-unknown-agder-kollektivtrafikk-as-gtfs-1078 (11M stop-time entries) and de-unknown-ulmer-eisenbahnfreunde-gtfs-1081 (huge shapes.txt file).

So what do we do about this? Still digging into that.

But just to check, @aababilov and @asvechnikov2, your recent changes over the past month should have reduced memory usage for large feeds, not increased it, correct?

bdferris-v2 commented 1 year ago

As a follow-up, I've verified that heap usage doesn't look like it's changed significantly when validating against no-unknown-agder-kollektivtrafikk-as-gtfs-1078 with a build of the validator from October and from head. So no memory regression.

But it still takes a lot of heap to process these feeds. We max out at 8GB heap used for no-unknown-agder-kollektivtrafikk-as-gtfs-1078 specifically. Compare that with the quoted memory limit of 7GB for Linux images from GitHub (reference). It's not hard to imagine we are maxing that out and our runner is being terminated as result.

So what do we do about this? A couple of options:

1) We could exclude very large feeds from acceptance tests. Kind of a bummer, but maybe not the end of the world? 2) We could try to use larger runners? GitHub has support for larger runners, available to GitHub Team customers (I believe MobilityData is such a customer?). But it comes at a literal price. We'd need the 4-core + 16GB machine profile, which charges at $0.016/min. I'm not sure what that would add up to but I could try to run some numbers.

@isabelle-dr I think I'm going to need your input here. I specifically don't have access to the MobilityData orgs action usage stats but maybe you do? Could be useful to give us an estimate of what this might cost to migrate to the larger runners. We might be able to limit spend by only using large runners for acceptance tests and limiting their execution?

fredericsimard commented 1 year ago

So I've looked into our action usage stats, but there is still nothing. I believe it is due to MobilityData's status as a non-profit which gives us a free Team plan, therefore they don't cumulate any stats.

I've registered for the large runners, it's a beta program, and I'm waiting for a confirmation before these runners can be accessed. We'll have to see what costs it incurs I guess.

Did you try the ubuntu-20.04 label in the action so it runs in the previous OS?

Alternatively, we could host our own runner by creating a VM in Digital Ocean just for these tasks, but the costs could be important, between 48$ and 96$/month depending on the setup. I could check if we don't already have a running VM that could be used for this, since it's ephemeral in usage.

fredericsimard commented 1 year ago

Alright, that was fast. The large runners are ready, I believe the one you need is called ubuntu-latest-4-cores.

Runner group: Default Larger Runners Size: 4-cores · 16 GB RAM · 150 GB SSD Image: Ubuntu Latest (22.04) Public IP: Disabled

bdferris-v2 commented 1 year ago

@fredericsimard I can't access the link you posted. Just to confirm, is ubuntu-latest-4-cores the name of the runner group?

fredericsimard commented 1 year ago

Yes. You don't need to access the link, but yeah the name is correct. See capture.

fredericsimard commented 1 year ago

And I've removed the larger ones, leaving the classic one and the 4-core VM.

bdferris-v2 commented 1 year ago

I'm going to reopen this for some follow-up discussion.

In #1320, we modified our workflows to use runners with more memory to help process larger feeds. A couple thoughts from that update:

1) Wall-clock time for running our workflows seemed to be overall slower with this change (e.g. rule acceptance workflow took 1.5h while previously it had been running closer to 30 minutes). From what I can tell, our pool of available runners in the "large" pool seems to be smaller than the general runner pool, meaning than many of our tasks in the end-to-end and rule acceptance test workflows end up being queued because we don't have as much parallelism in task processing. I'm wondering if we can increase the size of our runner pool to address this, perhaps via auto-scaling? If we can't figure it out, I'm wondering if we should drop back to the default runner pool, excluding large feeds from our workflows, mostly to get faster turnaround times on our workflows.

2) We now have some idea of what the billed time will be for our workflows. For example, the rule acceptance workflow accrued 6h42m of billable time and the end-to-end workflow accrued 1h of billable time. That's 462 minutes total. Per GitHub pricing of $0.016/min (USD) for a 4-core Linux machine, that's $7.39 per workflow run. We had ~575 workflow runs in the past year, which would put annual spend at around $4250/year. As a sanity check, I want to make sure there aren't any budget concerns with these numbers.

Thoughts @isabelle-dr @fredericsimard ?

fredericsimard commented 1 year ago

...perhaps via auto-scaling?

I'm willing to try it. There's only one large runner in the runner group, so perhaps adding one more would help. But wouldn't it be the same cost? Two machines doing the same work in half the time as one?

We now have some idea of what the billed time will be for our workflows. For example, the rule acceptance workflow accrued 6h42m of billable time and the end-to-end workflow accrued 1h of billable time. That's 462 minutes total. Per GitHub pricing of $0.016/min (USD) for a 4-core Linux machine, that's $7.39 per workflow run. We had ~575 workflow runs in the past year, which would put annual spend at around $4250/year. As a sanity check, I want to make sure there aren't any budget concerns with these numbers.

That's a lot. Even a fully dedicated VM for this kind of task on DO would be 580..1200$/year. What I mean by that is we can run our own runner instead of GitHub's. For the price difference, I think it's worth it. There might be some work to make it run with Public repos I believe, I've seen something to that matter last night, would need to be confirmed.

bdferris-v2 commented 1 year ago

There's only one large runner in the runner group

Interesting, there does appear to be /some/ parallelization of jobs with the existing single runner, maybe using each of the 4 cores to run individual task?

But wouldn't it be the same cost? Two machines doing the same work in half the time as one?

That's my interpretation as well, but we'd want to double check the billing numbers after enabling. But I wouldn't stop at two runners, I'd like 4 or 5 to get our runtime numbers down.

For the price difference, I think it's worth it.

I'll ultimately leave that call up to you, but a couple of counterpoints:

1) If we can get scalability of runners from GitHub without increased cost, I think that makes GitHub a bit more price competitive to an equivalent number of self-hosted runners. 2) Keeping a VM maintained and up-to-date is not totally free. 3) Per self-hosted runner security, our current setup would mean someone could submit a pull request to run arbitrary code on our runner, which is a little scary. By comparison, GitHub VM runners are wiped between each job execution, which mitigates some of this. I've done some basic research, but I'm not sure how we'd mitigate this.

isabelle-dr commented 1 year ago

If we decided to run the acceptance tests run on demand before merging instead of on every commit, would this make a difference?

fredericsimard commented 1 year ago

There's only one large runner in the runner group

Interesting, there does appear to be /some/ parallelization of jobs with the existing single runner, maybe using each of the 4 cores to run individual task?

It's possible. I haven't seen anything about that in the documentation I've read so far.

But wouldn't it be the same cost? Two machines doing the same work in half the time as one?

That's my interpretation as well, but we'd want to double check the billing numbers after enabling. But I wouldn't stop at two runners, I'd like 4 or 5 to get our runtime numbers down.

We could test that yes. Let me know when you're ready to test this.

For the price difference, I think it's worth it.

I'll ultimately leave that call up to you, but a couple of counterpoints:

If we can get scalability of runners from GitHub without increased cost, I think that makes GitHub a bit more price competitive to an equivalent number of self-hosted runners. Yes, if scaling works, it makes more sense.

Keeping a VM maintained and up-to-date is not totally free. No, but it doesn't take very long. It's negligeable.

Per self-hosted runner security, our current setup would mean someone could submit a pull request to run arbitrary code on our runner, which is a little scary. By comparison, GitHub VM runners are wiped between each job execution, which mitigates some of this. I've done some basic research, but I'm not sure how we'd mitigate this.

Yes, I've read that to. Wouldn't the PRs requiring a reviewer be our line of defence in this case? I'm sure there is a way to wipe the VM after a run, surely we can replicate this behaviour in some form?

Also I've looked into the costs for the runner and its usage so far, we have: 1,178 minutes = $18.85 (with a 500$/month spending limit).

At that price I could've had a 2-core VM running for a month. This is an expensive service from GH.

bdferris-v2 commented 1 year ago

Wouldn't the PRs requiring a reviewer be our line of defence in this case? I'm sure there is a way to wipe the VM after a run, surely we can replicate this behaviour in some form?

It's not so simple. Our workflows are defined in our repository and in practice, a user can submit a pull request against those workflows and any changes in those workflows will be reflected immediately in the execution of those workflows, whether they have been reviewed or not. If this sounds like a flaw in GitHub, I'm inclined to agree with you.

So long-story-short, if we have a self-hosted runners associated with our public repository, then we cannot prevent someone from executing arbitrary commands on them. This is why GitHub advices against this setup.

fredericsimard commented 1 year ago

Yeah, makes sense. Let's do the test with multiple machines. Although the cost is really high for what it is, I'd need to get an approval for such a high amount. And demonstrate it fixes an issue we can't fix any other way, or for cheaper.

bdferris-v2 commented 1 year ago

As a follow-up, @fredericsimard bumped the max runner pool size from 10 => 20 yesterday. I ran a new instance of the Rule acceptance workflow (run link) with the updated runner count. Results:

Wall-clock execution time: 52m 42s
Billable time: 6h 42m

That's roughly the same billable time as the previous test runs in 43% less actual run time (not quite half, but close).

So main conclusion is that we can bump the size of the runner pool to get faster execution at about the same cost. I'd propose bumping to 30 :)

I also want to point out that there was some confusion earlier about how many runners we had. I had interpreted @fredericsimard 's comment above to mean we only had a single runner for our pool, when in fact we had 10.

So @fredericsimard to your point about self-hosting a VM, I'd actually be asking you to host 10-20 runners, not just 1-2. I feel like that makes self-hosted runners less competitive on price. Combine that with the security concerns and it seems like a non-starter to me. But maybe you disagree?

My take is that we are left with two options: 1) We continue to pay and use the larger runner pool. We can monitor our billed usage over time to make sure my initial estimates were accurate. 2) We could remove the handful of extremely large feeds from our test sets and go back to using default GitHub runners for free. 3) We try to go for some hybrid approach where we use option #2 for default workflow executions, but we introduce a secondary workflow that runs for a larger set of feeds (using the large runner pool) that we can manually trigger if we think it's warranted for a particular PR (e.g. changing validation in a new way). This could reduce billable costs while still allowing us better test coverage when we need it.

Again, I'm going to defer to @fredericsimard and @isabelle-dr on the cost + quality trade-offs here. Personally, I'm inclined to just go with #2 for now and potentially explore #3 as a follow-up. Thoughts?

fredericsimard commented 1 year ago

@bdferris-v2 I just want to understand why these tests were required in the first place: isn't it because the classic runners are just not up to the task and some automated tests never got completed? So the question is, we need large runners to complete those tests, or remove those tests, which is not the desired outcome, correct?

I can try and get a budget approved for ~5000$/year for this if we can justify it. So is it absolutely required, and if yes, why?

bdferris-v2 commented 1 year ago

There are two key workflows here:

Rule acceptance tests: runs on a large number of feeds from the Mobility Database to understand how a PR is adding / removing / modifying the set of validation notices generated for a corpus of feeds. This is important for capturing regressions introduced by a change or verifying that we aren't introducing breaking changes for existing feeds.
End-to-end test: this runs validation on a smaller subset of feeds just to make sure the validator isn't crashing as result of a particular PR.

These tests used to run more reliably but either due to changes in runner memory configuration (see discussion in opening comment) or increase in feed size, they were no longer reliably running. Thus the switch to larger runners.

These tests are important and we do not want to remove them entirely. However, as discussed above, we could remove some of the larger feeds from our test corpus, allowing us to go back to smaller runners, at the expense of reduced test coverage.

fredericsimard commented 1 year ago

Alright, I have my bi-weekly meeting tomorrow AM, will bring this up. Thanks!

bdferris-v2 commented 1 year ago

Any update on this?

fredericsimard commented 1 year ago

Oh yeah, sorry. I have a pre-approval, but we need more data in order to get a monthly spending limit set. Last time I check the bill was at 116 h, so less than 5 days, and 109$ USD has been accrued. I've set the limit to 500$ and I'll keep on checking on it once every few days.

For now, keep going as normal so we can get real numbers.

davidgamez commented 2 weeks ago

@bdferris-v2 @fredericsimard I think we can close this as a few actions have occurred since it was open, and for the past year, the E2E and acceptance tests actions have been executed reliably. We will continue monitoring the GitHub actions and bills. cc: @emmambd

Optimization
- Increased Java heap memory #1834
- GitHub Cost optimization, decreased the maximum number of runners from 256 to 60(max allowed in our organization) #1904
Monitoring
- Add consumption time metrics to acceptance test #1771
- Add memory to acceptance test #1874
Large feeds exclusion list, https://github.com/MobilityData/gtfs-validator/blob/665cbc6ff88bbf8694aa0002c9328b4900992ed0/scripts/mobility-database-harvester/harvest_latest_versions.py#L34
Pending
- 1910

fredericsimard commented 2 weeks ago

@davidgamez Completely agree. Closing!

MobilityData / gtfs-validator

GitHub workflow action failures for large feeds #1304

1910