Expensify / App

Welcome to New Expensify: a complete re-imagination of financial collaboration, centered around chat. Help us build the next generation of Expensify by sharing feedback and contributing to the code.
https://new.expensify.com
MIT License

[HOLD for payment 2024-04-25] `macos-12-xlarge` runners are being canceled at a high rate #32212

Closed AndrewGable closed 5 months ago

AndrewGable commented 9 months ago

Problem

About 50% of our iOS builds that use the macos-13-xlarge runner are being canceled with the error:

[Build and deploy iOS](https://github.com/Expensify/App/actions/runs/7032789006/job/19137288520)
The hosted runner encountered an error while running your job. (Error Type: Disconnect).

Solution

Fix it

AndrewGable commented 9 months ago

Looks related: https://github.com/actions/runner-images/issues/7754

AndrewGable commented 9 months ago

GitHub confirmed this was something on their side and they are looking into it

AndrewGable commented 9 months ago

Been pretty quiet from GitHub support, I will bump them

melvin-bot[bot] commented 9 months ago

@AndrewGable Whoops! This issue is 2 days overdue. Let's get this updated quick!

AndrewGable commented 9 months ago

No update from GitHub support

melvin-bot[bot] commented 9 months ago

@AndrewGable Whoops! This issue is 2 days overdue. Let's get this updated quick!

melvin-bot[bot] commented 9 months ago

@AndrewGable 6 days overdue. This is scarier than being forced to listen to Vogon poetry!

AndrewGable commented 9 months ago

We got a workaround from GitHub support, but I am not sure we are seeing the error anymore. Looking into it today.

AndrewGable commented 9 months ago

GitHub says if we set --maxsockets=1 on npm install it should help, but I am not sure we want to do so with all the runners not failing.
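
For reference, a minimal sketch of how that workaround might look as a workflow step. Only the `--maxsockets=1` flag comes from GitHub support's suggestion; the step name and the use of `npm install` rather than another install command are assumptions, not taken from the actual workflow files. `maxsockets` is npm's config option that caps the number of concurrent connections per origin.

```yaml
# Hypothetical step; the real workflow's install step may look different.
- name: Install node modules
  # Limit npm to one connection per origin, as suggested by GitHub support.
  run: npm install --maxsockets=1
```

Limiting npm to a single socket would likely slow installs on every runner, which is the trade-off to weigh if the failures have already stopped.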

melvin-bot[bot] commented 9 months ago

@AndrewGable Whoops! This issue is 2 days overdue. Let's get this updated quick!

melvin-bot[bot] commented 8 months ago

@AndrewGable Huh... This is 4 days overdue. Who can take care of this?

melvin-bot[bot] commented 8 months ago

@AndrewGable Now this issue is 8 days overdue. Are you sure this should be a Daily? Feel free to change it!

melvin-bot[bot] commented 8 months ago

@AndrewGable 10 days overdue. I'm getting more depressed than Marvin.

AndrewGable commented 8 months ago

I'll look back into this

melvin-bot[bot] commented 8 months ago

@AndrewGable Huh... This is 4 days overdue. Who can take care of this?

melvin-bot[bot] commented 8 months ago

@AndrewGable 6 days overdue. This is scarier than being forced to listen to Vogon poetry!

melvin-bot[bot] commented 8 months ago

@AndrewGable 10 days overdue. I'm getting more depressed than Marvin.

AndrewGable commented 8 months ago

I think GitHub must have fixed it on their side; I haven't seen this happen in 2+ weeks.

kgantchev commented 5 months ago

Hi, not to be intrusive here, but it seems that this is a recurring issue here... have you considered giving FlyCI a try?

melvin-bot[bot] commented 5 months ago

📣 @kgantchev! 📣 Hey, it seems we don’t have your contributor details yet! You'll only have to do this once, and this is how we'll hire you on Upwork. Please follow these steps:

  1. Make sure you've read and understood the contributing guidelines.
  2. Get the email address used to log in to your Expensify account. If you don't already have an Expensify account, create one here. If you have multiple accounts (e.g. one for testing), please use your main account email.
  3. Get the link to your Upwork profile. It's necessary because we only pay via Upwork. You can access it by logging in, and then clicking on your name. It'll look like this. If you don't already have an account, sign up for one here.
  4. Copy the format below and paste it in a comment on this issue. Replace the placeholder text with your actual details. Format:
    Contributor details
    Your Expensify account email: <REPLACE EMAIL HERE>
    Upwork Profile Link: <REPLACE LINK HERE>

AndrewGable commented 5 months ago

@kgantchev - Feel free to follow the proposal process, but no, we haven't.

kgantchev commented 5 months ago

@AndrewGable thanks for sharing the proposal guide. I've created a proposal based on that guide.

The problem

GitHub-hosted macOS runners are failing at an unsustainably high rate (close to 50%). This appears to be an infrastructure issue, with a message indicating that the runner stopped responding:

The hosted runner encountered an error while running your job. (Error Type: Disconnect).

In addition to the runner failure ("Disconnect"), the response time from GitHub support is too slow (up to 8 days to resolve the issue).

What is the root cause of that problem?

The root cause is an infrastructure issue on GitHub's side. A complicating factor is GitHub support, whose response times have been as long as 8 days.

What changes do you think we should make in order to solve the problem?

A possible solution is to use FlyCI's macOS runners. FlyCI offers M2 runners ranging from 4 vCPUs to 8 vCPUs (macOS 13 and 14), with the largest being the flyci-macos-14-xlarge-m2 runner with 8 vCPUs and 14 GB RAM.

The FlyCI runners are highly reliable and are supported by a very responsive dev team. Support is available by e-mail and in FlyCI's Discord server, with response times that aim to always be below 24 hours.

The switch is simple:

Step 1: Install the FlyCI GitHub app and grant it permissions for this repo.
Step 2: Switch the relevant runner label to point to FlyCI's labels.

In this case, there are 3 workflow files that have the offending runner label:

An example of the change looks like this for testBuild:

  iOS:
    name: Build and deploy iOS for testing
    needs: [validateActor, getBranchRef]
    if: ${{ fromJSON(needs.validateActor.outputs.READY_TO_BUILD) }}
    env:
      PULL_REQUEST_NUMBER: ${{ github.event.number || github.event.inputs.PULL_REQUEST_NUMBER }}
      DEVELOPER_DIR: /Applications/Xcode_15.0.1.app/Contents/Developer
-   runs-on: macos-13-xlarge
+   runs-on: flyci-macos-14-xlarge-m2

Note: the solution uses an M2 runner/macOS 14 (8 vCPU and 14 GB RAM), which should also provide a performance boost of about 20% compared to the M1 runners.

AndrewGable commented 5 months ago

Thanks for the proposal @kgantchev - I will consider it, but I will probably look at smaller solutions that stay on GitHub Actions first. We've standardized on GitHub runners and don't really want to splinter them across providers.

melvin-bot[bot] commented 5 months ago

@AndrewGable Uh oh! This issue is overdue by 2 days. Don't forget to update your issues!

AndrewGable commented 5 months ago

Going to see if macos-13-large helps; I believe xlarge might have been deprecated. This will still use Intel CPUs, as we don't want to use arm64.
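
A minimal sketch of what that label swap might look like, assuming the `iOS` job name shown in the proposal above and that nothing else in the job needs to change. The actual workflow files and the final pull request are not shown in this thread, so treat this as illustrative only.

```yaml
jobs:
  iOS:
    # Assumed change: replace macos-13-xlarge with the macos-13-large label;
    # the rest of the job definition is left as it was.
    runs-on: macos-13-large
```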

melvin-bot[bot] commented 5 months ago

Reviewing label has been removed, please complete the "BugZero Checklist".

melvin-bot[bot] commented 5 months ago

The solution for this issue has been :rocket: deployed to production :rocket: in version 1.4.62-17 and is now subject to a 7-day regression period :calendar:. Here is the list of pull requests that resolve this issue:

If no regressions arise, payment will be issued on 2024-04-25. :confetti_ball: