deepflowio / deepflow

eBPF Observability - Distributed Tracing and Profiling
https://deepflow.io
Apache License 2.0
2.84k stars · 313 forks

Shorten the waiting time of GitHub CI #1101

Open Nick-0314 opened 1 year ago

Nick-0314 commented 1 year ago

Feature request

Shorten the waiting time of GitHub CI

Use case

Nick-0314 commented 1 year ago

@aktech Hello, we have fully switched to cirun.io and abandoned our original runners, and found some problems today. We think GitHub CI's current response time of about one and a half minutes is a bit slow; is there a way to speed it up, for example by pre-installing some software in the AMI? In addition, we have a problem where the runner requested by one CI run is taken by another. Is there a good solution?

aktech commented 1 year ago

We think GitHub CI's current response time of about one and a half minutes is a bit slow; is there a way to speed it up, for example by pre-installing some software in the AMI?

Hey @mytting yes, you can create a custom AMI with some of the software already installed, like say Docker, etc. That would speed up your overall CI time. The provision time wouldn't be affected much, as it's mainly just calling AWS's API to spin up a VM and installing the GitHub Actions runner; the installation itself doesn't take much time, less than 15-20 seconds. Most of the time is spent getting a VM from AWS. I can take a look to see if there are any bottlenecks that can be improved.
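(For illustration, a minimal sketch of how a custom AMI could be referenced in the .cirun.yml runner definition that appears later in this thread; the AMI ID below is only a placeholder.)

runners:
  - name: "aws-amd64-32c"
    cloud: "aws"
    instance_type: "c6id.8xlarge"
    # Custom AMI baked with Docker and the build toolchain pre-installed (placeholder ID)
    machine_image: "ami-0123456789abcdef0"
    preemptible: true
    labels:
      - "aws-amd64-32c"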

In addition, we have a problem where the runner requested by one CI run is taken by another. Is there a good solution?

What do you mean by other CI? Do you mean other jobs? Runners are picked up by GitHub Actions workflows by runner labels, which is controlled by:

runs-on: cirun-aws-amd64-32c

If you want them to be unique, I can look into implementing spinning up runners by run_id; then you could do something like:

# Not implemented yet
runs-on: "cirun-aws-amd64-32c--${{ github.run_id }}"

Will that help?

Nick-0314 commented 1 year ago

"cirun-aws-amd64-32c--${{ github.run_id }}" Yes, this is the effect I want. Is there anything you need to do to make it happen?

@aktech

Nick-0314 commented 1 year ago

When will that be possible?

aktech commented 1 year ago

Yes, this is the effect I want. Is there anything you need to do to make it happen?

Yes, I need to implement it. You should have it within a few days (a week at most). I'll implement it and share the documentation link here.

Nick-0314 commented 1 year ago

Yes, this is the effect I want. Is there anything you need to do to make it happen?

Yes, I need to implement it. You should have it within a few days (a week at most). I'll implement it and share the documentation link here.

OK, I'll wait for the good news. Does GitHub Actions support this syntax?

Nick-0314 commented 1 year ago

Another restriction is that the runner label must begin with cirun, but this doesn't seem to be mentioned in the documentation @aktech

Nick-0314 commented 1 year ago

Yes, this is the effect I want. Is there anything you need to do to make it happen?

Yes, I need to implement it. You should have it within a few days (a week at most). I'll implement it and share the documentation link here.

OK, I'll wait for the good news. Does GitHub Actions support this syntax?

Oh, I just tried it. GitHub supports this syntax.

aktech commented 1 year ago

Another restriction is that the runner label must begin with cirun, but this doesn't seem to be mentioned in the documentation

Thanks for pointing that out, I'll update the documentation; apologies for the inconvenience. Yes, that's important because it makes it easier for me to filter the webhook events for which a runner needs to be created; otherwise it would have been tricky.

OK, I'll wait for the good news. Does GitHub Actions support this syntax? Oh, I just tried it. GitHub supports this syntax.

Yep, I tried it as well. You'll hear from me soon. :)

Nick-0314 commented 1 year ago

Defining spot instances with multiple regions and multiple instance types in the .cirun file does not seem to work, and the spot request often stays open with 'no Spot capacity available', at which point cirun still considers the creation successful. @aktech

aktech commented 1 year ago

Yes, that's an outstanding bug. It will be fixed in the next release.

Nick-0314 commented 1 year ago

Does cirun support Google Cloud? AWS spot instances are billed by the hour, with a one-hour minimum, and our CI usually runs for about 10 minutes. I understand that GCP is billed by the minute. @aktech

aktech commented 1 year ago

Does cirun support Google Cloud?

Yes, it does.

AWS spot instances are billed by the hour, with a one-hour minimum,

Are you sure? To me it seems like you're charged for the seconds used: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/billing-for-interrupted-spot-instances.html

Nick-0314 commented 1 year ago

Does cirun support Google Cloud?

Yes, it does.

AWS spot instances are billed by the hour, with a one-hour minimum,

Are you sure? To me it seems like you're charged for the seconds used: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/billing-for-interrupted-spot-instances.html

I'll check with our sales contact. I think billing is by the hour.

aktech commented 1 year ago

I'll check with our sales contact. I think billing is by the hour.

Quite strange, let me know if you hear from them.

Nick-0314 commented 1 year ago

Quite strange, let me know if you hear from them.

Just confirmed that billing is by the second; I was misled by some AWS pages.

Nick-0314 commented 1 year ago

@aktech Hi, has there been any progress recently?

aktech commented 1 year ago

Hey @mytting not yet, I’m travelling at the moment. Expect it by the end of this week.

Nick-0314 commented 1 year ago

ok

aktech commented 1 year ago

Unique runner labels are available now: https://docs.cirun.io/reference/unique-runner-labels
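A minimal usage sketch, assuming the per-run label follows the same form as the example discussed earlier in this thread:

jobs:
  build:
    # Each workflow run requests its own runner, keyed by the run id
    runs-on: "cirun-aws-amd64-32c--${{ github.run_id }}"
    steps:
      - uses: actions/checkout@v3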

Nick-0314 commented 1 year ago

Recently, some open source projects have come to us asking how we run CI on AWS spot instances, and we recommended cirun. We will be promoting cirun; it has solved many of our pain points. It's great.

Nick-0314 commented 1 year ago

@aktech Hi, I have raised an issue against the GitHub documentation, hoping that excellent projects like Cirun can be added to the GitHub docs so that users can avoid detours: https://github.com/github/docs/issues/21697

In addition, job startup is now fairly consistently around 90 seconds, which may be a bit long. Is there a template for the user-data? We could ask the AWS folks whether there is any way to optimize it.

aktech commented 1 year ago

Hey @mytting thanks a lot for that, I really appreciate it. There isn't a specific template for it, but it's fairly simple: it pulls the GitHub Actions runner software, installs it, and creates a user, and it really doesn't take much time. For example, here are the logs of one of the random runners on this repo; it took about 15 seconds for the user-data script to run.

My suspicion is that it's AWS; the time they take to hand over a VM is quite slow. Let me know if you have more questions. I am happy to jump on a call with you and AWS to see where the bottlenecks are.
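For reference, a rough sketch of what such a user-data script typically does (this is not Cirun's actual script; the runner version, repository URL, and registration token are placeholders):

#cloud-config
# Hypothetical cloud-init user-data: create a user, download the GitHub Actions
# runner, register it with a cirun-style label, and start it as a service.
users:
  - name: runner
    shell: /bin/bash
    sudo: "ALL=(ALL) NOPASSWD:ALL"
runcmd:
  - mkdir -p /home/runner/actions-runner && cd /home/runner/actions-runner
  - curl -sSL -o runner.tar.gz https://github.com/actions/runner/releases/download/v2.300.2/actions-runner-linux-x64-2.300.2.tar.gz
  - tar xzf runner.tar.gz && chown -R runner:runner /home/runner/actions-runner
  - sudo -u runner ./config.sh --unattended --url https://github.com/OWNER/REPO --token RUNNER_REGISTRATION_TOKEN --labels cirun-aws-amd64-32c
  - ./svc.sh install runner && ./svc.sh start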

Nick-0314 commented 1 year ago

OK, I have a general understanding. Is there a generic script? How is it passed to EC2? I can have the AWS folks debug it.

Nick-0314 commented 1 year ago

Hey @mytting thanks a lot for that, I really appreciate it. There isn't a specific template for it, but it's fairly simple: it pulls the GitHub Actions runner software, installs it, and creates a user, and it really doesn't take much time. For example, here are the logs of one of the random runners on this repo; it took about 15 seconds for the user-data script to run.

My suspicion is that it's AWS; the time they take to hand over a VM is quite slow. Let me know if you have more questions. I am happy to jump on a call with you and AWS to see where the bottlenecks are.

It seems that after the runner is registered it first shows as Idle, then switches to Offline, and then becomes Active.

Nick-0314 commented 1 year ago

I tried it: it took about 25 seconds from creating the EC2 instance to being able to SSH in. Could downloading the runner be slow for network reasons?

aktech commented 1 year ago

OK, I have a general understanding. Is there a generic script? How is it passed to EC2? I can have the AWS folks debug it.

I can try to create one for you.

It seems that after the runner is registered it first shows as Idle, then switches to Offline, and then becomes Active.

Ah, interesting.

I tried it: it took about 25 seconds from creating the EC2 instance to being able to SSH in. Could downloading the runner be slow for network reasons?

Did you create it via the API? Can you share the script? If that's the case, then it might be something on our end. I'm happy to take a look later this week.

Nick-0314 commented 1 year ago

Manually created...

Nick-0314 commented 1 year ago

I mean I created it manually and didn't pass in any user-data.

aktech commented 1 year ago

I mean I created it manually and didn't pass in any user-data.

Ah, OK. I'll debug ours and let you know.

Nick-0314 commented 1 year ago

ok

Nick-0314 commented 1 year ago

title: DeepFlow Accelerates GitHub Action Exploration Using Spot Instances
date: 2022/11/01
author: Song Jianchang
avatar:
cover: https://yunshan-guangzhou.oss-cn-beijing.aliyuncs.com/pub/pic/20221027635a6171c75b3.png
excerpt:

GitHub Actions makes CI very convenient for projects hosted on GitHub, but the default 2C7G runner provided by GitHub is too under-powered for the compilation tasks of some large projects. This article describes DeepFlow's exploration of using high-spec, low-cost public cloud Spot instances to accelerate Actions. After stepping into a series of pitfalls, we finally found an ideal solution that meets all of our needs for performance, cost, ARM support, and more. We hope it will be useful to you.

0x0: Problems with GitHub Actions

Since DeepFlow's open source code was pushed to GitHub, we have run into the problem that GitHub Actions compile jobs take too long because the hosted runners are under-powered. Before that, the Alibaba Cloud 32C ECI Spot instances used by our internal GitLab CI could run all jobs in a few minutes (the specific method will be introduced in a separate article later). After seeing the difference Spot instances made to our GitLab CI, we have been looking, from the first day DeepFlow's GitHub Actions went live, for a solution that assigns each job an independent runner and supports both X86 and ARM64 architectures. The process was not smooth, but after five iterations we finally found an ideal solution.

Some problems encountered by DeepFlow in the early stage of using GitHub Action:

  1. Poor performance: the GitHub-hosted runner configuration is too low and the compilation phase takes too long
  2. Poor flexibility: fixed self-hosted runners cannot be scaled dynamically; jobs are often queued, machines sit idle for long periods, and sizing a monthly or annual subscription for peak load is not cost-effective
  3. High cost: ARM64 compilation requires a separate machine; Alibaba Cloud does not offer ARM64 instances in overseas regions, and AWS's subscription price for ARM64 machines is relatively high (about $500 per month for 32C64G)
  4. Network instability: GitHub-hosted runners often time out when pushing images to Alibaba Cloud registries in China

Our needs:

0x1: Accelerate the exploration of GitHub Actions

Based on our needs and the GitHub Actions community documentation, we found the following candidate solutions (compared in the table below):

  1. K8s Controller: Kubernetes controller for GitHub Actions self-hosted runner
  2. Terraform: Autoscale AWS EC2 as GitHub Runner with Terraform and AWS Lambda
  3. Github: Paid Larger Runners service currently only open to GitHub Team and Enterprise organizations
  4. Cirun: Automatically scale VMs of cloud platforms such as AWS/GCP/AZURE/OpenStack as GitHub Runner
|                              | K8s Controller | Terraform      | GitHub Larger Runners | Cirun                      |
| ---------------------------- | -------------- | -------------- | --------------------- | -------------------------- |
| Runner                       | Container      | Linux, Windows | Linux, Windows, Mac   | Linux, Windows, Mac        |
| Supported Cloud Platforms    | Kubernetes     | AWS            | -                     | AWS, GCP, Azure, OpenStack |
| ARM64 support                | Supported      | Supported      | Not supported         | Supported                  |
| Spot support                 | Not supported  | Supported      | -                     | Supported                  |
| Deployment/Maintenance Cost  | Medium         | High           | None                  | None                       |

The first solution we tried was K8s Controller. After trying it out, we found the following defects:

Unless Fargate can be used, dedicated nodes have to be prepared, and there is no way to dynamically scale nodes or to use pay-as-you-go and Spot instances.

Next we tried the Terraform solution, but also encountered some setbacks:

The GitHub solution does not support ARM64 instances, so we passed on it directly.

In the end we chose Cirun:

Cirun supports customization of arbitrary machine specifications, architectures, and images. It is free for open source projects, requires no deployment, maintenance, or additional resources, and is very simple to operate:

Step 1: Install the App. Install the Cirun app from the GitHub Marketplace.

(Screenshot: Install Cirun)

Step 2: Add Repo. Add the required repository in the Cirun console.

(Screenshot: Add Repo)

Step 3: Configure AK/SK. Configure the AWS Access Key and Secret Key in the Cirun console.

(Screenshot: AWS authentication)

Step 4: Configure Machine Specifications. Machine specifications and runner labels are defined in the .cirun configuration file in the GitHub repo:

runners:
  - name: "aws-amd64-32c"
    cloud: "aws"
    instance_type: "c6id.8xlarge"
    machine_image: "ami-097a2df4ac947655f"
    preemptible: true
    labels:
      - "aws-amd64-32c"
  - name: "aws-arm64-32c"
    cloud: "aws"
    instance_type: "c6g.8xlarge"
    machine_image: "ami-0a9790c5a531163ee"
    preemptible: true
    labels:
      - "aws-arm64-32c"

Step 5: Getting Started. Switch the GitHub job's runs-on field:

jobs:
  build_agent:
    name: build agent
    runs-on: "cirun-aws-amd64-32c--${{ github.run_id }}"
    steps:
      - name: Checkout
        uses: actions/checkout@v3
        with:
          submodules: recursive
          fetch-depth: 0
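
For the ARM64 runner defined in Step 4, the job looks the same apart from the runs-on label (a sketch, assuming the same cirun- label prefix convention):

jobs:
  build_agent_arm64:
    name: build agent arm64
    # Same pattern as above, but targeting the ARM64 runner label from .cirun.yml
    runs-on: "cirun-aws-arm64-32c--${{ github.run_id }}"
    steps:
      - name: Checkout
        uses: actions/checkout@v3
        with:
          submodules: recursive
          fetch-depth: 0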

Final effect:

(Screenshots: GitHub Runner; AWS Instance)

0x2: Changes after DeepFlow uses Cirun

Currently DeepFlow uses AWS 32C64G Spot instances to run CI in parallel, with an average monthly spend of about $300. With a monthly subscription, the same spend would buy only two 16C32G X86/ARM64 instances, and once there are parallel tasks you have to wait in a long queue.

DeepFlow's main CI pipelines have been switched to Cirun's runners, meeting all of our previous expectations:

0x3: Future Outlook

We also encountered some problems while using Cirun, all of which have been handled well by the author; see the Issues for details:

There is also some work in progress:

0x4: What are Spot Instances

Quoting the introduction from the AWS official website:

  • The only difference between On-Demand Instances and Spot Instances is that when EC2 needs more capacity, it interrupts Spot Instances with a two-minute notice. You can use EC2 Spot for a variety of fault-tolerant and flexible applications, such as test and development environments, stateless web servers, image rendering, video transcoding, analytics, machine learning, and high-performance computing (HPC) workloads. EC2 Spot also tightly integrates with other AWS products, including EMR, Auto Scaling, Elastic Container Service (ECS), CloudFormation, and more, giving you flexibility in how to launch and maintain applications running on Spot Instances.
  • Spot Instances are a new way to buy and use Amazon EC2 instances. The spot price of Spot Instances changes periodically based on supply and demand. You can launch Spot Instances directly in much the same way as purchasing On-Demand Instances, and the price is determined by supply and demand (never exceeding the On-Demand price); users can also set a maximum price, and such instances run whenever the set maximum price is higher than the current spot price. Spot Instances complement On-Demand and Reserved Instances and provide another option for obtaining compute capacity.

0x5: What is DeepFlow

DeepFlow is an open source, highly automated observability platform with a full-link, high-performance data engine. DeepFlow uses new technologies such as eBPF, WASM, and OpenTelemetry, and innovatively implements core mechanisms such as AutoTracing, AutoMetrics, AutoTagging, and SmartEncoding, helping developers raise the level of automation in instrumentation and reduce the operational complexity of the observability platform. Using DeepFlow's programmability and open interfaces, developers can quickly integrate it into their own observability stack.

GitHub address: https://github.com/deepflowys/deepflow

Visit the DeepFlow Demo to experience a new era of highly automated observability.


@aktech

This is a recent article I wrote to promote cirun. Can you give me some advice? Can you also give a brief introduction to what Cirun is? For example, "What is Cirun".

aktech commented 1 year ago

Hey @mytting

This is a recent article I wrote to promote cirun. Can you give me some advice?

That looks pretty good, thanks for writing this. It would be really useful for folks who want to try different strategies. I think it would be interesting to add some cost numbers across the different strategies. Also, there is a formatting issue with your table:

Syntax should be something like this:

| Syntax      | Description |
| ----------- | ----------- |
| Header      | Title       |
| Paragraph   | Text        |

Preview:

Syntax Description
Header Title
Paragraph Text

Can you also give a brief introduction to what cirun is? example Watt is Cirun

A brief introduction would be something like this:

Cirun is a way for developers and teams to run their CI/CD pipelines on their own secure cloud infrastructure via GitHub Actions. The project aims to provide the freedom to choose cloud machines with any configuration, save money by using low-cost instances, and save time by enabling unlimited concurrency and performant machines, all with a simple, developer-friendly YAML file. It currently supports all major clouds, including GCP, AWS, Oracle, DigitalOcean, Azure, and on-premise cloud via OpenStack. Cirun is completely free for open source projects without any restrictions.

Nick-0314 commented 1 year ago

thanks

Nick-0314 commented 1 year ago

Yes, that's an outstanding bug. It will be fixed in the next release.

@aktech Hello, is there any progress on this issue? In recent days, X86 machines have also been hitting open spot requests, which often affects CI.

aktech commented 1 year ago

Hey @mytting I haven't had the chance yet. I'm hoping to work on it this weekend. Apologies for the delay.

aktech commented 1 year ago

By the way, has the blog post you mentioned above been published already? If yes, can you share the link please?

Nick-0314 commented 1 year ago

Yes, but only in Chinese. Is it convenient for you to check it out and share it? https://mp.weixin.qq.com/s/26qbfq7bBmmgOk_NFWVUow https://deepflow.yunshan.net/blog/015-deepflow-uses-spot-Instances-to-speed-up-github-action-exploration/

aktech commented 1 year ago

Awesome, thanks a lot! I just used Google Translate to view the page. Is it possible for you to post it in English as well (even just the Google translation) somewhere like, say, dev.to? I am happy to review the translation.

Nick-0314 commented 1 year ago

I'll try it next week.

aktech commented 1 year ago

Sure, no hurry.

aktech commented 1 year ago

I wrote a twitter thread, quoting from your blog: https://twitter.com/iaktech/status/1593574852241154049

aktech commented 1 year ago

@mytting I have changed the backend to use the latest CreateFleet API of AWS for creating spot instances: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-best-practices.html#which-spot-request-method-to-use

I did not see much difference in runner creation time for smaller instances (like t2.medium); let me know if you see any difference for your larger instances. I have also done some profiling to see if there are any bottlenecks elsewhere; so far I haven't found any apart from the runner creation on AWS itself.

Nick-0314 commented 1 year ago

@aktech Are the launch parameters passed through user-data? Would it be convenient to provide a generic user-data script? Let me test the difference between adding user-data and not.

Nick-0314 commented 1 year ago

https://dev.to/dundun/deepflow-uses-spot-instances-to-speed-up-github-action-exploration-2a90

aktech commented 1 year ago

@aktech Are the launch parameters passed through user-data? Would it be convenient to provide a generic user-data script? Let me test the difference between adding user-data and not.

So, the launch template is created first (with the user-data) via create_launch_template, and then it is passed to create_fleet.

https://dev.to/dundun/deepflow-uses-spot-instances-to-speed-up-github-action-exploration-2a90

Excellent, thanks a lot!

Nick-0314 commented 1 year ago

@aktech Are the launch parameters passed through user-data? Would it be convenient to provide a generic user-data script? Let me test the difference between adding user-data and not.

So, the launch template is created first (with the user-data) via create_launch_template, and then it is passed to create_fleet.

https://dev.to/dundun/deepflow-uses-spot-instances-to-speed-up-github-action-exploration-2a90

Excellent, thanks a lot!

OK, I'm curious what the general content of the user-data is. The VM I created in the console started up quickly.

aktech commented 1 year ago

OK, I'm curious what the general content of the user-data is. The VM I created in the console started up quickly.

Sure, I can share that. Can you join the Slack here? I'll DM you after I create a minimal example.

Nick-0314 commented 1 year ago

ok