crossplane-contrib / provider-upjet-aws

Official AWS Provider for Crossplane by Upbound.
https://marketplace.upbound.io/providers/upbound/provider-aws
Apache License 2.0

Decrease linter's memory usage #1194

ulucinar closed this 6 months ago

ulucinar commented 7 months ago

Description of your changes

Depends on: https://github.com/upbound/uptest/pull/187

Historical Context & Problem Statement

The linter jobs in the upbound/provider-aws repository have been facing recurring failures. Initially, these failures were mitigated by switching to larger self-hosted runners (runners with the e2-standard-8 label in the upbound organization), but the issues resurfaced due to a performance regression in the musttag linter. We then upgraded golangci-lint to v1.54.0, which resolved that specific regression, but we later encountered further failures, prompting a switch to even larger runners (those with the Ubuntu-Jumbo-Runner label), which we currently use only for the linter job. Despite these adjustments, the linter jobs have started failing again, primarily due to high peak memory consumption during linting: runs on a cold analysis cache consume over 50 GB of peak memory, depending on the number of available CPU cores.

Investigation & Findings

The substantial memory usage was traced back to the linter runner's analysis phase. We considered, and investigated some of, the following potential remediations for the linter issues we have been experiencing:

Implemented Strategy in this PR

The linting now runs in two phases: an initial cache construction phase that populates golangci-lint's analysis cache while keeping peak memory bounded, followed by a full linting phase that reuses that cache. The analysis cache expires in 7 days or whenever the module's go.sum changes. If the analysis cache can successfully be restored, the initial cache construction phase is skipped and only the full linting, with the maximum available concurrency, is performed.
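A minimal workflow sketch of this idea follows, assuming the actions/cache action and standard golangci-lint flags; the step names, cache path, key format, and the use of --concurrency to bound the construction phase are illustrative assumptions, not the PR's exact mechanism (which leverages the buildtagger-generated tags described below):

```yaml
# Hedged sketch only; not this PR's exact workflow.
- name: Restore the golangci-lint analysis cache
  id: lint-cache
  uses: actions/cache@v3
  with:
    # golangci-lint's default analysis cache location on Linux.
    path: ~/.cache/golangci-lint
    # Keying on go.sum invalidates the cache when dependencies change;
    # GitHub also evicts caches that go unused for 7 days.
    key: lint-analysis-cache-${{ hashFiles('go.sum') }}

- name: Construct the analysis cache (phase 1)
  # Skipped entirely when the cache was restored above.
  if: steps.lint-cache.outputs.cache-hit != 'true'
  # Bounding concurrency bounds how many packages are analyzed in
  # parallel, and hence this phase's peak memory consumption.
  run: golangci-lint run --concurrency 2 ./...

- name: Run the linters (phase 2)
  # Full linting with the maximum available concurrency, now served
  # largely from the warm analysis cache.
  run: golangci-lint run ./...
```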

For generating the build constraints (tags), we use the buildtagger tool. We currently don’t utilize the build tags for building the resource providers in isolation because:
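For illustration only, a group-scoped lint invocation using such generated tags might look as follows; the tag name ec2 and the assumption that a group's generated files carry a matching //go:build constraint are hypothetical, since the exact tag scheme is whatever buildtagger emits:

```yaml
- name: Lint a single API group (illustrative only)
  # Assumes the group's generated files carry a constraint such as
  # "//go:build ec2", so only those files enter the analysis.
  run: golangci-lint run --build-tags ec2 ./...
```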

Observed Improvements

In an example run with the two-phase strategy, the cache construction phase consumed a peak of ~13 GB of memory and the full linting phase a peak of 24.3 GB, which corresponds to a ~57% reduction in peak memory consumption compared to a single-phase run of the linters on the same machine (an M2 Mac with 12 cores). The total execution time of both phases is ~14 min, about the same as a single-phase run: with a cold analysis cache, the single-phase run had a peak memory consumption of ~57 GB and also took ~14 min.

Here are results from example runs on an M2 Mac with 12 logical cores and 32 GB of physical memory:

  1. Single-phase linter run (without the proposed initial cache construction phase) on a cold analysis cache at the main branch's HEAD (commit fb0fb486e6225cdab27a447c48cb36f98464884e), with linter runner version v1.55.2: average memory consumption is 40560.9 MB and the maximum is 58586.2 MB. Execution took 13m49.188064208s.

  2. Single-phase linter run with a warm analysis cache and otherwise the same parameters as above: average memory consumption is 104.5 MB and the maximum is 191.1 MB. Execution took 7.325796125s.

  3. Two-phase linter run on a cold analysis cache at the main branch's HEAD (commit fb0fb486e6225cdab27a447c48cb36f98464884e), with linter runner version v1.55.2: in the cache construction phase, the average memory consumption is 9272.4 MB and the peak is 13301.6 MB; execution of the first phase took 11m13.023597833s. In the second phase (full linting with all the available CPU cores), the average memory consumption is 9331.7 MB and the peak is 24904.9 MB; execution of the second phase took 3m10.447865375s.
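As an aside on methodology: the figures above were collected locally on macOS by a method not shown here, but a rough peak-memory reading for a CI run can be obtained with GNU time, as in the hedged sketch below:

```yaml
- name: Lint with peak-memory reporting (illustrative only)
  # GNU time's -v flag prints "Maximum resident set size" when the
  # command exits; average consumption would require periodic sampling.
  run: /usr/bin/time -v golangci-lint run ./...
```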

The linter job now fits into a standard GitHub-hosted runner with the label ubuntu-22.04 (16 GB of physical memory & 4 cores). So, in preparation for moving provider-aws out of the upbound GitHub organization, this PR also changes the lint job's runner from Ubuntu-Jumbo-Runner to ubuntu-22.04.
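In workflow terms, the runner change amounts to something like the sketch below; the job name and the previous runs-on value are assumptions about the workflow file, not verbatim from this PR:

```yaml
jobs:
  lint:
    # Previously a larger self-hosted runner, e.g.:
    #   runs-on: Ubuntu-Jumbo-Runner
    # Now a standard GitHub-hosted runner (4 cores, 16 GB memory):
    runs-on: ubuntu-22.04
```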

Developer Experience

I have:

How has this code been tested

ulucinar commented 7 months ago

/test-examples="examples/eks/v1beta1/cluster.yaml"

ulucinar commented 7 months ago

/test-examples="examples/ec2/v1beta1/vpc.yaml"