Description of your changes
Depends on: https://github.com/upbound/uptest/pull/187
Historical Context & Problem Statement
The linter jobs in the upbound/provider-aws repository have been facing recurring failures. Initially, these failures were mitigated by switching to larger self-hosted runners (runners with the e2-standard-8 labels in the upbound organization), but the issues resurfaced due to a performance regression in the musttag linter. Subsequently, we upgraded the golangci-lint version to v1.54.0, which resolved the performance regression in that specific linter. Later, we encountered further challenges, prompting a switch to even larger runners, which we currently use only for the linter job, with the label Ubuntu-Jumbo-Runner. Despite these adjustments, the linter job has started failing again, primarily due to high peak memory consumption during linting: cold analysis cache runs consume over 50 GB of peak memory, depending on the number of available CPU cores.
Investigation & Findings
The substantial memory usage was traced back to the linter runner's analysis phase. We considered (and investigated some of) the following potential remediations for the linter issues we've been experiencing:
Configuring skip-files to exclude, for instance, the generated files, as we currently do for the official provider repos, does not help: it only suppresses the reporting of issues in those files. All the generated files are still analyzed, and that analysis of the source code is the memory-consuming part.
Bumping the linter runner to the latest version, in the hope of benefiting from upstream improvements. We tried this; memory consumption is still high.
Trying different sets of enabled linters and examining the resulting memory profiles. This has potential, but the processing needed for the analysis is common to all linters and probably defines a lower bound on the memory consumed: I've seen ~20 GB of memory consumption even with a single linter configured, and configuring the linter runner with only the default set of linters (errcheck, gosimple, govet, ineffassign, staticcheck, unused) still consumes over 30 GB. One idea could be to partition the currently enabled linters across multiple jobs and hope that no job hits the memory limits.
Employing build tags, which, unlike the skip-files option, actually make the analysis less costly (see the sketch after this list).
Switching to even larger runners: Ideally, we would like to avoid this, especially because we will also be moving the provider repositories out of the upbound GitHub organization.
Decreasing the concurrency of the linter runner (e.g., via golangci-lint's --concurrency flag): By default, the runner uses all the logical cores available to it. Limiting concurrency, as we did for building & pushing the provider families to limit the load on our package registry, may help here at the expense of increased runtime. The common processing phase the runner performs before the analysis phase could be a limiting factor (please see above).
Increasing the pressure on the garbage collector (e.g., by lowering the GOGC environment variable): A common technique, at the expense of increased CPU consumption and runtime.
Disabling certain linters: We are also trying to avoid this, but if we identify a single linter that breaks the memory limit, it might be worth doing.
Breaking up the family provider repos that need it: This is problematic because the resource providers of the same family are coupled (common provider configuration, cross-resource references, etc.), although we already had to decrease this coupling (specifically, the coupling induced by cross-resource references) for multi-version CRD support. There is also the maintenance overhead of the extra repositories. Furthermore, we can achieve a similar effect via build tags (please see above).
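To make the build-tag option concrete, here is a minimal sketch, with a hypothetical tag name and constraint direction: a file guarded by a build constraint is dropped from the package graph before type checking, so no analysis memory is spent on it at all, whereas a file matched by skip-files is still fully analyzed and only its findings are suppressed.

```go
//go:build !lintphase1

// Hypothetical sketch: when the linter runs with
// `golangci-lint run --build-tags lintphase1`, this file's constraint
// evaluates to false, so the file is excluded from the package graph and is
// neither type-checked nor analyzed. A plain run (no tag set) includes it
// again. With skip-files, the file would be analyzed either way.
package apis
```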
Implemented Strategy in this PR
Run the linter in two phases:
Employ build constraints (tags) for the initial construction of the linter runner's analysis cache, and limit the concurrency to 1 while constructing this cache. During cache construction, we avoid the two hotspots in the provider-aws repo: API registration in apis/zz_register.go and API group configuration in config/provider.go.
After the cache is constructed, run the linters on the whole provider codebase with the maximum available concurrency.
The analysis cache expires in 7 days or when the module's go.sum changes. If the analysis cache can be restored successfully, the initial cache-construction phase is skipped and only the full linting with maximum concurrency is performed (see the sketch below).
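A minimal sketch of the two-phase flow follows. The lintphase1 tag, the CACHE_RESTORED environment variable, and the driver program itself are all hypothetical stand-ins for the actual make target and workflow wiring; only the golangci-lint flags (--build-tags, --concurrency) are real:

```go
package main

import (
	"os"
	"os/exec"
	"runtime"
	"strconv"
)

// lint shells out to `golangci-lint run` with the given extra arguments.
func lint(args ...string) error {
	cmd := exec.Command("golangci-lint", append([]string{"run"}, args...)...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	// Phase 1, run only when the analysis cache was not restored: setting the
	// hypothetical lintphase1 tag excludes the hotspot files (which would
	// carry a "//go:build !lintphase1" constraint), and concurrency is pinned
	// to 1. Findings are ignored; this run only populates the analysis cache.
	if os.Getenv("CACHE_RESTORED") != "true" {
		_ = lint("--build-tags", "lintphase1", "--concurrency", "1")
	}
	// Phase 2: full linting of the whole codebase with all logical cores,
	// now benefiting from the warm analysis cache.
	if err := lint("--concurrency", strconv.Itoa(runtime.NumCPU())); err != nil {
		os.Exit(1)
	}
}
```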
For generating the build constraints (tags), we use the buildtagger tool. We currently don’t utilize the build tags for building the resource providers in isolation because:
We currently don't need it: the actual build process is not as resource-hungry as linting.
It impacts the developer experience: it has implications for IDE settings, it would require changes to our build pipelines, etc.
We would need to implement dependency resolution for cross-resource references, which we have so far avoided for the family providers (increased complexity), or we would need to require manual maintenance of dependencies, which is error-prone.
Observed Improvements
In an example run with the two-phase strategy, the cache-construction phase consumed a peak memory of ~13 GB and the full linting phase consumed a peak of ~24.3 GB, which corresponds to a ~57% reduction in peak memory consumption compared to a single-phase run of the linters on the same machine (an M2 Mac with 12 cores). The total execution time of the two phases is ~14 min, which is about the same time it takes the linters to run in a single phase (a single-phase run on a cold analysis cache peaked at ~57 GB of memory and took ~14 min):
Here are results from example runs on an M2 Mac with 12 logical cores and 32 GB of physical memory:
Single-phase linter run (without the proposed initial cache-construction phase) on a cold analysis cache at the main branch's HEAD (commit: fb0fb486e6225cdab27a447c48cb36f98464884e), linter runner version v1.55.2:
Average memory consumption is 40560.9MB, max is 58586.2MB. Execution took 13m49.188064208s.
Single-phase linter run with a warm analysis cache, with the same parameters as above:
Average memory consumption is 104.5MB, max is 191.1MB. Execution took 7.325796125s.
Two-phase linter run on a cold analysis cache at the main branch's HEAD (commit: fb0fb486e6225cdab27a447c48cb36f98464884e), linter runner version v1.55.2:
Average memory consumption in the cache construction phase is 9272.4MB and the peak consumption is 13301.6MB.
Execution of the first phase took 11m13.023597833s. For the second phase (full linting with all the available CPU cores), the average memory consumption is 9331.7MB and the peak is 24904.9MB. The execution of the second phase took 3m10.447865375s.
The linter job now fits into a standard GitHub-hosted runner with the label ubuntu-22.04 (16 GB of physical memory & 4 cores). So, in preparation for moving provider-aws out of the upbound GitHub organization, this PR also changes the lint job's runner from Ubuntu-Jumbo-Runner to ubuntu-22.04.
Developer Experience
API group configuration is implemented with slight differences:
The actual configuration still resides under config/<API group prefix, e.g., acm>/config.go. There are no upjet resource configuration API changes, i.e., the contents of config.go stay the same, and because the build tags for these files are automatically generated, there's no need to manually tag them.
Registration of the API groups' Configure functions is still done in config/provider.go, but in a slightly different way. The new registration method is as follows:
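A sketch of what such a registration could look like; the tag, the helper name, and the import paths are illustrative assumptions, not the repository's actual code:

```go
//go:build !lintphase1

// Hypothetical sketch of the registration in config/provider.go: the group
// Configure functions are still wired up here, but from a file carrying a
// generated build constraint so that the phase-1 cache construction can skip
// this hotspot.
package config

import (
	ujconfig "github.com/crossplane/upjet/pkg/config" // assumed upjet import path

	"github.com/upbound/provider-aws/config/acm"
	"github.com/upbound/provider-aws/config/acmpca"
)

// configureAPIGroups applies each API group's Configure function to the
// provider configuration.
func configureAPIGroups(p *ujconfig.Provider) {
	for _, configure := range []func(*ujconfig.Provider){
		acm.Configure,
		acmpca.Configure,
		// ... one entry per API group
	} {
		configure(p)
	}
}
```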
make lint can be run with its old semantics. While the linter is running, the generated build tags will be visible in the source tree; if the run is not interrupted, these tags are removed by the make target (see the sketch after this list).
make build can be run with its old semantics & behavior.
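For the tag-cleanup behavior mentioned above, here is a minimal sketch of how the generated constraints could be stripped from the tree once linting finishes; the tag name and the program are hypothetical (the actual cleanup is handled by the make target via the buildtagger tool):

```go
package main

import (
	"bytes"
	"log"
	"os"
)

// Removes the hypothetical generated "//go:build !lintphase1" constraint
// (plus the blank line that must follow a build constraint) from each Go file
// passed as an argument, restoring the untagged source tree.
func main() {
	for _, path := range os.Args[1:] {
		src, err := os.ReadFile(path)
		if err != nil {
			log.Fatal(err)
		}
		out := bytes.Replace(src, []byte("//go:build !lintphase1\n\n"), nil, 1)
		if err := os.WriteFile(path, out, 0o644); err != nil {
			log.Fatal(err)
		}
	}
}
```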
I have:
[x] Run make reviewable test to ensure this PR is ready for review.
How has this code been tested
The linter runner can successfully report issues: https://github.com/upbound/provider-aws/actions/runs/8196350610/job/22416386815?pr=1194
If the linter runner's analysis cache is not restored, it will run to populate the cache without reporting any issues (even if there actually are issues): https://github.com/upbound/provider-aws/actions/runs/8196350610/job/22416386815?pr=1194
If the linter runner's analysis cache could be successfully restored, then the cache population step is skipped: https://github.com/upbound/provider-aws/actions/runs/8196537292/job/22416975348?pr=1194
Uptest run for Cluster.eks: https://github.com/upbound/provider-aws/actions/runs/8209193860