enix / x509-certificate-exporter

A Prometheus exporter to monitor x509 certificates expiration in Kubernetes clusters or standalone
MIT License
632 stars 64 forks source link

Do not materialise labels when comparing certs #256

Closed kokes closed 5 months ago

kokes commented 6 months ago

Hey, similar to some other issues submitted recently, we faced OOM kills when running the exporter with a high number of k8s secrets. This PR seems to resolve #255 (the issue contains a reproducer).

While I thought the issue was the extremely memory intensive method getLabels, it was in a way just a part of the issue. While this method does allocate heavily, the main problem is that the exporter itself is slow. And since it doesn't have a context, it is not cancelled and many concurrect collections happen, all allocating and the process slowly gets OOM killed as memory mounts.

The new codebase allows for much faster parsing of certs (or, more specifically, their deduplication). This then makes collection pretty much instant and no memory gets accumulated.

Here's how this looked and how it looks now with the fix:

Screenshot 2024-03-04 at 11 07 54@2x

High level perf stats

Original memory profile (when just parsing certs):

(pprof) top
Showing nodes accounting for 3186.67MB, 98.59% of 3232.31MB total
Dropped 125 nodes (cum <= 16.16MB)
Showing top 10 nodes out of 17
      flat  flat%   sum%        cum   cum%
 1872.50MB 57.93% 57.93%  2059.50MB 63.72%  github.com/enix/x509-certificate-exporter/v3/internal.fillLabelsFromName
  448.05MB 13.86% 71.79%   662.07MB 20.48%  path.Join
  323.58MB 10.01% 81.80%  1126.67MB 34.86%  github.com/enix/x509-certificate-exporter/v3/internal.(*Exporter).getBaseLabels
  186.50MB  5.77% 87.57%      187MB  5.79%  fmt.Sprintf
  141.02MB  4.36% 91.94%   141.02MB  4.36%  strings.genSplit
  113.51MB  3.51% 95.45%   113.51MB  3.51%  path.(*lazybuf).string (inline)
  100.51MB  3.11% 98.56%   100.51MB  3.11%  path.(*lazybuf).append (inline)
       1MB 0.031% 98.59%       23MB  0.71%  crypto/x509.CreateCertificate
         0     0% 98.59%  3186.17MB 98.57%  github.com/enix/x509-certificate-exporter/v3/internal.(*Exporter).compareCertificates
         0     0% 98.59%  3186.17MB 98.57%  github.com/enix/x509-certificate-exporter/v3/internal.(*Exporter).getLabels

New memory profile:

(pprof) top
Showing nodes accounting for 187.62MB, 63.16% of 297.06MB total
Dropped 88 nodes (cum <= 1.49MB)
Showing top 10 nodes out of 133
      flat  flat%   sum%        cum   cum%
      40MB 13.47% 13.47%       40MB 13.47%  github.com/prometheus/client_golang/prometheus.MakeLabelPairs
   36.01MB 12.12% 25.59%    36.01MB 12.12%  github.com/prometheus/client_golang/prometheus.v2.NewDesc
   19.52MB  6.57% 32.16%    66.53MB 22.40%  crypto/x509.parseCertificate
   19.01MB  6.40% 38.56%    19.51MB  6.57%  github.com/enix/x509-certificate-exporter/v3/internal.fillLabelsFromName
      19MB  6.40% 44.96%    30.50MB 10.27%  encoding/asn1.makeField
   17.50MB  5.89% 50.85%    17.50MB  5.89%  crypto/x509/pkix.(*Name).FillFromRDNSequence
      12MB  4.04% 54.89%    17.50MB  5.89%  crypto/x509.parseName
       9MB  3.03% 57.92%        9MB  3.03%  vendor/golang.org/x/crypto/cryptobyte.(*String).ReadASN1ObjectIdentifier
    8.50MB  2.86% 60.78%     8.50MB  2.86%  encoding/pem.Decode
    7.05MB  2.37% 63.16%     9.32MB  3.14%  compress/flate.NewWriter

Benchmark comparison (before/after):

x509-certificate-exporter okokes$ benchstat before.txt after.txt
goos: darwin
goarch: arm64
pkg: github.com/enix/x509-certificate-exporter/v3/internal
                      │   before.txt   │               after.txt                │
                      │     sec/op     │    sec/op     vs base                  │
ParsingCertificates-8   2685.78m ± 14%   39.92m ±  7%  -98.51% (p=0.000 n=10+7)

                      │   before.txt    │               after.txt                │
                      │      B/op       │     B/op      vs base                  │
ParsingCertificates-8   3182.520Mi ± 0%   9.326Mi ± 0%  -99.71% (p=0.000 n=10+7)

                      │  before.txt   │               after.txt               │
                      │   allocs/op   │  allocs/op   vs base                  │
ParsingCertificates-8   36376.7k ± 0%   125.1k ± 0%  -99.66% (p=0.000 n=10+7)

Changes made

I have some other changes drafted, but they have relatively low impact. Here are some numbers on the trimComponents part, which was perhaps the biggest offendor after the getLabels stuff was sorted.

x509-certificate-exporter okokes$ benchstat before.txt after.txt
goos: darwin
goarch: arm64
pkg: github.com/enix/x509-certificate-exporter/v3/internal
                      │ before.txt  │              after.txt              │
                      │   sec/op    │   sec/op     vs base                │
ParsingCertificates-8   2.686 ± 14%   2.031 ± 12%  -24.40% (p=0.000 n=10)

                      │  before.txt  │              after.txt               │
                      │     B/op     │     B/op      vs base                │
ParsingCertificates-8   3.108Gi ± 0%   2.348Gi ± 0%  -24.44% (p=0.000 n=10)

                      │ before.txt  │              after.txt              │
                      │  allocs/op  │  allocs/op   vs base                │
ParsingCertificates-8   36.38M ± 0%   29.38M ± 0%  -19.23% (p=0.000 n=10)
kokes commented 6 months ago

Note that I have not run the test suite, because it fails to run not just for this PR, but also for main (I guess my environment is somehow broken).

Can you please run it, @npdgm? Thank you

kokes commented 6 months ago

Oh never mind, I got the tests to run against a fresh kind cluster and they pass.

paullaffitte commented 5 months ago

Hi! Thank you very much for you work! Your explanations are well detailed and the optimization is clever. I think it may be a bit dangerous in a maintainability perspective, in a sense that if one day we modify the getLabels function, we will have to remember to update the compareCertificates as well. But I would say that the benefits are worth the risk. Anyway, I will try to find a way to tackle this issue later, but for now I will merge your work.

Thanks again! :rocket:

monkeynator commented 5 months ago

:tada: This PR is included in version 3.13.0 :tada:

The release is available on GitHub release

Your semantic-release bot :package::rocket: