containers / storage

Container Storage Library
Apache License 2.0
556 stars 237 forks source link

OOMs in zstd code with zstd:chunked #2055

Open cgwalters opened 1 month ago

cgwalters commented 1 month ago

I am consistently getting skopeo being OOM killed when running this to convert a local overlay image I have to an OCI zstd:chunked format (so I can easily inspect/debug it):

skopeo --dest-compress-format zstd:chunked containers-storage:localhost/bootc-bound oci:/var/home/walters/tmp/oci:bootc-zstd-chunked :arrow_right: [195791.594936] Out of memory: Killed process 8421 (skopeo) total-vm:12788772kB, anon-rss:7695604kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:16388kB oom_score_adj:0

And yeah, I only have 8G allocated to this podman-machine instance right now.

This being my first time needing to profile a Go application's memory usage, some searching on the internet helped me craft this patch to skopeo:

diff --git a/cmd/skopeo/main.go b/cmd/skopeo/main.go
index 8b8de3c7..64831579 100644
--- a/cmd/skopeo/main.go
+++ b/cmd/skopeo/main.go
@@ -3,6 +3,8 @@ package main
 import (
    "context"
    "fmt"
+   "os"
+   "runtime/pprof"
    "strings"
    "time"

@@ -33,6 +35,7 @@ type globalOptions struct {
    commandTimeout     time.Duration           // Timeout for the command execution
    registriesConfPath string                  // Path to the "registries.conf" file
    tmpDir             string                  // Path to use for big temporary files
+   memprofile         string                  // Path to use for memory profiling
 }

 // requireSubcommand returns an error if no sub command is provided
@@ -91,6 +94,7 @@ func createApp() (*cobra.Command, *globalOptions) {
        logrus.Fatal("unable to mark registries-conf flag as hidden")
    }
    rootCommand.PersistentFlags().StringVar(&opts.tmpDir, "tmpdir", "", "directory used to store temporary files")
+   rootCommand.PersistentFlags().StringVar(&opts.memprofile, "memprofile", "", "write memory profiles to this path")
    flag := commonFlag.OptionalBoolFlag(rootCommand.Flags(), &opts.tlsVerify, "tls-verify", "Require HTTPS and verify certificates when accessing the registry")
    flag.Hidden = true
    rootCommand.AddCommand(
@@ -120,6 +124,23 @@ func (opts *globalOptions) before(cmd *cobra.Command, args []string) error {
    if opts.tlsVerify.Present() {
        logrus.Warn("'--tls-verify' is deprecated, please set this on the specific subcommand")
    }
+   if opts.memprofile != "" {
+       go func() {
+           ticker := time.NewTicker(time.Second * 5)
+
+           for range ticker.C {
+               f, err := os.Create(opts.memprofile + ".tmp")
+               if err != nil {
+                   panic(err)
+               }
+               pprof.WriteHeapProfile(f)
+               f.Close()
+               if err := os.Rename(opts.memprofile+".tmp", opts.memprofile); err != nil {
+                   panic(err)
+               }
+           }
+       }()
+   }
    return nil
 }

The results look like this:

(pprof) top
Showing nodes accounting for 3542.16MB, 99.63% of 3555.19MB total
Dropped 119 nodes (cum <= 17.78MB)
Showing top 10 nodes out of 20
      flat  flat%   sum%        cum   cum%
 2448.56MB 68.87% 68.87%  2448.56MB 68.87%  github.com/klauspost/compress/zstd.encoderOptions.encoder
    1072MB 30.15% 99.03%     1072MB 30.15%  github.com/klauspost/compress/zstd.(*fastBase).ensureHist (inline)
   18.77MB  0.53% 99.55%    18.77MB  0.53%  github.com/klauspost/compress/zstd.(*blockEnc).init
    2.83MB 0.079% 99.63%   278.78MB  7.84%  github.com/klauspost/compress/zstd.(*Encoder).Reset
         0     0% 99.63%   140.49MB  3.95%  github.com/containers/storage/pkg/chunked/compressor.newTarSplitData
         0     0% 99.63%  3415.66MB 96.08%  github.com/containers/storage/pkg/chunked/compressor.writeZstdChunkedStream
         0     0% 99.63%  3135.88MB 88.21%  github.com/containers/storage/pkg/chunked/compressor.writeZstdChunkedStream.func3
         0     0% 99.63%  3415.66MB 96.08%  github.com/containers/storage/pkg/chunked/compressor.zstdChunkedWriterWithLevel.func1
         0     0% 99.63%   278.78MB  7.84%  github.com/containers/storage/pkg/chunked/internal.ZstdWriterWithLevel
         0     0% 99.63%  3135.88MB 88.21%  github.com/klauspost/compress/zstd.(*Encoder).Close

That's only 2.4G of memory, but presumably I caught it just before some more serious memory usage inflation.

The next thing I tried was doing "pure" zstd (not zstd:chunked), and that seems quite reliable. The profiler seems to report pretty flat usage:

(pprof) top
Showing nodes accounting for 82.50MB, 92.96% of 88.74MB total
Showing top 10 nodes out of 136
      flat  flat%   sum%        cum   cum%
      64MB 72.12% 72.12%       64MB 72.12%  github.com/klauspost/compress/zstd.(*fastBase).ensureHist
    5.47MB  6.17% 78.28%     5.47MB  6.17%  github.com/klauspost/compress/zstd.encoderOptions.encoder
    4.63MB  5.21% 83.50%     4.63MB  5.21%  github.com/klauspost/pgzip.NewReader
    2.26MB  2.55% 86.04%     2.26MB  2.55%  github.com/klauspost/compress/zstd.(*blockEnc).init
    1.13MB  1.27% 87.32%     8.86MB  9.99%  github.com/klauspost/compress/zstd.(*Encoder).Reset
    1.01MB  1.14% 88.45%     1.01MB  1.14%  regexp/syntax.(*compiler).inst
       1MB  1.13% 89.58%        1MB  1.13%  github.com/modern-go/reflect2.(*unsafeType).UnsafeNew
       1MB  1.13% 90.71%     1.50MB  1.69%  github.com/tchap/go-patricia/v2/patricia.NewTrie
       1MB  1.13% 91.84%        1MB  1.13%  github.com/json-iterator/go.(*Iterator).ReadString
       1MB  1.13% 92.96%     2.50MB  2.82%  github.com/containers/storage/pkg/truncindex.(*TruncIndex).addID
rhatdan commented 1 month ago

@giuseppe @mtrmac PTAL

giuseppe commented 1 month ago

it happens only with a --compression-level >= 10

There is a discussion in https://github.com/klauspost/compress/discussions/655 and indeed zstd.WithEncoderConcurrency(1) seems to reduce memory requirements but still needs ~2GB to convert quay.io/fedora/fedora-bootc:40.

Should we lower the default compression level?

mtrmac commented 1 month ago

This being my first time needing to profile a Go application's memory usage

I have https://github.com/mtrmac/skopeo/tree/profile-options pending (inspired by Podman’s implementation) from earlier, but I never got around to making that production-ready.