anchore / anchore-engine

A service that analyzes docker images and scans for vulnerabilities
Apache License 2.0

policy engine pod restarts with OOMKilled #476

Open reschenburgIDBS opened 4 years ago

reschenburgIDBS commented 4 years ago

Request for help!

I think it may well be a bug in the Helm chart, though.

In short, my policy engine keeps restarting with an OOMKilled error. I don't think it's causing any actual issues, at least none that I've noticed, but it's still annoying.

I've set the resource limits and requests for the policy engine container as follows:

    Limits:
      cpu:     2
      memory:  8Gi
    Requests:
      cpu:      500m
      memory:   6Gi
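
For reference, here is a rough sketch of how those values could be passed to the chart at upgrade time, assuming the chart repo is added under the alias `anchore` and that `anchorePolicyEngine.resources` is the right value path (both are assumptions; check the chart's values.yaml):

    # Sketch only: the "anchorePolicyEngine.resources.*" path is an assumption
    # based on the anchore-engine chart layout; verify it against values.yaml.
    helm upgrade anchore anchore/anchore-engine \
      --reuse-values \
      --set anchorePolicyEngine.resources.limits.cpu=2 \
      --set anchorePolicyEngine.resources.limits.memory=8Gi \
      --set anchorePolicyEngine.resources.requests.cpu=500m \
      --set anchorePolicyEngine.resources.requests.memory=6Gi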

deployed versions:

NAME            REVISION        UPDATED                         STATUS          CHART                   APP VERSION     NAMESPACE  
anchore         8               Mon May  4 07:40:33 2020        DEPLOYED        anchore-engine-1.4.2    0.6.1           anchore 

everything else works fine:

NAME                                                  READY   STATUS    RESTARTS   AGE
anchore-anchore-engine-analyzer-74c5b9966c-r469r      1/1     Running   0          31d
anchore-anchore-engine-api-778c7cdb6c-qdv2q           1/1     Running   0          31d
anchore-anchore-engine-catalog-665bc7954c-mgh9j       1/1     Running   0          33d
anchore-anchore-engine-policy-755d9fcd4-4pfgx         1/1     Running   139        19d
anchore-anchore-engine-simplequeue-5b57b9f9fc-l2k79   1/1     Running   0          33d
anchore-cli                                           1/1     Running   0          33d
anchore-postgresql-77cf987fcc-2pbxg                   1/1     Running   0          33d

CLI version: anchore-cli, version 0.7.1

Docker image in use: docker.io/anchore/anchore-engine:v0.6.0

Here's a fun picture of what that looks like on a graph: (memory usage graph screenshot attached)

I suspect I'm missing an Xmx-equivalent setting somewhere - any help would be much appreciated!
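
In case it helps anyone else triaging the same thing, here is a minimal sketch of how to confirm the restarts really are OOM kills rather than probe failures. This is plain kubectl; the pod name and namespace come from the listing above, and `kubectl top` assumes metrics-server is installed:

    # Last termination reason for the policy engine pod's container; this should
    # read "OOMKilled" if the kernel killed it at the memory limit.
    kubectl -n anchore get pod anchore-anchore-engine-policy-755d9fcd4-4pfgx \
      -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

    # Current memory usage per pod (needs metrics-server).
    kubectl -n anchore top pod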

reschenburgIDBS commented 4 years ago

I'm not sure if it helps, but here are some logs from around the time of the second spike in the picture:

[service:policy-engine] 2020-06-03 10:31:30+0000 [-] [Thread-44] [anchore_engine.services.policy_engine.engine.logs/info()] [INFO] Db merge took 265.92521929740906 sec
[service:policy-engine] 2020-06-03 10:31:34+0000 [-] "10.0.3.109" - - [03/Jun/2020:10:31:33 +0000] "GET /health HTTP/1.1" 200 - "-" "kube-probe/1.15+"
[service:policy-engine] 2020-06-03 10:31:35+0000 [-] "10.0.3.109" - - [03/Jun/2020:10:31:34 +0000] "GET /health HTTP/1.1" 200 - "-" "kube-probe/1.15+"
[service:policy-engine] 2020-06-03 10:31:46+0000 [-] "10.0.3.109" - - [03/Jun/2020:10:31:46 +0000] "GET /health HTTP/1.1" 200 - "-" "kube-probe/1.15+"
[service:policy-engine] 2020-06-03 10:31:46+0000 [-] "10.0.3.109" - - [03/Jun/2020:10:31:46 +0000] "GET /health HTTP/1.1" 200 - "-" "kube-probe/1.15+"
[service:policy-engine] 2020-06-03 10:31:46+0000 [-] [Thread-8] [anchore_engine.services.policy_engine/handle_feed_sync_trigger()] [INFO] Feed Sync task creator activated
[service:policy-engine] 2020-06-03 10:31:46+0000 [-] [Thread-8] [anchore_engine.services.policy_engine/handle_feed_sync_trigger()] [INFO] Feed Sync Trigger done, waiting for next cycle.
[service:policy-engine] 2020-06-03 10:31:47+0000 [-] [Thread-8] [anchore_engine.services.policy_engine/handle_feed_sync_trigger()] [INFO] Feed Sync task creator complete

The actual alert that memory had gone above 7GB occurred at 10:34:20 - after the Feed Sync task creator completed.
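
A rough way to correlate the memory spikes with feed syncs is to check the last-sync timestamps from the CLI. This assumes the CLI is already pointed at the engine via ANCHORE_CLI_URL/ANCHORE_CLI_USER/ANCHORE_CLI_PASS or the --url/--u/--p options:

    # List feed groups and their last-sync times, then compare those times
    # against when the memory spikes and OOM kills occur.
    anchore-cli system feeds list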

zhill commented 4 years ago

There was a significant improvement in memory usage in the 0.6.1 release of Engine. This was a known issue with 0.6.0 that is resolved by upgrading. Upgrading to 0.6.1 has no db upgrade requirements, so it is relatively fast and safe. Or you can upgrade all the way to 0.7.1, which does have a db upgrade step but brings other benefits as well.

The issue you're seeing is the policy engine working through large chunks of data during the feed sync process. The fix in 0.6.1 is to spool that data to disk to avoid as much in-memory processing as possible and keep the memory footprint small. After the upgrade you should see usage in the 800MB range for the feed sync process, instead of 6GB+.
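
A rough sketch of that upgrade via Helm, using the release name from the output above. The chart version that bundles engine 0.6.1 or 0.7.1 should be looked up in the chart's release notes rather than taken from this sketch:

    # Helm 3 syntax shown ("helm search repo"); Helm 2 uses plain "helm search".
    helm repo update
    helm search repo anchore/anchore-engine --versions
    helm upgrade anchore anchore/anchore-engine --version <chart-version> --reuse-values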

zhill commented 4 years ago

Any update on this, @reschenburgIDBS? Did the upgrade fix it?

billythach commented 2 years ago

Hi @zhill! I have the same problem with version 1.1.0 deployed with https://github.com/anchore/anchore-charts. When I launch a global scan from Harbor (over a thousand images), this issue seems to become more visible or happen faster, since Harbor launches 10 simultaneous scans (screenshot attached). These pods are already sized with a large amount of memory.

Maybe there is a memory leak?
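
One way to tell a leak apart from plain working-set pressure would be to watch the policy engine's memory across the whole bulk scan and see whether it ever drops back down afterwards. A minimal sketch, assuming metrics-server is installed (substitute your namespace):

    # If usage climbs across scans and never recovers, that points at a leak;
    # if it spikes during scans/feed syncs and then drops back, it is more
    # likely just working-set pressure.
    watch -n 30 'kubectl -n <namespace> top pod | grep policy'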