
[bitnami/kafka] Healthcheck produces OOM on kafka #56544

Closed: serut closed this issue 6 months ago

serut commented 7 months ago

Name and Version

bitnami/kafka:3.6.0

What architecture are you using?

amd64

What steps will reproduce the bug?

I'm getting an OOM kill on the healthcheck of the Kafka container:

        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": true,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 137,
            "Error": "",
            "StartedAt": "2024-02-07T02:44:26.433758449Z",
            "FinishedAt": "2024-02-07T09:59:47.062914954Z",
            "Health": {
                "Status": "unhealthy",
                "FailingStreak": 1,
                "Log": [
                    {
                        "Start": "2024-02-07T10:58:50.627061665+01:00",
                        "End": "2024-02-07T10:58:54.024116601+01:00",
                        "ExitCode": 0,
                        "Output": "Cluster ID: abcdefghijklmnopqrstug\n"
                    },
                    {
                        "Start": "2024-02-07T10:59:04.025932968+01:00",
                        "End": "2024-02-07T10:59:07.226737588+01:00",
                        "ExitCode": 0,
                        "Output": "Cluster ID: abcdefghijklmnopqrstug\n"
                    },
                    {
                        "Start": "2024-02-07T10:59:17.228531406+01:00",
                        "End": "2024-02-07T10:59:20.608769941+01:00",
                        "ExitCode": 0,
                        "Output": "Cluster ID: abcdefghijklmnopqrstug\n"
                    },
                    {
                        "Start": "2024-02-07T10:59:30.610299159+01:00",
                        "End": "2024-02-07T10:59:33.928822102+01:00",
                        "ExitCode": 0,
                        "Output": "Cluster ID: abcdefghijklmnopqrstug\n"
                    },
                    {
                        "Start": "2024-02-07T10:59:43.929545212+01:00",
                        "End": "2024-02-07T10:59:46.699016865+01:00",
                        "ExitCode": 137,
                        "Output": ""
                    }
                ]
            }
        }
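
(The state above comes from docker inspect, e.g. docker inspect --format '{{json .State}}' <container-id>.)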

So I tried to increase the heap size, but I see that the healthcheck uses the same heap settings as the broker, which is not expected.

  1. Deploy Kafka like this (healthcheck + KAFKA_HEAP_OPTS env var defined):

    rs-kafka:
      image: bitnami/kafka:3.6.0
      user: "3050:1050"
      hostname: rs-kafka-{{.Task.Slot}}
      read_only: true
      networks:
        - test
      healthcheck:
        test: ["CMD", "kafka-cluster.sh", "cluster-id", "--bootstrap-server", "localhost:9092"]
        interval: 10s
        timeout: 10s
        retries: 10
        start_period: 15s
      volumes:
        - type: tmpfs
          target: /opt/bitnami/kafka/config/
        - type: tmpfs
          target: /bitnami/kafka
        - type: tmpfs
          target: /tmp
      environment:
        - "ALLOW_PLAINTEXT_LISTENER=True"
        - "ALLOW_ANONYMOUS_LOGIN=True"
        - "KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1"
        - "KAFKA_CFG_CONTROLLER_QUORUM_VOTERS={{.Task.Slot}}@rs-kafka-{{.Task.Slot}}:9093"
        - "KAFKA_CFG_NODE_ID={{.Task.Slot}}"
        - "KAFKA_KRAFT_CLUSTER_ID=abcdefghijklmnopqrstuv"
        - "KAFKA_CFG_PROCESS_ROLES=controller,broker"
        - "KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,PLAINTEXT_HOST://:29092,CONTROLLER://:9093"
        - "KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://:9092,PLAINTEXT_HOST://:29092"
        - "KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT,CONTROLLER:PLAINTEXT"
        - "KAFKA_CFG_CONTROLLER_LISTENER_NAMES=CONTROLLER"
        - "KAFKA_CFG_AUTO_CREATE_TOPICS_ENABLE=True"
        - "KAFKA_HEAP_OPTS=-Xmx1050m -Xms1050m"
      deploy:
        restart_policy:
          condition: any
          delay: 1s
          max_attempts: 0
          window: 0s
        replicas: 1
        resources:
          limits:
            cpus: '1'
            memory: 3g
            pids: 100000
          reservations:
            cpus: '0.5'
            memory: 512M
  2. Check the healthcheck command that gets executed:

    I have no name!@rs-kafka-1:/$ ps aux
    USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    [this is the server]
    3050         1  6.3  1.9 8000048 402192 ?      Ssl  14:40   0:35 /opt/bitnami/java/bin/java -Xmx1050m -Xms1050m -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Dcom.sun.management.jmxremot
    3050     16867  0.6  0.0   4140  2068 pts/0    Ss   14:50   0:00 bash
    [this is the healthcheck]
    3050     16883  126  0.3 3972436 64724 ?       Ssl  14:50   0:01 /opt/bitnami/java/bin/java -Xmx1050m -Xms1050m -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Dcom.sun.management.jmxremot
    3050     17266  0.0  0.0   6740  1528 pts/0    R+   14:50   0:00 ps aux
  3. The healthcheck uses -Xmx1050m -Xms1050m, but this setting should only be used by the server.
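
For context, upstream's kafka-run-class.sh (which every Kafka CLI tool, including kafka-cluster.sh, goes through) only falls back to a small default heap when KAFKA_HEAP_OPTS is unset, roughly:

    # excerpt (simplified) from the stock kafka-run-class.sh: the small CLI default
    # only applies when KAFKA_HEAP_OPTS is not already set in the environment
    if [ -z "$KAFKA_HEAP_OPTS" ]; then
      KAFKA_HEAP_OPTS="-Xmx256M"
    fi

Since the variable is exported to the whole container, every healthcheck invocation spawns a second JVM that inherits the same -Xmx1050m -Xms1050m, as seen in the ps output above.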

What is the expected behavior?

The script /opt/bitnami/kafka/bin/kafka-run-class.sh should not use the same KAFKA_HEAP_OPTS environment variable as the broker service does.
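
A possible workaround (an untested sketch; the 256m value is an arbitrary choice for the CLI) would be to override the variable only for the probe command:

    healthcheck:
      # override KAFKA_HEAP_OPTS just for this command; the broker keeps its own value
      test: ["CMD-SHELL", "KAFKA_HEAP_OPTS='-Xmx256m' kafka-cluster.sh cluster-id --bootstrap-server localhost:9092"]
      interval: 10s
      timeout: 10s
      retries: 10
      start_period: 15s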

carrodher commented 7 months ago

Thank you for bringing this issue to our attention. We appreciate your involvement! If you're interested in contributing a solution, we welcome you to create a pull request. The Bitnami team is excited to review your submission and offer feedback. You can find the contributing guidelines here.

Your contribution will greatly benefit the community. Feel free to reach out if you have any questions or need assistance.

github-actions[bot] commented 7 months ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions[bot] commented 6 months ago

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.

serut commented 6 months ago

This looks related to this issue: https://github.com/getsentry/self-hosted/issues/2567

My container gets OOM-killed even though it has a lot of memory available (around 500 MB of usage, per the attached monitoring screenshot).

The container has a memory limit of 2.5 GB!

It looks like it consumes a lot of memory within a few seconds, and this seems related to the healthcheck:

$ dmesg
[...]
[4053449.343871] iptables: # DROP # IN=ens192 OUT= MAC=01:00:5e:00:00:01:55:55:55:55:55:55:08:00 SRC=0.0.0.0 DST=224.0.0.1 LEN=36 TOS=0x00 PREC=0x00 TTL=1 ID=0 PROTO=2 
[4053467.556077] kafka-admin-cli invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[4053467.556081] kafka-admin-cli cpuset=7bff4ccbb0bda7825d84ae5be7166abc8dc37f024cca18d7c6bbc918e8ba593b mems_allowed=0
[4053467.556087] CPU: 1 PID: 66679 Comm: kafka-admin-cli Kdump: loaded Tainted: G               ------------ T 3.10.0-1160.105.1.el7.x86_64 #1
[4053467.556088] Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.18227214.B64.2106252220 06/25/2021
[4053467.556089] Call Trace:
[4053467.556104]  [<ffffffff8bdb1bec>] dump_stack+0x19/0x1f
[4053467.556106]  [<ffffffff8bdacb4f>] dump_header+0x90/0x22d
[4053467.556111]  [<ffffffff8b8ad198>] ? ep_poll_callback+0xf8/0x220
[4053467.556115]  [<ffffffff8b7cce16>] ? find_lock_task_mm+0x56/0xd0
[4053467.556118]  [<ffffffff8b84a458>] ? try_get_mem_cgroup_from_mm+0x28/0x70
[4053467.556120]  [<ffffffff8b7cd3a5>] oom_kill_process+0x2d5/0x4a0
[4053467.556124]  [<ffffffff8b9223ec>] ? selinux_capable+0x1c/0x40
[4053467.556126]  [<ffffffff8b84e93c>] mem_cgroup_oom_synchronize+0x55c/0x590
[4053467.556127]  [<ffffffff8b84dd90>] ? mem_cgroup_charge_common+0xc0/0xc0
[4053467.556129]  [<ffffffff8b7cdca4>] pagefault_out_of_memory+0x14/0x90
[4053467.556131]  [<ffffffff8bdaaf88>] mm_fault_error+0x6a/0x15b
[4053467.556134]  [<ffffffff8bdbfa61>] __do_page_fault+0x4a1/0x510
[4053467.556136]  [<ffffffff8bdbfb05>] do_page_fault+0x35/0x90
[4053467.556137]  [<ffffffff8bdbb7b8>] page_fault+0x28/0x30
[4053467.556139] Task in /docker/7bff4ccbb0bda7825d84ae5be7166abc8dc37f024cca18d7c6bbc918e8ba593b killed as a result of limit of /docker/7bff4ccbb0bda7825d84ae5be7166abc8dc37f024cca18d7c6bbc918e8ba593b
[4053467.556143] memory: usage 2621440kB, limit 2621440kB, failcnt 86
[4053467.556144] memory+swap: usage 2621440kB, limit 5242880kB, failcnt 0
[4053467.556144] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[4053467.556145] Memory cgroup stats for /docker/7bff4ccbb0bda7825d84ae5be7166abc8dc37f024cca18d7c6bbc918e8ba593b: cache:2019380KB rss:602060KB rss_huge:215040KB mapped_file:684KB swap:0KB inactive_anon:1921664KB active_anon:699632KB inactive_file:0KB active_file:0KB unevictable:0KB
[4053467.556152] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[4053467.556269] [89869] 1304050 89869  1998155   137604     420        0             0 java
[4053467.556277] [66288] 1304050 66288   798743    19263      99        0             0 java
[4053467.556278] Memory cgroup out of memory: Kill process 91997 (executor-Rebala) score 210 or sacrifice child
[4053467.557283] Killed process 89869 (java), UID 1304050, total-vm:7992620kB, anon-rss:532968kB, file-rss:17244kB, shmem-rss:204kB
[4053467.559325] C1 CompilerThre invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[4053467.559329] C1 CompilerThre cpuset=7bff4ccbb0bda7825d84ae5be7166abc8dc37f024cca18d7c6bbc918e8ba593b mems_allowed=0
[4053467.559332] CPU: 0 PID: 66664 Comm: C1 CompilerThre Kdump: loaded Tainted: G               ------------ T 3.10.0-1160.105.1.el7.x86_64 #1
[4053467.559333] Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.18227214.B64.2106252220 06/25/2021
[4053467.559334] Call Trace:
[4053467.559341]  [<ffffffff8bdb1bec>] dump_stack+0x19/0x1f
[4053467.559343]  [<ffffffff8bdacb4f>] dump_header+0x90/0x22d
[4053467.559348]  [<ffffffff8b8ad198>] ? ep_poll_callback+0xf8/0x220
[4053467.559352]  [<ffffffff8b7cce16>] ? find_lock_task_mm+0x56/0xd0
[4053467.559355]  [<ffffffff8b84a458>] ? try_get_mem_cgroup_from_mm+0x28/0x70
[4053467.559357]  [<ffffffff8b7cd3a5>] oom_kill_process+0x2d5/0x4a0
[4053467.559361]  [<ffffffff8b9223ec>] ? selinux_capable+0x1c/0x40
[4053467.559363]  [<ffffffff8b84e93c>] mem_cgroup_oom_synchronize+0x55c/0x590
[4053467.559366]  [<ffffffff8b84dd90>] ? mem_cgroup_charge_common+0xc0/0xc0
[4053467.559368]  [<ffffffff8b7cdca4>] pagefault_out_of_memory+0x14/0x90
[4053467.559370]  [<ffffffff8bdaaf88>] mm_fault_error+0x6a/0x15b
[4053467.559374]  [<ffffffff8bdbfa61>] __do_page_fault+0x4a1/0x510
[4053467.559376]  [<ffffffff8bdbfb05>] do_page_fault+0x35/0x90
[4053467.559378]  [<ffffffff8bdbb7b8>] page_fault+0x28/0x30
[4053467.559379] Task in / killed as a result of limit of /docker/7bff4ccbb0bda7825d84ae5be7166abc8dc37f024cca18d7c6bbc918e8ba593b
[4053467.559382] memory: usage 2621440kB, limit 2621440kB, failcnt 93
[4053467.559383] memory+swap: usage 2621440kB, limit 5242880kB, failcnt 0
[4053467.559384] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[4053467.559385] Memory cgroup stats for /docker/7bff4ccbb0bda7825d84ae5be7166abc8dc37f024cca18d7c6bbc918e8ba593b: cache:2019380KB rss:526288KB rss_huge:139264KB mapped_file:684KB swap:0KB inactive_anon:1921664KB active_anon:699656KB inactive_file:0KB active_file:0KB unevictable:0KB
[4053467.559577] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[4053467.559627] [66288] 1304050 66288   798743    19263      99        0             0 java
[4053467.559629] Memory cgroup out of memory: Kill process 91773 (data-plane-kafk) score 213 or sacrifice child
[4053468.014922] br0: port 5(veth15) entered disabled state
[4053468.040911] docker_gwbridge: port 4(veth8fe67ae) entered disabled state
[4053468.058910] docker_gwbridge: port 4(veth8fe67ae) entered disabled state
[4053468.064063] device veth8fe67ae left promiscuous mode
[4053468.064086] docker_gwbridge: port 4(veth8fe67ae) entered disabled state
[4053468.077785] br0: port 5(veth15) entered disabled state
[4053468.082879] device veth15 left promiscuous mode
[4053468.082901] br0: port 5(veth15) entered disabled state
[4053478.350456] br0: port 5(veth17) entered blocking state
[4053478.350462] br0: port 5(veth17) entered disabled state
[4053478.350569] device veth17 entered promiscuous mode
[4053478.353565] docker_gwbridge: port 4(veth3305e9d) entered blocking state
[4053478.353570] docker_gwbridge: port 4(veth3305e9d) entered disabled state
[4053478.353682] device veth3305e9d entered promiscuous mode
[4053478.353977] docker_gwbridge: port 4(veth3305e9d) entered blocking state
[4053478.353979] docker_gwbridge: port 4(veth3305e9d) entered forwarding state
[4053478.433284] IPVS: Creating netns size=2048 id=41
[4053478.549030] docker_gwbridge: port 4(veth3305e9d) entered disabled state
[4053478.549079] br0: port 5(veth17) entered blocking state
[4053478.549081] br0: port 5(veth17) entered forwarding state
[4053478.564834] docker_gwbridge: port 4(veth3305e9d) entered blocking state
[4053478.564839] docker_gwbridge: port 4(veth3305e9d) entered forwarding state
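
Reading the cgroup numbers in that dump: usage sits exactly at the limit (2621440 kB = 2.5 GiB) while rss is only ~600 MB; the other ~2 GB is cache (cache:2019380KB), which presumably includes the tmpfs volumes and is charged to the same memory cgroup. Each healthcheck run then adds one more JVM on top of that, which appears to be what tips the container over the limit (kafka-admin-cli, the task that invoked the oom-killer, matches the truncated AdminClient thread name used by the Kafka CLI tools).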

I can reproduce it about 3 times per day, so I will report back whether using nc instead of the Kafka CLI for the healthcheck fixes anything.
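
For reference, the JVM-free probe I plan to try would look roughly like this (assuming nc is available in the image; it only checks that the listener on port 9092 accepts TCP connections, not that the cluster is actually healthy):

    healthcheck:
      # no extra JVM is spawned: just verify the PLAINTEXT listener accepts connections
      test: ["CMD-SHELL", "nc -z localhost 9092 || exit 1"]
      interval: 10s
      timeout: 10s
      retries: 10
      start_period: 15s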