IBM / core-dump-handler

Save core dumps from a Kubernetes service or Red Hat OpenShift to an S3-protocol-compatible object store
https://ibm.github.io/core-dump-handler/
MIT License

composer.log is missing in agent #136

Closed · bdbrink closed this 1 year ago

bdbrink commented 1 year ago

Hello,

Using v8.10.0 on EKS 1.23 with IRSA, I was able to install the agent and get it running, but when viewing /var/mnt/core-dump-handler the composer.log is missing. When triggering a core dump, nothing happens within the agent (I assume that's due to the missing log file).

Agent running:

[2023-02-07T17:04:29Z INFO  core_dump_agent] no .env file found 
     That's ok if running in kubernetes
[2023-02-07T17:04:29Z INFO  core_dump_agent] Setting host location to: /var/mnt/core-dump-handler
[2023-02-07T17:04:29Z INFO  core_dump_agent] Current Directory for setup is /app
[2023-02-07T17:04:29Z INFO  core_dump_agent] Copying the crictl from ./crictl to /var/mnt/core-dump-handler/crictl
[2023-02-07T17:04:29Z INFO  core_dump_agent] Copying the composer from ./vendor/default/cdc to /var/mnt/core-dump-handler/cdc
[2023-02-07T17:04:29Z INFO  core_dump_agent] Starting sysctl for kernel.core_pattern /var/mnt/core-dump-handler/core_pattern.bak with |/var/mnt/core-dump-handler/cdc -c=%c -e=%e -p=%p -s=%s -t=%t -d=/var/mnt/core-dump-handler/cores -h=%h -E=%E
[2023-02-07T17:04:29Z INFO  core_dump_agent] Getting sysctl for kernel.core_pattern
[2023-02-07T17:04:29Z INFO  core_dump_agent] Created Backup of /var/mnt/core-dump-handler/core_pattern.bak
[2023-02-07T17:04:29Z INFO  core_dump_agent] Starting sysctl for kernel.core_pipe_limit /var/mnt/core-dump-handler/core_pipe_limit.bak with 128
[2023-02-07T17:04:29Z INFO  core_dump_agent] Getting sysctl for kernel.core_pipe_limit
kernel.core_pattern = |/var/mnt/core-dump-handler/cdc -c=%c -e=%e -p=%p -s=%s -t=%t -d=/var/mnt/core-dump-handler/cores -h=%h -E=%E
[2023-02-07T17:04:29Z INFO  core_dump_agent] Created Backup of /var/mnt/core-dump-handler/core_pipe_limit.bak
kernel.core_pipe_limit = 128
[2023-02-07T17:04:29Z INFO  core_dump_agent] Starting sysctl for fs.suid_dumpable /var/mnt/core-dump-handler/suid_dumpable.bak with 2
[2023-02-07T17:04:29Z INFO  core_dump_agent] Getting sysctl for fs.suid_dumpable
[2023-02-07T17:04:29Z INFO  core_dump_agent] Created Backup of /var/mnt/core-dump-handler/suid_dumpable.bak
fs.suid_dumpable = 2
[2023-02-07T17:04:29Z INFO  core_dump_agent] Creating /var/mnt/core-dump-handler/.env file with LOG_LEVEL=Debug
[2023-02-07T17:04:29Z INFO  core_dump_agent] Writing composer .env 
    LOG_LEVEL=Debug
    IGNORE_CRIO=false
    CRIO_IMAGE_CMD=img
    USE_CRIO_CONF=false
    FILENAME_TEMPLATE={uuid}-dump-{timestamp}-{hostname}-{exe_name}-{pid}-{signal}
    LOG_LENGTH=500
    POD_SELECTOR_LABEL=
    TIMEOUT=600
    COMPRESSION=true
    CORE_EVENTS=true
    EVENT_DIRECTORY=/var/mnt/core-dump-handler/events

[2023-02-07T17:04:29Z INFO  core_dump_agent] Executing Agent with location : /var/mnt/core-dump-handler/cores
[2023-02-07T17:04:29Z INFO  core_dump_agent] Dir Content []
[2023-02-07T17:04:29Z INFO  core_dump_agent] INotify Starting...
[2023-02-07T17:04:29Z INFO  core_dump_agent] INotify Initialised...
[2023-02-07T17:04:29Z INFO  core_dump_agent] INotify watching : /var/mnt/core-dump-handler/cores
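
The log above shows the agent rewiring kernel.core_pattern so the kernel pipes every dump into cdc. For what it's worth, with shell access to a node this can be double-checked with:

sysctl kernel.core_pattern
sysctl kernel.core_pipe_limit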

Exec'ing into the pod:

/var/mnt/core-dump-handler # ls
cdc                  core_pattern.bak     core_pipe_limit.bak  cores                crictl               crictl.yaml          suid_dumpable.bak
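
So the cdc binary is there, but composer.log never appears. To trigger a test dump I've been using a crasher pod along the lines the project README suggests (image name per the docs):

kubectl run -i -t segfaulter --image=quay.io/icdh/segfaulter --restart=Never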

I've tried updating the vendor to rhel7, as suggested in other EKS-related issues, but this throws an error:

[2023-02-07T17:20:12Z INFO  core_dump_agent] Copying the composer from ./vendor/rhel7/cdc to /var/mnt/core-dump-handler/cdc
Error: No such file or directory (os error 2)
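
For reference, the vendor switch was just a chart value change, roughly like this (a sketch; the namespace and chart path are illustrative, daemonset.vendor is the key from the chart values):

helm upgrade core-dump-handler . -n observe --reuse-values --set daemonset.vendor=rhel7
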
No9 commented 1 year ago

Hi @bdbrink, thanks for taking the time to look at the project. I think you are on the right track: a failing cdc with no composer.log is likely caused by the wrong cdc being deployed. Unfortunately, the libc incompatibility problem is very hard to capture.

Can you confirm the --values flags used for the install? As per the docs, I would expect you used --values values.aws.yaml.
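
That is, an install roughly along these lines (a sketch based on the chart docs; the namespace is the docs' example):

cd core-dump-handler/charts/core-dump-handler
helm install core-dump-handler . --create-namespace --namespace observe --values values.aws.yaml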

I have looked at the v8.10.0 image and confirmed the cdc is there:

podman run -it quay.io/icdh/core-dump-handler:v8.10.0 /bin/bash
[root@67269174863d app]# ls ./vendor/rhel7/
cdc

So I can only assume that the copy failure on the second attempt is due to some issue mounting the volume after the reinstall. Can you uninstall the helm chart again and confirm the volumes have also been deleted before reinstalling with the values flag?
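
Something like this (names are illustrative):

helm uninstall core-dump-handler -n observe
kubectl get pv,pvc --all-namespaces | grep -i core    # should return nothing before the reinstall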

If the cdc still doesn't copy over can you log into the agent and try the copy manually?

cp ./vendor/rhel7/cdc /var/mnt/core-dump-handler/cdc

It seems like we have two possible issues here, so apologies for all the questions.

bdbrink commented 1 year ago

@No9 Thanks for the quick response!

I specified values.aws.sts.yaml for the chart values, deleted the volumes, and gave it a fresh redeploy, but the error still persists. When specifying default, the agent begins to run, but there is still no composer.log. The pods are crashlooping, so I can't get into the agent to copy the file over manually (unless there is another way to do that; a node debug pod might work, as sketched below). I'm using the arm64 image, and the k8s nodes are running Amazon Linux 2 with containerd as the runtime, if any of that helps.
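
A rough sketch of that workaround (pod and node names are placeholders; kubectl debug mounts the node root filesystem at /host):

kubectl logs <agent-pod> -n observe --previous
kubectl debug node/<node-name> -it --image=busybox
ls /host/var/mnt/core-dump-handler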

No9 commented 1 year ago

Can I clarify exactly where we are, as I've got a feeling I'm going to have to try and replicate this:

You are running EKS 1.23 on arm64 with Amazon Linux 2 worker nodes. You are using the arm64 images from quay.io/icdh/core-dump-handler-musl:v8.10.0, so you have updated the main values.yaml to be:

image:
  registry: quay.io
  repository: icdh/core-dump-handler-musl
  tag: v8.10.0

or you are supplying --set image.repository=quay.io/icdh/core-dump-handler-musl as a CLI option.

When you run this values.aws.sts.yaml with no other yaml changes or command options, you get a CrashLoopBackOff.

# AWS requires a crio client to be copied to the server
daemonset:
  includeCrioExe: true
  vendor: rhel7 # EKS EC2 images have an old libc=2.26

serviceAccount:
  annotations:
    # See https://docs.aws.amazon.com/eks/latest/userguide/specify-service-account-role.html
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789000:role/iam-role-name-here

When you run it with vendor: default it works, but there is still no composer log.

# AWS requires a crio client to be copied to the server
daemonset:
  includeCrioExe: true
  vendor: default # <---------- This is the change

serviceAccount:
  annotations:
    # See https://docs.aws.amazon.com/eks/latest/userguide/specify-service-account-role.html
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789000:role/iam-role-name-here

Setting the composer log to Info with --set composer.logLevel=Info still doesn't provide any additional information after a core dump has been generated. Or, in values.aws.sts.yaml:

composer:
  logLevel: "Info"

Please correct me if I have misunderstood something as I want to be sure I'm on the same config as you.

bdbrink commented 1 year ago

Yes, that's the exact setup I'm running. I'm doing this all in the helm charts (deploying with Helm v3.9.4), not using the CLI for any inputs.

No9 commented 1 year ago

OK, so it looks like default should be used, as that successfully runs the agent. I'll try to log in to that and copy over the cdc. Then what I'll try to do is get onto a host node and run cdc --help to see if there is any dependency weirdness that has been introduced. It won't be for a few days though.
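
Roughly what I have in mind (a sketch; node access via kubectl debug and the busybox image are assumptions, the node name is a placeholder):

kubectl debug node/<node-name> -it --image=busybox
chroot /host /bin/sh
/var/mnt/core-dump-handler/cdc --help    # a libc mismatch usually fails here with a loader error
ldd /var/mnt/core-dump-handler/cdc       # the musl build should report "not a dynamic executable"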