IBM / core-dump-handler

Save core dumps from a Kubernetes Service or RedHat OpenShift to an S3 protocol compatible object store
https://ibm.github.io/core-dump-handler/
MIT License

Get podname and namespace "unknown" #102

Open · wsszh opened 2 years ago

wsszh commented 2 years ago

Hi, I set filenameTemplate: "{uuid}-dump-{timestamp}-{hostname}-{exe_name}-{pid}-{signal}-{podname}-{namespace}", but I get a filename like this: "9a1fc79c-758c-4599-a22d-2e94444a3250-dump-1657867608-segfaulter-segfaulter-1-4-unknown-unknown.zip". How can I fix it?

joaogbcravo commented 1 year ago

Hey, I also got this unknown. How did you solve it?

Robert-Stam commented 6 months ago

We see this behaviour here as well (on AWS). Any news? I see the issue is set to closed, however it doesn't seem to be resolved.

No9 commented 6 months ago

Hey @Robert-Stam Can you confirm which aws.values.xxx.yaml you have used in the deployment and which version of EKS you are using? It's likely that the version of crio is now outdated, as this hasn't been updated for a while.

Robert-Stam commented 6 months ago

> Hey @Robert-Stam Can you confirm which aws.values.xxx.yaml you have used in the deployment and which version of EKS you are using? It's likely that the version of crio is now outdated, as this hasn't been updated for a while.

I have used the settings from: https://github.com/IBM/core-dump-handler/blob/main/charts/core-dump-handler/values.aws.yaml

We are using Kubernetes 1.28 (on Intel hardware, m6i family) with the AMI amazon-eks-node-1.28-v20240110. See: https://github.com/awslabs/amazon-eks-ami/releases/tag/v20240110

Thanks in advance!

Robert-Stam commented 4 months ago

@No9 Hi Anton, any update on this?

No9 commented 4 months ago

I don't have access to an AWS account to debug. Can you log into an agent container that has processed a core dump and provide the output of

cat /var/mnt/core-dump-handler/composer.log

If there are no errors, can you enable debugging by setting https://github.com/IBM/core-dump-handler/blob/main/charts/core-dump-handler/values.yaml#L27 to Debug?
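
For reference, a minimal sketch of one way to flip that setting on a live deployment, assuming the chart was installed with Helm (repo added as core-dump-handler) and using <release> and <namespace> as placeholders for your release name and namespace:

# Hypothetical example: switch the composer log level to Debug on an existing release.
helm upgrade <release> core-dump-handler/core-dump-handler \
  --namespace <namespace> \
  --reuse-values \
  --set composer.logLevel=Debug

The upgrade should roll the daemonset, and the agent rewrites the composer .env on startup (the "Writing composer .env" line in the agent log further down this thread), so the new level takes effect once the pods restart.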

Robert-Stam commented 3 months ago

I tested with k8s v1.29 on AKS (Azure) and GKE (Google), and in all cases the namespace resolves to 'unknown'. This is the output from the composer log on AKS:

ERROR - 2024-04-05T09:41:43.149688332+00:00 - failed to create pod at index 0
ERROR - 2024-04-05T09:41:47.803435709+00:00 - Failed to get pod id

Hope this helps.

Robert-Stam commented 3 months ago

@No9 I tried to create a small PR to update the packages and the crictl version, but without luck. FYI, here is my PR: https://github.com/IBM/core-dump-handler/pull/158

Have you successfully tried k8s v1.29 on IBM Cloud?

No9 commented 3 months ago

crictl is already on the host on IKS and others, so it isn't a useful test. Did you look for the composer log as per this comment? https://github.com/IBM/core-dump-handler/issues/102#issuecomment-2016027968

Robert-Stam commented 3 months ago

> crictl is already on the host on IKS and others, so it isn't a useful test.
>
> Did you look for the composer log as per this comment?
>
> https://github.com/IBM/core-dump-handler/issues/102#issuecomment-2016027968

See: https://github.com/IBM/core-dump-handler/issues/102#issuecomment-2039364826

Robert-Stam commented 3 months ago

> Have you successfully tried k8s v1.29 on IBM Cloud?

Have you successfully tried k8s v1.29 on IBM Cloud?

No9 commented 3 months ago

Sorry I missed your log output post for some reason. So it appears as though this command is executing but not returning a list of pods:

crictl pods  --name <hostname> -o json

where <hostname> is captured from the crashing container.

Are you overriding the hostname on the deployed workloads?

In the meantime I'll take a look at a 1.29 cluster to confirm. [Edit] Confirmed that the core dump works as expected on IBM Cloud IKS 1.29 with no additional values parameters. Tested with the following failing container.

kubectl run -i -t segfaulter --image=quay.io/icdh/segfaulter --restart=Never
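
As an aside, a quick hedged check for an overridden hostname on a running workload pod (with <pod> as a placeholder for the pod name):

# If these two values differ (e.g. because hostname/subdomain is set in the pod spec),
# the crictl lookup by hostname will not match and podname/namespace fall back to "unknown".
kubectl get pod <pod> -o jsonpath='{.metadata.name}{"\n"}'
kubectl exec <pod> -- hostname
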
Robert-Stam commented 3 months ago

> Sorry I missed your log output post for some reason. So it appears as though this command is executing but not returning a list of pods:
>
> crictl pods  --name <hostname> -o json
>
> where <hostname> is captured from the crashing container.
>
> Are you overriding the hostname on the deployed workloads?
>
> In the meantime I'll take a look at a 1.29 cluster to confirm. [Edit] Confirmed that the core dump works as expected on IBM Cloud IKS 1.29 with no additional values parameters. Tested with the following failing container.
>
> kubectl run -i -t segfaulter --image=quay.io/icdh/segfaulter --restart=Never

I am not overriding the hostname.

To make sure we are on the same page: you did test with {namespace} in the filenameTemplate, and it is filled out correctly?

No9 commented 3 months ago

Revalidated with this config:

composer:
  ignoreCrio: false
  crioImageCmd: "img"
  logLevel: "Warn"
  filenameTemplate: "{uuid}-dump-{timestamp}-{hostname}-{exe_name}-{pid}-{signal}-{namespace}"

Ran kubectl run -i -t segfaulter --image=quay.io/icdh/segfaulter --restart=Never

The following output from the container shows that the default namespace is obtained:

[2024-04-09T20:04:27Z INFO  core_dump_agent] Uploading: /var/mnt/core-dump-handler/cores/3fb6b86a-6726-4f5c-80fd-f34e8a971536-dump-1712693067-segfaulter-segfaulter-1-4-default.zip
[2024-04-09T20:04:27Z INFO  core_dump_agent] zip size is 28610
[2024-04-09T20:04:27Z INFO  core_dump_agent] S3 Returned: 200

Can I suggest getting a debug container on the host and establishing what happens when the following is run? If JSON is returned, can you either post it here and/or validate it in the test suite.

crictl pods  --name <hostname> -o json

Thanks. [Edit] Kubernetes info: IBM Kubernetes Service 1.29.3_1531

Robert-Stam commented 3 months ago

@No9 Anton, I executed your command in the running container (ibm/core-dump-handler:v8.10.0) on AWS (with k8s v1.29). This is the result:

[root@core-dump-lgd5p app]# ./crictl pods  --name ip-10-87-16-57.eu-west-2.compute.internal -o json
WARN[0000] runtime connect using default endpoints: [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock]. As the default settings are now deprecated, you should set the endpoint instead.
ERRO[0002] connect endpoint 'unix:///var/run/dockershim.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded
ERRO[0004] connect endpoint 'unix:///run/containerd/containerd.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded
FATA[0006] connect: connect endpoint 'unix:///run/crio/crio.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded

And these are the settings applied (based on the log):

[2024-04-10T09:21:31Z INFO  core_dump_agent] Writing composer .env
    LOG_LEVEL=Warn
    IGNORE_CRIO=false
    CRIO_IMAGE_CMD=img
    USE_CRIO_CONF=false
    FILENAME_TEMPLATE={namespace}-{uuid}-dump-{timestamp}-{hostname}-{exe_name}-{pid}-{signal}
    LOG_LENGTH=500
    POD_SELECTOR_LABEL=
    TIMEOUT=600
    COMPRESSION=true
    CORE_EVENTS=false
    EVENT_DIRECTORY=/var/mnt/core-dump-handler/events

No9 commented 3 months ago

OK, it looks like you are trying to run crictl from the handler container. What I was trying to suggest was setting up a debug session on the node, e.g.:

kubectl get nodes 
NAME             STATUS   ROLES           AGE    VERSION
node1   Ready    master,worker   176d   v1.26.9+52589e6
node2   Ready    master,worker   176d   v1.26.9+52589e6
node3    Ready    master,worker   176d   v1.26.9+52589e6

With a node name (it doesn't matter which), run:

 kubectl debug node/node1 --image=ubuntu

When you have a debug session run something like the following:

/host/usr/bin/crictl -r unix:///host/run/crio/crio.sock pods  --name core-dump-lgd5p -o json

Here /host/usr/bin/crictl is wherever you have configured crictl to be copied to, unix:///host/run/crio/crio.sock is the crio socket (which may be in a different location), and core-dump-lgd5p is the pod name.
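
If you are unsure of either path, a short sketch for finding them from the debug session (kubectl debug mounts the node's filesystem under /host; the socket and binary locations below are assumptions to verify):

# Check which container runtime socket actually exists on this node
# (EKS nodes typically run containerd rather than cri-o):
ls -l /host/run/crio/crio.sock /host/run/containerd/containerd.sock 2>/dev/null
# Locate the crictl binary copied onto the host by the daemonset (path may vary):
find /host -maxdepth 4 -type f -name crictl 2>/dev/null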

Expected output:

{
  "items": [
    {
      "id": "df2bb27cbc78c2fb51aea8cb2f9eeb6124c871244a5fb71e989458bb673125df",
      "metadata": {
        "name": "core-dump-handler-7kqc6",
        "uid": "c8ea5ce9-72be-4826-82b3-b8c3a8144d50",
        "namespace": "observe",
        "attempt": 0
      },
      "state": "SANDBOX_READY",
      "createdAt": "1712691523593249607",
      "labels": {
        "controller-revision-hash": "7b6c988b5d",
        "io.kubernetes.container.name": "POD",
        "io.kubernetes.pod.name": "core-dump-handler-7kqc6",
        "io.kubernetes.pod.namespace": "observe",
        "io.kubernetes.pod.uid": "c8ea5ce9-72be-4826-82b3-b8c3a8144d50",
        "name": "core-dump-ds",
        "pod-template-generation": "1"
      },
      "annotations": {
        "kubectl.kubernetes.io/default-container": "coredump-container",
        "kubernetes.io/config.seen": "2024-04-09T14:38:43.120492372-05:00",
        "kubernetes.io/config.source": "api",
        "openshift.io/scc": "core-dump-admin-privileged"
      },
      "runtimeHandler": ""
    }
  ]
}
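
For completeness, a hedged sketch of pulling just the pod name and namespace out of that JSON (requires jq in the debug image, and uses the same placeholder paths and pod name as above):

# Extract the name and namespace of the first matching pod sandbox:
/host/usr/bin/crictl -r unix:///host/run/crio/crio.sock pods --name core-dump-lgd5p -o json \
  | jq -r '.items[0].metadata | "\(.name) \(.namespace)"'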