[Open] wsszh opened this issue 2 years ago
Hey, I also got this "unknown" namespace. How did you solve it?
We see this behaviour here as well (on AWS). Any news? I see the issue is set to closed, however it doesn't seem to be resolved?
Hey @Robert-Stam
Can you confirm which aws.values.xxx.yaml
you have used in the deployment and which version of EKS you are using.
It's likely that the version of crio is now outdated as this hasn't been updated for a while.
I have used the settings from: https://github.com/IBM/core-dump-handler/blob/main/charts/core-dump-handler/values.aws.yaml
We are using Kubernetes 1.28 (on Intel hardware, m6i family) with the AMI: amazon-eks-node-1.28-v20240110 See: https://github.com/awslabs/amazon-eks-ami/releases/tag/v20240110
Thanks in advance!
@No9 Hi Anton, any update on this?
I don't have access to an AWS account to debug. Can you log into an agent container that has processed a core dump and provide the output of
cat /var/mnt/core-dump-handler/composer.log
If there are no errors can you enable debugging by setting
https://github.com/IBM/core-dump-handler/blob/main/charts/core-dump-handler/values.yaml#L27
to Debug
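For reference, that override in a values file would look roughly like this (the composer key layout follows the chart's existing values.yaml; treat this as a sketch):

```yaml
composer:
  # Default is "Warn"; "Debug" makes the composer log each lookup it performs,
  # which should show whether the crictl pod query is returning anything.
  logLevel: "Debug"
```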
I tested with k8s v1.29 on AKS (Azure) and GKE (Google), and it all resolves to 'unknown' as namespace. This is the output from the composer log on AKS
ERROR - 2024-04-05T09:41:43.149688332+00:00 - failed to create pod at index 0
ERROR - 2024-04-05T09:41:47.803435709+00:00 - Failed to get pod id
Hope this helps.
@No9 I tried to create a small PR to update the packages and the crictl version, however without luck. FYI, here is my PR: https://github.com/IBM/core-dump-handler/pull/158
Have you tried k8s v1.29 in the IBM cloud successfully?
crictl is already on the host on IKS and others so it isn't a useful test. Did you look for the compose logs as per this comment? https://github.com/IBM/core-dump-handler/issues/102#issuecomment-2016027968
See: https://github.com/IBM/core-dump-handler/issues/102#issuecomment-2039364826
Have you tried k8s v1.29 in the IBM cloud successfully?
Sorry I missed your log output post for some reason. So it appears as though this command is executing but not returning a list of pods:
crictl pods --name <hostname> -o json
where <hostname>
is captured from the crashing container.
Are you overriding the hostname on the deployed workloads?
In the meantime I'll take a look at a 1.29 cluster to confirm. [Edit] Confirmed that the core dump works as expected on IBM Cloud IKS 1.29 with no additional values parameters. Tested with the following failing container.
kubectl run -i -t segfaulter --image=quay.io/icdh/segfaulter --restart=Never
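If the lookup is failing, one thing worth checking is the core_pattern the agent installs on the node, since that is where the hostname handed to crictl originates (the exact pattern is installation-specific; this is just an inspection sketch):

```shell
# Print the kernel core_pattern on the node. With the handler installed it should
# pipe core dumps to the composer binary with several %-specifiers (e.g. %h for
# the hostname of the crashing container).
cat /proc/sys/kernel/core_pattern
```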
I am not overriding the hostname
To make sure we are on the same page, you did test with {namespace} in the filenameTemplate and that is filled out correctly?
Revalidated with this config
composer:
ignoreCrio: false
crioImageCmd: "img"
logLevel: "Warn"
filenameTemplate: "{uuid}-dump-{timestamp}-{hostname}-{exe_name}-{pid}-{signal}-{namespace}"
Ran kubectl run -i -t segfaulter --image=quay.io/icdh/segfaulter --restart=Never
The following output from the container shows that the default namespace is obtained.
[2024-04-09T20:04:27Z INFO core_dump_agent] Uploading: /var/mnt/core-dump-handler/cores/3fb6b86a-6726-4f5c-80fd-f34e8a971536-dump-1712693067-segfaulter-segfaulter-1-4-default.zip
[2024-04-09T20:04:27Z INFO core_dump_agent] zip size is 28610
[2024-04-09T20:04:27Z INFO core_dump_agent] S3 Returned: 200
Can I suggest getting a debug container on the host and establishing what happens when the following is run. If JSON is returned, can you either post it here and/or validate it in the test suite.
crictl pods --name <hostname> -o json
Thanks
[Edit]
Kubernetes info: IBM Kubernetes Service 1.29.3_1531
@No9 Anton, I executed your command in the running container (ibm/core-dump-handler:v8.10.0) on AWS (with k8s v1.29). This is the result
[root@core-dump-lgd5p app]# ./crictl pods --name ip-10-87-16-57.eu-west-2.compute.internal -o json
WARN[0000] runtime connect using default endpoints: [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock]. As the default settings are now deprecated, you should set the endpoint instead.
ERRO[0002] connect endpoint 'unix:///var/run/dockershim.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded
ERRO[0004] connect endpoint 'unix:///run/containerd/containerd.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded
FATA[0006] connect: connect endpoint 'unix:///run/crio/crio.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded
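Those connection errors look like crictl probing its deprecated default endpoints and not finding any of the sockets mounted. On EKS the runtime is containerd, so from the node itself the endpoint needs to be set explicitly; a sketch (the socket path is containerd's default and may differ on your AMI):

```shell
# Point crictl at containerd's socket instead of letting it probe the
# deprecated defaults (dockershim, containerd, crio in turn).
export CONTAINER_RUNTIME_ENDPOINT=unix:///run/containerd/containerd.sock
echo "$CONTAINER_RUNTIME_ENDPOINT"   # prints unix:///run/containerd/containerd.sock
# then, on the node: crictl pods --name <hostname> -o json
```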
And these are the settings applied (based on the log)
[2024-04-10T09:21:31Z INFO core_dump_agent] Writing composer .env
LOG_LEVEL=Warn
IGNORE_CRIO=false
CRIO_IMAGE_CMD=img
USE_CRIO_CONF=false
FILENAME_TEMPLATE={namespace}-{uuid}-dump-{timestamp}-{hostname}-{exe_name}-{pid}-{signal}
LOG_LENGTH=500
POD_SELECTOR_LABEL=
TIMEOUT=600
COMPRESSION=true
CORE_EVENTS=false
EVENT_DIRECTORY=/var/mnt/core-dump-handler/events
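For what it's worth, the "unknown" in the reported filenames is consistent with the composer substituting a fallback value into the template when the crictl pod lookup fails. Roughly (a bash sketch, not the actual composer code; the fallback value is inferred from the filenames reported in this thread):

```shell
template='{namespace}-{uuid}-dump-{timestamp}-{hostname}-{exe_name}-{pid}-{signal}'
namespace='unknown'   # fallback when no pod matches the hostname
# Substitute the placeholder; the other fields are filled in the same way.
echo "${template/\{namespace\}/$namespace}"
# prints: unknown-{uuid}-dump-{timestamp}-{hostname}-{exe_name}-{pid}-{signal}
```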
OK it looks like you are trying to run crictl from the handler container. What I was trying to suggest was setting up a debug session on the node. e.g.
kubectl get nodes
NAME STATUS ROLES AGE VERSION
node1 Ready master,worker 176d v1.26.9+52589e6
node2 Ready master,worker 176d v1.26.9+52589e6
node3 Ready master,worker 176d v1.26.9+52589e6
Using any of the node names (it doesn't matter which), run
kubectl debug node/node1 --image=ubuntu
When you have a debug session run something like the following:
/host/usr/bin/crictl -r unix:///host/run/crio/crio.sock pods --name core-dump-lgd5p -o json
where /host/usr/bin/crictl is the location to which you have configured crictl to be copied, unix:///host/run/crio/crio.sock is the crio socket (which may be in a different location), and core-dump-lgd5p is the pod name.
Expected output:
{
"items": [
{
"id": "df2bb27cbc78c2fb51aea8cb2f9eeb6124c871244a5fb71e989458bb673125df",
"metadata": {
"name": "core-dump-handler-7kqc6",
"uid": "c8ea5ce9-72be-4826-82b3-b8c3a8144d50",
"namespace": "observe",
"attempt": 0
},
"state": "SANDBOX_READY",
"createdAt": "1712691523593249607",
"labels": {
"controller-revision-hash": "7b6c988b5d",
"io.kubernetes.container.name": "POD",
"io.kubernetes.pod.name": "core-dump-handler-7kqc6",
"io.kubernetes.pod.namespace": "observe",
"io.kubernetes.pod.uid": "c8ea5ce9-72be-4826-82b3-b8c3a8144d50",
"name": "core-dump-ds",
"pod-template-generation": "1"
},
"annotations": {
"kubectl.kubernetes.io/default-container": "coredump-container",
"kubernetes.io/config.seen": "2024-04-09T14:38:43.120492372-05:00",
"kubernetes.io/config.source": "api",
"openshift.io/scc": "core-dump-admin-privileged"
},
"runtimeHandler": ""
}
]
}
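Assuming the JSON comes back in that shape, the value the composer needs is at .items[0].metadata.namespace. A quick way to check a captured sample on the debug node (python3 being available on the debug image is an assumption; the sample below is a trimmed copy of the output above):

```shell
# Extract the pod namespace from a (trimmed) `crictl pods -o json` sample.
cat <<'EOF' | python3 -c 'import json,sys; print(json.load(sys.stdin)["items"][0]["metadata"]["namespace"])'
{"items":[{"metadata":{"name":"core-dump-handler-7kqc6","namespace":"observe"}}]}
EOF
# prints: observe
```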
Hi, I set the filenameTemplate: "{uuid}-dump-{timestamp}-{hostname}-{exe_name}-{pid}-{signal}-{podname}-{namespace}", but I get a filename like this: "9a1fc79c-758c-4599-a22d-2e94444a3250-dump-1657867608-segfaulter-segfaulter-1-4-unknown-unknown.zip". How can I fix this?