IBM / core-dump-handler

Save core dumps from a Kubernetes Service or RedHat OpenShift to an S3 protocol compatible object store
https://ibm.github.io/core-dump-handler/
MIT License

Error generating reports #115

Open chzgustavo opened 1 year ago

chzgustavo commented 1 year ago

Hello, I am using this tool (congratulations, it is very good), but I have noticed that when a segmentation fault occurs, it sometimes generates all of the files with the name of a different namespace.

I attach evidence.

NAME                    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                        APP VERSION
core-dump-handler       observe         1               2022-07-01 04:33:50.377219926 +0000 UTC deployed        core-dump-handler-v8.6.0     v8.6.0    

It occurs to me to update core-dump-handler to the newest version, but I don't know whether that will solve this problem.

chzgustavo commented 1 year ago

Do you have any idea how I could debug this error?

Regards, Gustavo.

No9 commented 1 year ago

Hi @chzgustavo, thanks for the feedback, really appreciate it.

Do you have pods with the same name running in different namespaces?

Background

The information from crio is currently queried using the hostname of the crashing container, which is assumed to be unique.

This container hostname is then used to match the pod: https://github.com/IBM/core-dump-handler/blob/main/core-dump-composer/src/main.rs#L75

It isn't ideal, but using the hostname is the only way I am aware of to capture the crashing container's information.

This isn't an issue in most deployment scenarios, as people tend to use ReplicaSets/Deployments, which generate a unique id for each pod.

However, if you are creating pods directly in each namespace, then you have the potential to hit a name clash, as sketched below.
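For illustration only, here is a minimal sketch of that scenario (the pod, namespace, and image names below are made up): a pod's hostname defaults to its metadata.name, so both of these pods report the hostname `worker`, and a hostname-based lookup cannot tell them apart.

```yaml
# Sketch only: all names here are hypothetical.
# Two bare pods with the same metadata.name in different namespaces
# end up with the same default hostname ("worker").
apiVersion: v1
kind: Pod
metadata:
  name: worker              # hostname defaults to the pod name
  namespace: team-a
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "infinity"]
---
apiVersion: v1
kind: Pod
metadata:
  name: worker              # same name, different namespace -> same hostname
  namespace: team-b
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "infinity"]
```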

Possible Solution

If that sounds like the problem, I would suggest giving each pod a unique name when provisioning.

chzgustavo commented 1 year ago

Yes, indeed, I have many pods with the same name running in different namespaces. The pods that generate segmentation faults belong to StatefulSet resources.

chzgustavo commented 1 year ago

They all have the same hostname (but they are in different namespaces). Is there any other possible solution for this case? Thanks for your help!

No9 commented 1 year ago

Sorry, I'm not aware of another possible solution.

StatefulSets intentionally name their pods with ordinal numbers: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#pod-identity

If you're using Helm, you can add the namespace to the StatefulSet name, which would resolve this. I know it's clunky, but it should handle it well enough.
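As a rough sketch of that workaround (the chart layout, "myapp" name, labels, and image below are assumptions, not taken from this project), a Helm template could fold the release namespace into the StatefulSet name so the generated pod names, and therefore hostnames, become unique across namespaces:

```yaml
# templates/statefulset.yaml -- sketch only; "myapp", the labels and the image are hypothetical
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: {{ .Release.Namespace }}-myapp      # e.g. "observe-myapp" in the observe namespace
spec:
  serviceName: {{ .Release.Namespace }}-myapp
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myorg/myapp:latest          # hypothetical image
```

Installed into the `observe` namespace this would produce pods such as `observe-myapp-0`, so the hostname the handler sees no longer clashes with the pod of the same ordinal in another namespace.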

The underlying issue here is that kernel.core_pattern is per host, not per container, so it's not possible to feed dynamic information from the pod to the kernel at runtime.

As systemd becomes more pod-aware there may be a possibility to do something there, but the last time I looked it just seemed to pass through to the system code.

[Edit] I will add this to the FAQ, as it seems like a fairly common scenario that will trip others up.

[Edit 2] I'll double-check the statuses in the responses from CRIO; it may be possible to detect whether a pod is crashing and, if it isn't, move on to the next pod. I seem to remember looking at this when I wrote it and it wasn't possible, but I'll double-check. I won't get to that for a bit, though, as I have to look at #114 first.

jesuslinares commented 1 year ago

Hi @No9,

Thanks for the information. We are still hitting this bug in production, since we didn't apply the "clunky workaround".

Have you made any progress on fixing it?

This project is very useful for us, thanks for the good work.