IBM / core-dump-handler

Save core dumps from a Kubernetes Service or RedHat OpenShift to an S3 protocol compatible object store
https://ibm.github.io/core-dump-handler/
MIT License

Additional troubleshooting advice? #142

Closed. act-mreeves closed this issue 4 months ago.

act-mreeves commented 1 year ago

First of all, thanks for this project! I have a few questions/suggestions. For context, I am running on AWS EKS 1.23 with Bottlerocket. I believe this did work with AWS EKS 1.22 on Bottlerocket. I have played with using both the dockershim and containerd paths, as well as core-dump-handler chart versions 8.6.1 and 8.10.0, and none of the above has made a difference.

  1. Is deleting the helm chart really necessary or should a traditional upgrade work? See https://github.com/IBM/core-dump-handler#updating-the-chart

  2. I am running the segfault tool: kubectl run -i -t segfaulter1 --image=quay.io/icdh/segfaulter -n default --restart=Never. I only ever got a result once while playing with the different chart versions/crio settings mentioned above.

See log:

[root@core-dump-handler-jkdfd core-dump-handler]# tail composer.log 
 IGNORE_CRIO=false
CRIO_IMAGE_CMD=img
USE_CRIO_CONF=true
INFO - 2023-03-22T22:09:01.166806321+00:00 - Set logfile to: "\"/var/mnt/core-dump-handler/composer.log\""
DEBUG - 2023-03-22T22:09:01.166827538+00:00 - Creating dump for cluster3-dev-dump-%h-%e-%t-%p-SIG%s-bf6c0ac3-e9fb-447b-ad10-bbf4cd87e94b
DEBUG - 2023-03-22T22:09:01.166832734+00:00 - running ["-c", "/var/mnt/core-dump-handler/crictl.yaml", "pods", "--name", "%h", "-o", "json"] "/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/home/kubernetes/bin:/var/mnt/core-dump-handler"
ERROR - 2023-03-22T22:09:01.268990470+00:00 - failed to create pod at index 0
DEBUG - 2023-03-22T22:09:01.269027400+00:00 - No pod selector specified, selecting all pods
DEBUG - 2023-03-22T22:09:01.269153129+00:00 - Create a JSON file to store the dump meta data
cluster3-dev-dump-%h-%e-%t-%p-SIG%s-bf6c0ac3-e9fb-447b-ad10-bbf4cd87e94b-dump-info.json

I looked at events to see which node the segfaulter landed on so I could exec into the core-dump-handler pod on that node.
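For anyone reproducing this, the node can also be read straight off the pod with plain kubectl, e.g.:

kubectl get pod segfaulter1 -n default -o wide

The NODE column shows where it landed.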

So the S3 file showed up once:

aws s3api list-objects-v2 --bucket my-bucket  --query 'Contents[?LastModified>`2023-03-01`].Key'
[
    ... SNIP ...
    "cluster3-dev-dump-%h-%e-%t-%p-SIG%s-bf6c0ac3-e9fb-447b-ad10-bbf4cd87e94b.zip",
    ... SNIP ...
]

Note the odd filename where the variables are not interpolated. I can delete the segfaulter pod and rerun it with different names, but I have never gotten another successful core dump.
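For context, %h, %e, %t, %p and %s are specifiers from the kernel's core_pattern (hostname, executable name, dump timestamp, PID and signal, per core(5)), so the kernel is expected to expand them before the dump ever reaches the handler. A quick way to see what core_pattern the node is actually configured with (run on the node, or from the handler pod if it can see the host's /proc; the exact pipe command will differ per install):

cat /proc/sys/kernel/core_pattern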

I can verify that if I exec into the pod and create a file at /var/mnt/core-dump-handler/cores, it ends up in S3.
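In case it helps anyone reproduce that check, it was roughly the following (the pod/namespace names are placeholders for whatever the daemonset pod is called in your install, and the file name is a throwaway):

kubectl exec -it <core-dump-handler-pod> -n <namespace> -- sh
touch /var/mnt/core-dump-handler/cores/manual-test.zip

and then watching for manual-test.zip to appear in the bucket.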

I can confirm that from /var/mnt/core-dump-handler I can run ./crictl pods, ./crictl images, and ./crictl ps -a.

Notable "logs" does not work which may be due to bottlerocket?

./crictl logs ee76c9b4bdcfa
FATA[0000] failed to try resolving symlinks in path "/var/log/pods/default_segfaulter-2_28576e92-0c42-4eec-91b9-472efdcac2ac/segfaulter-2/0.log": lstat /var/log/pods: no such file or directory 

So my questions are (besides the helm upgrade one above):

  1. What are some additional troubleshooting steps I can do to make sure this is set up correctly?
  2. Can we add any more details on how this works architecturally to improve my/our mental model? I say this mostly from ignorance.

Thanks again!

No9 commented 1 year ago

Hey @act-mreeves thanks for the feedback and interest in the project.

  1. What are some additional troubleshooting steps I can do to make sure this is set up correctly?

    From the logs it would appear that Bottlerocket is no longer respecting the parameters that are associated with a core dump. Specifically, this line from your logs is concerning:

    running ["-c", "/var/mnt/core-dump-handler/crictl.yaml", "pods", "--name", "%h", "-o", "json"]

    The "%h" seems to indicate that the operating system is not replacing the parameters as expected.

    In order to get this up and running, I would suggest pulling back on the param parsing by removing the parameters from the filename template:

    https://github.com/IBM/core-dump-handler/blob/main/charts/core-dump-handler/values.yaml#L28

    filenameTemplate: "{uuid}-dump"

    And disable all the crio calls with https://github.com/IBM/core-dump-handler/blob/main/charts/core-dump-handler/values.yaml#L25

    ignoreCrio: true

    If that runs correctly you can add some of the parameters back into the filename to debug further; a combined sketch of both overrides is at the end of this comment.

    If there is a change at the OS layer you'll have to ask AWS about that or use a custom worker node image.

  2. Can we add any more details on how this works architecturally to improve my/our mental model? I say this mostly from ignorance.

    Not sure exactly what you are looking for here. The component diagram describes how the system is deployed.

    If you are looking to understand how core dumps are actually made, then the "How are they generated?" section of this blog post I wrote a while back should help: https://venshare.com/blog/what-is-a-core-dump/

    Specifically step 4 from that article is where I think we have the issue right now.
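To make the suggestions in point 1 concrete, here is a rough sketch of the two overrides together. I am assuming the keys sit under a composer section of the chart's values.yaml (check the nesting at the lines linked above), and the release name, repo alias and namespace below are placeholders for whatever your install uses:

# values-troubleshoot.yaml
composer:
  filenameTemplate: "{uuid}-dump"
  ignoreCrio: true

helm install core-dump-handler <repo-alias>/core-dump-handler -n <namespace> -f values-troubleshoot.yaml

Apply it via whichever install flow you normally use; the delete-vs-upgrade question from the original post still applies either way.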