Hey @act-mreeves, thanks for the feedback and interest in the project.
> What are some additional troubleshooting steps I can do to make sure this is set up correctly?

From the logs it would appear that Bottlerocket is no longer respecting the parameters that are associated with a core dump. Specifically, this line from your logs is concerning:
```
running ["-c", "/var/mnt/core-dump-handler/crictl.yaml", "pods", "--name", "%h", "-o", "json"]
```
The literal `%h` seems to indicate that the operating system is not substituting the core dump parameters as expected (`%h` is the kernel's hostname specifier).
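To sanity-check this on the node, you can look at the kernel's core pattern directly. This is just a sketch of the check; the exact pattern string varies by agent version, but the `%` specifiers (`%h` hostname, `%e` executable, `%p` PID, `%t` timestamp) come from the kernel's core(5) man page:

```sh
# From a shell on the worker node (or a privileged debug pod with host access):
cat /proc/sys/kernel/core_pattern
# The pattern should start with '|' (a pipe handler) followed by the agent
# binary and %-specifiers. The kernel expands those into the handler's
# arguments at crash time, so a literal "%h" reaching the handler means the
# substitution is not happening as expected.
```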
In order to get this up and running, I would suggest pulling back on the parameter parsing by removing it from the filename template:
https://github.com/IBM/core-dump-handler/blob/main/charts/core-dump-handler/values.yaml#L28

```yaml
filenameTemplate: "{uuid}-dump"
```
And disable all the crio calls via https://github.com/IBM/core-dump-handler/blob/main/charts/core-dump-handler/values.yaml#L25:

```yaml
ignoreCrio: true
```
If that runs correctly, you can add some of the parameters back into the filename one at a time to debug further.
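As a sketch, both overrides can be applied with a standard Helm upgrade instead of editing values.yaml by hand. The release name, namespace, and repo alias below are illustrative, and nesting the keys under `daemonset` assumes the layout in the current values.yaml:

```sh
# Write a small override file and upgrade the release with it
cat > debug-values.yaml <<'EOF'
daemonset:
  ignoreCrio: true
  filenameTemplate: "{uuid}-dump"
EOF
helm upgrade core-dump-handler core-dump-handler/core-dump-handler \
  --namespace observe --reuse-values -f debug-values.yaml
```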
If there is a change at the OS layer, you'll have to ask AWS about that or use a custom worker node image.
> Can we add any more details on how this works architecturally to improve my/our mental model? I say this mostly from ignorance.
Not sure exactly what you are looking for here. The component diagram describes how the system is deployed. If you are looking to understand how core dumps are actually generated, then the "How are they generated?" section of this blog post I wrote a while back should help: https://venshare.com/blog/what-is-a-core-dump/

Specifically, step 4 from that article is where I think the issue lies right now.
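To make step 4 concrete, here is a minimal self-contained demo of the kernel mechanism the agent relies on. This is not the project's handler, just a throwaway pipe handler for a scratch Linux VM (run as root):

```sh
# Install a trivial pipe handler that records the arguments the kernel passes
cat > /tmp/dump-args.sh <<'EOF'
#!/bin/sh
echo "args: $*" >> /tmp/core-args.log   # %-specifiers arrive expanded in argv
cat > /dev/null                         # drain the core image from stdin
EOF
chmod +x /tmp/dump-args.sh
echo '|/tmp/dump-args.sh %h %e %p %t' > /proc/sys/kernel/core_pattern

# Trigger a segfault in a child shell and inspect what the kernel passed
ulimit -c unlimited
sh -c 'kill -s SEGV $$'
cat /tmp/core-args.log
# Healthy output shows real values (hostname, "sh", a PID, a timestamp);
# a literal "%h" in the log would reproduce the symptom in this issue.
```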
---

Original report from @act-mreeves:

First of all, thanks for this project! I have a few questions/suggestions. For context, I am running AWS EKS 1.23 with Bottlerocket; I believe this did work with AWS EKS 1.22 on Bottlerocket. I have tried both the dockershim and containerd paths, as well as core-dump-handler chart versions 8.6.1 and 8.10.0, and none of the above has made a difference.
Is deleting the Helm chart really necessary, or should a traditional upgrade work? See https://github.com/IBM/core-dump-handler#updating-the-chart
I am running the segfault tool:

```sh
kubectl run -i -t segfaulter1 --image=quay.io/icdh/segfaulter -n default --restart=Never
```

I only ever got a result once while playing with the different chart versions/crio settings mentioned above. See log:
I looked at events to see which node the segfaulter pod landed on so I could exec into the core-dump-handler pod on that node.
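Roughly like this (the agent namespace and placeholder pod name are illustrative):

```sh
# Find the node the segfaulter pod was scheduled on
NODE=$(kubectl get pod segfaulter1 -n default -o jsonpath='{.spec.nodeName}')
# List the agent pods on that node, then exec into the matching one
kubectl get pods -n observe -o wide --field-selector spec.nodeName="$NODE"
kubectl exec -it -n observe <core-dump-handler-pod-on-that-node> -- sh
```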
So the S3 file showed up once:

Note the odd filename, where the variables are not interpolated. I can delete the segfaulter pod and rerun it with different names, but I have never gotten another successful core dump.
I can verify that if I exec into the pod and create a file under `/var/mnt/core-dump-handler/cores`, it ends up in S3. I can also confirm that from `/var/mnt/core-dump-handler` I can run `./crictl pods`, `./crictl images`, and `./crictl ps -a`. Notably, `logs` does not work, which may be due to Bottlerocket?
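For reference, those manual checks look roughly like this (agent pod name and container ID are placeholders):

```sh
# Step 1: open a shell in the agent pod on the relevant node
kubectl exec -it -n observe <core-dump-handler-pod> -- sh

# Step 2: inside that shell, exercise crictl and the upload directory
cd /var/mnt/core-dump-handler
./crictl -c crictl.yaml pods                  # works
./crictl -c crictl.yaml images                # works
./crictl -c crictl.yaml ps -a                 # works
./crictl -c crictl.yaml logs <container-id>   # does not work here
echo test > cores/manual-test.txt             # shows up in S3, so uploads work
```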
So my questions are (besides the Helm upgrade one above):

- What are some additional troubleshooting steps I can do to make sure this is set up correctly?
- Can we add any more details on how this works architecturally to improve my/our mental model? I say this mostly from ignorance.
Thanks again!