Support for application level artifacts

splitice commented 1 year ago

Would it be possible to mark an emptyDir volume for collection in place of native core dumps (or as well as them)?

The reason I ask is that we also want to collect heapdumps made on OOM from nodejs (there are probably other applications that exhibit similar behaviour). Node does this with --heapsnapshot-near-heap-limit and could be configured to output to an emptyDir volume or similar.

My theory would be to have core-dump-handler responsible for uploading any files it finds in these folders to its object store, along with some basic metadata.
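To make "files plus some basic metadata" concrete, here is a minimal sketch in Python rather than the project's own code. The function names `describe_artifact` and `collect_artifacts` and the metadata fields are hypothetical, chosen only for illustration. The `.heapsnapshot` suffix is the extension Node uses for the files it writes.

```python
import hashlib
from pathlib import Path


def describe_artifact(path: Path) -> dict:
    """Build a small metadata record for one artifact file (hypothetical schema)."""
    digest = hashlib.sha256()
    # Hash in chunks so a multi-gigabyte heap dump is never held in memory at once.
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    stat = path.stat()
    return {
        "name": path.name,
        "size_bytes": stat.st_size,
        "modified_epoch": int(stat.st_mtime),
        "sha256": digest.hexdigest(),
    }


def collect_artifacts(folder: Path, suffix: str = ".heapsnapshot") -> list:
    """Return metadata for every regular file in `folder` ending with `suffix`."""
    return [
        describe_artifact(entry)
        for entry in sorted(folder.iterdir())
        if entry.is_file() and entry.name.endswith(suffix)
    ]
```

An uploader could send each file together with its record. Pod and image details would still have to come from somewhere else, which is the harder part discussed further down.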

Thoughts? Aligned enough with the goals of this project, or something separate?

No9 commented 1 year ago

Hey @splitice, support for this type of feature was always intended but never completed, as it's a little more complex than it first looks.

The first challenge was redirecting the output of --heapsnapshot-near-heap-limit but this seems to be completed https://github.com/nodejs/node/issues/39493

The next challenge is:

how to share the heapsnapshot with an uploader?

As you are probably aware this project "Just works" by configuring the kernel.core_pattern of each k8s node so that every crashing pod will pass through to the composer without any additional configuration in the workload pod. It manages the privilege escalation of configuring the node outside of the deployed workload pod. In the case of heapsnapshot I don't think we are able to rely on the operating system facilities on the k8s node so we will have to define another method.

Here are the options I've thought about. There may be others, so feel free to add to the list:

EmptyDir

The emptyDir option creates storage that only lasts as long as the workload pod, although it does survive container crashes. Today you can use a sidecar attached at pod startup to monitor the heapdump location and upload the heapdump. (FYI I have investigated debug containers but I don't think they help here, as they would need to be signaled somehow.)

The sidecar approach adds additional workload and complexity to each deployment. It could possibly be improved by extending the coredump agent to set up inotify events on predefined locations within the container file system in /etc/run.
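A rough sketch of the monitoring half, in Python for brevity. Note the swap: it uses plain polling rather than inotify, because polling needs nothing beyond the standard library. The class name `StableFileWatcher` is hypothetical. A file is reported only once its size is unchanged between two consecutive polls, which is an assumed heuristic for "the dump has finished writing" and not something Node guarantees.

```python
from pathlib import Path


class StableFileWatcher:
    """Polling stand-in for inotify: reports files whose size has stopped changing."""

    def __init__(self, folder):
        self.folder = Path(folder)
        self._last_sizes = {}    # file name -> size seen on the previous poll
        self._reported = set()   # file names already handed to the caller

    def poll(self):
        """Return files that are new since the last report and look fully written."""
        current = {
            entry.name: entry.stat().st_size
            for entry in self.folder.iterdir()
            if entry.is_file()
        }
        ready = []
        for name, size in sorted(current.items()):
            if name in self._reported:
                continue
            # Same size on two consecutive polls: assume the writer is done.
            if self._last_sizes.get(name) == size:
                self._reported.add(name)
                ready.append(self.folder / name)
        self._last_sizes = current
        return ready
```

Calling `poll()` on a timer and uploading whatever it returns would be the simplest loop. An inotify-based version would react faster, but it would still need some rule for deciding when a large dump is complete.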

The monitoring could be set up either when specifically annotated pods are created or, preferably, when pods with a specifically named emptyDir are created. There is a sample Rust pod watcher as part of kube-rs that should help as a starting point: https://github.com/kube-rs/kube/blob/main/examples/pod_watcher.rs

This is quite complex but has the benefit of running on plain Kubernetes without any privileges or storage configuration beyond what is currently required. I think this approach merits further investigation.

HostPath

The HostPath option is more straightforward, as the heapdump could be stored beyond the lifecycle of the pod and the agent could be extended to monitor an additional folder. This is more in line with how the current agent works, but as it requires the workload pods to have elevated privileges I don't think it's the best approach.

ReadWriteMany

Similar to the HostPath option, it's possible to assign the workload pod and the agent pod a shared ReadWriteMany volume. This has the benefit of not requiring privileged access to the host, but it will require RWX-compatible storage in the k8s environment, which may not be available by default in some common scenarios. If the emptyDir scenario doesn't work then this would be my preferred next option.

Adding @mhdawson as we discussed the status of dump support in node recently and he may have found other options.

splitice commented 1 year ago

@No9 I love your response. Everything I was thinking (but in substantially more depth).

My opinion would be to avoid ReadWriteMany, it's simply not available for every K8s platform.

EmptyDir would be ideal if no other limitations prevent implementation. I think the lifecycle limitation could be avoided by only supporting multi-container pods; these should keep the emptyDir around on a container restart (according to my understanding).

Otherwise I think hostpath is actually possible without elevated permissions (as long as the parent dir is chmod 0777) but I might be wrong.

A fourth option (perhaps as a last resort) would be a custom CSI driver. We have one for overlay filesystem mounts and suspect a custom mount (fuse or tmpfs) accessible to the collector could also work.

No9 commented 1 year ago

Thanks @splitice It was a bit of a brain dump but glad it made sense.

Agreed ReadWriteMany config is a heavy requirement.

Thinking about it further, I'm not sure about approaching EmptyDir with sidecars, either for keeping the file system alive or for uploading the heapdump. This is especially true when it comes to memory issues, as k8s could evict the whole pod, not just the individual containers, and then we've lost the heapdump.

No matter what the permissions on the folder, HostPaths in OpenShift require specific security exceptions as they are an attack vector. The general position across the k8s ecosystem is that they shouldn't be used for regular workloads if possible, and I think they should be avoided.

This project used to use an object storage fuse when it first started. It was very platform specific, and while there are providers for most clouds, consolidating them into a single release was beyond the scope of this project. I'm not sure if there are better scenarios now with CSI drivers, but I'm open to hearing about them.

Depending on what we can find out about CSI my current approach would be:

1. The workload pod creates and mounts an annotated PVC for a PV that is sized based on the memory of the container and the heap config. As we don't want k8s evicting the pod because a heapdump was triggered, we will need to understand how node has to be configured with respect to the heap size and to the resource constraints of the pod. https://nodejs.dev/en/api/v18/cli#--heapsnapshot-near-heap-limitmax_count https://nodejs.dev/en/api/v18/cli#--max-old-space-sizesize-in-megabytes https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
2. The agent watches for annotated PVCs being created and sets up monitors for them to watch for changes.
3. When a heapdump is generated, the agent maps the heapdump to the pod and image info, similar to how it's currently done in the composer, and zips the content.
4. The agent then uploads it to the object storage location.
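The sizing question in the first step above can be sketched as simple arithmetic. This is Python and purely illustrative. It assumes, without having verified it, that a single snapshot is no larger than the old-space limit, and that all values are in the same megabyte unit. The function names, the `headroom` factor and the `reserve_fraction` are hypothetical placeholders that would need measuring against real workloads.

```python
import math


def pvc_size_mb(max_old_space_mb: int, max_count: int, headroom: float = 1.2) -> int:
    """Estimate the volume size needed to hold `max_count` heap snapshots.

    ASSUMPTION: each snapshot is at most `max_old_space_mb` in size; `headroom`
    is an arbitrary safety factor, not a value taken from the Node docs.
    """
    if max_old_space_mb <= 0 or max_count <= 0:
        raise ValueError("heap size and snapshot count must be positive")
    return math.ceil(max_old_space_mb * max_count * headroom)


def heap_fits_container(max_old_space_mb: int, container_limit_mb: int,
                        reserve_fraction: float = 0.25) -> bool:
    """Check that the heap limit leaves part of the container memory limit free.

    ASSUMPTION: reserving `reserve_fraction` of the limit for non-heap memory is
    enough for Node to hit its own heap limit before k8s kills the container.
    """
    return max_old_space_mb <= container_limit_mb * (1 - reserve_fraction)
```

For example, `--max-old-space-size=1000` with `--heapsnapshot-near-heap-limit=1` and a headroom of 1.5 would suggest a 1500 MB volume under these assumptions.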

Now that I have written it out, if we took the approach of steps 1-4 there is no obvious benefit to hosting it in this project. The zip and info file creation currently relies on access to the host crictl, which wouldn't be available, so it would have to be rewritten, although the upload config may be useful.

This probably needs a bit more research as we should look to see if there is prior art that might be useful in things like logging agents.

IBM / core-dump-handler