NearNodeFlash / NearNodeFlash.github.io


Support for core dumps #91

Open bdevcich opened 10 months ago

bdevcich commented 10 months ago

We need to address the handling of core dumps on our rabbit nodes, particularly for applications running within the NNF software. These applications, mainly data movement and user containers, run in K8s pods that target rabbit nodes, and those pods can segfault during operation.

For data movement, running mpirun and dcp (unless configured otherwise) could produce segfaults, while user containers introduce even broader possibilities.

To ensure convenient access to core dumps in either scenario, we should implement a streamlined solution. Here's something to get started/discuss:

  1. Allocate a section of the rabbit's M.2 drive specifically for this purpose. For this discussion, let's assume that filesystem is mounted at /mnt/coredumps.

  2. Configure the rabbit nodes to deposit core dumps into this directory by setting /proc/sys/kernel/core_pattern on the host. Containers share the host kernel, so they pick up the same setting.

  3. For containers running on rabbit nodes, ensure that the host's /mnt/coredumps is mounted in the containers.

  4. For data movement, we'll need to build debug images that include versions of openmpi and dcp that are built with symbols so the core dump can be analyzed.
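
As a starting point for discussion, steps 2 and 3 could be sketched roughly as below. This is only a sketch under assumptions: the helper names, the core file naming scheme, and the volume name `coredumps` are all hypothetical, and /mnt/coredumps is the assumed mount point from step 1. The `%h`, `%e`, `%p`, and `%t` specifiers are the ones documented in core(5). The pod fragment mounts the host directory at the same path inside the container, on the assumption that a path-style core_pattern has to resolve inside the container for the dump to land on the host.

```python
from pathlib import Path

# Assumed mount point from step 1 of the proposal.
COREDUMP_DIR = "/mnt/coredumps"
CORE_PATTERN_FILE = Path("/proc/sys/kernel/core_pattern")


def build_core_pattern(directory: str = COREDUMP_DIR) -> str:
    """Return a core_pattern value that writes core files into `directory`.

    Specifiers (see core(5)): %h hostname, %e executable name,
    %p PID, %t time of dump in seconds since the epoch.
    The file naming scheme itself is an assumption, not an NNF convention.
    """
    return f"{directory.rstrip('/')}/core.%h.%e.%p.%t"


def apply_core_pattern(pattern: str, target: Path = CORE_PATTERN_FILE) -> None:
    """Write the pattern to the kernel setting. Requires root on the host."""
    target.write_text(pattern + "\n")


def coredump_volume_fragment(directory: str = COREDUMP_DIR,
                             name: str = "coredumps") -> dict:
    """Return hypothetical pod-spec fragments for step 3.

    The host directory is exposed through a hostPath volume and mounted
    at the same path inside the container.
    """
    return {
        "volume": {
            "name": name,
            "hostPath": {"path": directory, "type": "Directory"},
        },
        "volumeMount": {"name": name, "mountPath": directory},
    }


if __name__ == "__main__":
    # Print rather than apply, so this is safe to run without root.
    print(build_core_pattern())
    print(coredump_volume_fragment())
```

On a rabbit node, `apply_core_pattern(build_core_pattern())` would need to run as root, and the setting would not persist across reboots unless it is also configured through sysctl. The two fragments would go under `spec.volumes` and the container's `volumeMounts` in the pod spec.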

Additionally, there is an operator that handles coredumps, so some of the above might be moot: https://github.com/IBM/core-dump-handler. Here's an article on it as well: https://cloud.redhat.com/blog/a-guide-to-core-dump-handling-in-openshift. I have not yet looked into these solutions.

For HPE staff, there will still be a logistical challenge in analyzing the core dump. We currently do not have permissions to exec into containers or run podman on the login nodes. This limits how we can analyze a core dump unless we can either copy it to our own machines or run debugging images against it. More investigation of this process is necessary and probably warrants its own issue.