Troubleshooting CSM systems and other HPC systems has taught us several lessons that we would like OpenCHAMI to benefit from. The goal of a logging and event system isn’t to surface all possible information for analysis. It is to help system administrators diagnose and remediate problems when they occur, and to assess long-term trends.
Logging and Troubleshooting Contexts
Troubleshooting happens in several different contexts in an HPC system. Logging and events in the system need to support these contexts, which may overlap.
Job Context: HPC systems exist to run jobs. Remediation at this level is urgent and important.
Node Context: When a node behaves differently from its peers, addressing the variance is important, but not urgent unless it interferes with jobs. Troubleshooting why a node isn’t booting belongs to this context.
System Context: System-wide issues that are not tied to the functioning of a single compute node are commonly precursors to job-related issues. Troubleshooting them is both urgent and important.
Control Plane Context: The management system itself must be more resilient to failures than any individual node or job. Troubleshooting problems in this context should be neither important nor urgent. Left long enough, however, they will escalate and impact jobs.
Analytical Context: Some performance and behavior issues become clear only across large datasets collected over time. The analytical toolset for addressing them is different from the immediate troubleshooting toolset.
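As a rough illustration of how these contexts could drive triage, the sketch below encodes the urgency and importance stated in the list above. The names Context, Priority, DEFAULT_PRIORITY, and triage are hypothetical and not part of OpenCHAMI. The Analytical Context has no default because the text assigns it none. Two further assumptions are made: overlapping contexts take the highest priority among them, and anything that interferes with jobs is treated as urgent and important.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Context(Enum):
    """Troubleshooting contexts named in this document."""
    JOB = "job"
    NODE = "node"
    SYSTEM = "system"
    CONTROL_PLANE = "control_plane"
    ANALYTICAL = "analytical"


@dataclass(frozen=True)
class Priority:
    urgent: bool
    important: bool


# Defaults taken from the context descriptions above.
# ANALYTICAL is omitted on purpose: the text gives it no urgency or importance.
DEFAULT_PRIORITY = {
    Context.JOB: Priority(urgent=True, important=True),
    Context.NODE: Priority(urgent=False, important=True),
    Context.SYSTEM: Priority(urgent=True, important=True),
    Context.CONTROL_PLANE: Priority(urgent=False, important=False),
}


def triage(contexts: Iterable[Context], interferes_with_jobs: bool = False) -> Priority:
    """Combine overlapping contexts (assumption: the highest priority wins).

    Any problem that interferes with jobs is escalated to urgent and important.
    """
    known = [DEFAULT_PRIORITY[c] for c in contexts if c in DEFAULT_PRIORITY]
    urgent = interferes_with_jobs or any(p.urgent for p in known)
    important = interferes_with_jobs or any(p.important for p in known)
    return Priority(urgent=urgent, important=important)
```

For example, triage({Context.NODE}) yields an important but not urgent priority, while passing interferes_with_jobs=True escalates the same node problem to urgent.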
Structured Logging
Standard UNIX logging relies on messages that are emitted by programs, often to the standard output or standard error stream of the process. These messages may have an internal structure, but there is no single format that all possible log messages follow. As a result, many log analysis tools offer extensive customization options to identify patterns in logs and extract structured information.
Log aggregation is necessary for some contexts, but local logs can be even more powerful for troubleshooting. Supporting both makes diagnosis easier.
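To make the contrast with free-form messages concrete, here is a minimal sketch of structured logging using Python's standard logging module. Each event is written as one JSON object per line to two destinations, standing in for a local log and an aggregation feed. The field names (context, node), the service name boot-service, and the node name are illustrative assumptions, not an OpenCHAMI schema, and in-memory streams stand in for a real file and a real log shipper.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            # Hypothetical fields supplied through `extra=`; not an OpenCHAMI standard.
            "context": getattr(record, "context", None),
            "node": getattr(record, "node", None),
        }
        return json.dumps(payload)


def build_logger(name, local_stream, aggregate_stream):
    """Attach one JSON handler per destination so both receive every event."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep output out of the root logger
    logger.handlers.clear()
    for stream in (local_stream, aggregate_stream):
        handler = logging.StreamHandler(stream)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


# In-memory streams stand in for a local log file and an aggregation pipeline.
local, aggregate = io.StringIO(), io.StringIO()
log = build_logger("boot-service", local, aggregate)
log.info("node failed to boot", extra={"context": "node", "node": "x1000c0s0b0n0"})

# A consumer can parse the event directly instead of pattern-matching text.
event = json.loads(local.getvalue())
```

Because every field is explicit, a tool can filter on event["context"] or event["node"] without site-specific parsing rules, and the same record is available both locally and to the aggregator.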
The OpenCHAMI community will develop and maintain a set of logging and metrics standards that apply to all OpenCHAMI services and support troubleshooting in the relevant contexts. These standards must be independent of technology choices.
The OpenCHAMI community will develop and maintain standards for infrastructure that supports logging and metrics at various scales, as well as conformance tests that allow sites to validate that a solution meets OpenCHAMI specifications.