OpenCHAMI / roadmap

Public Roadmap Project for Ochami
MIT License

Logging and Events #49

Open alexlovelltroy opened 1 month ago

alexlovelltroy commented 1 month ago

Troubleshooting CSM systems and other HPC systems has taught us several lessons that we would like OpenCHAMI to benefit from. The goal of a logging and event system is not to surface all possible information for analysis. It is to help system administrators diagnose and remediate problems when they occur and to assess long-term trends.

Logging and Troubleshooting contexts

Troubleshooting happens in several different contexts in an HPC system. Logging and events need to support all of these contexts, which may overlap.

Structured Logging

Standard UNIX logging relies on messages that are emitted by programs, often to the controlling terminal of the process. These messages may have an internal structure, but there is no single format that all possible log messages follow. As a result, many log analysis tools need extensive customization options to identify patterns in logs and extract structured information.
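To make the contrast with free-form messages concrete, here is a minimal sketch of structured logging using Python's standard `logging` module. It is only an illustration, not an OpenCHAMI standard: the field names (`time`, `level`, `logger`, `message`), the `fields` convention for extra context, and the service and node names are all assumptions made for this example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: extra key/value context is passed
        # through `extra={"fields": {...}}` and merged into the entry.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def make_logger(name, stream):
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# Write to an in-memory buffer so the output can be inspected directly.
buffer = io.StringIO()
log = make_logger("example-service", buffer)
log.warning(
    "node failed to boot",
    extra={"fields": {"node": "node-001", "attempt": 3}},
)

# A consumer can parse the line without any custom pattern matching.
parsed = json.loads(buffer.getvalue())
```

Because every line is a self-describing JSON object, a tool can filter on `parsed["node"]` or `parsed["level"]` without a site-specific regular expression, which is the customization burden the paragraph above describes.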

Log aggregation is necessary for some contexts, but local logs can be even more powerful for troubleshooting. Supporting both makes diagnosis easier.

The OpenCHAMI community will develop and maintain a set of logging and metrics standards that apply to all OpenCHAMI services and support troubleshooting in the relevant contexts. These standards must be independent of technology choices.

The OpenCHAMI community will develop and maintain standards for infrastructure that supports logging and metrics at various scales, as well as conformance tests that allow sites to validate that a solution meets OpenCHAMI specifications.
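As a rough idea of what a conformance test might look like, the sketch below checks that a log line is a JSON object carrying a set of required fields. No such specification exists yet in this issue, so `REQUIRED_FIELDS` and `check_log_line` are hypothetical names chosen for illustration only.

```python
import json

# Hypothetical required fields; not taken from any OpenCHAMI specification.
REQUIRED_FIELDS = {"time", "level", "message"}


def check_log_line(line):
    """Return a list of problems in one log line (empty if it conforms)."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(entry, dict):
        return ["not a JSON object"]
    missing = sorted(REQUIRED_FIELDS - set(entry))
    return [f"missing field: {name}" for name in missing]


good = '{"time": "2024-01-01T00:00:00Z", "level": "INFO", "message": "ok"}'
bad = '{"level": "INFO"}'

good_result = check_log_line(good)  # conforms, so no problems are reported
bad_result = check_log_line(bad)    # reports the two missing fields
```

A check like this depends only on the log format, not on the service or the log pipeline that produced the line, which is one way a conformance test could stay independent of technology choices.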

alexlovelltroy commented 1 week ago

This is related to #7