OpenCHAMI / roadmap

Public Roadmap Project for Ochami
MIT License

[RFD] Logging infrastructure and version control server integration? #7

Open qwofford opened 7 months ago

qwofford commented 7 months ago

Some points to agree/disagree on, and discuss:

Not every log should trigger issues or work requests, but some could be useful. It would be interesting to discuss as the project takes shape.

alexlovelltroy commented 7 months ago

I'll endorse the idea of GitOps for cluster configuration changes, but we need to be careful of tight couplings.

We already have GitLab pipelines that create system images and upload them to an S3-like bucket, but we haven't integrated them directly into the ochami codebase because that is only one of many ways to create a system image. The integration point is that our boot management system needs to be able to point to a system image with a URL that includes a protocol and a descriptor. If we choose to switch to DVS or NFS for root filesystems, no BSS/SMD changes are required. If we choose to use IMS with CFS for image build and management, no BSS/SMD changes are required either. The same is true of operating system choices and network management tools.
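
To make the "protocol plus descriptor" idea concrete, here is a minimal sketch of how a boot service could treat an image reference as an opaque URL. The `ImageRef` type, the `parse_image_ref` helper, and the example URL are hypothetical illustrations, not part of BSS or SMD.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class ImageRef:
    """Hypothetical image reference: how to fetch it and what to fetch."""
    protocol: str    # e.g. "s3", "nfs", "https"
    descriptor: str  # everything after the protocol; opaque to the boot service


def parse_image_ref(url: str) -> ImageRef:
    """Split an image URL into protocol and descriptor without interpreting either."""
    url = url.strip()
    scheme = urlsplit(url).scheme
    if not scheme:
        raise ValueError(f"image URL has no protocol: {url!r}")
    # Drop "<scheme>:" and any leading slashes; keep the rest untouched.
    descriptor = url[len(scheme) + 1:].lstrip("/")
    if not descriptor:
        raise ValueError(f"image URL has no descriptor: {url!r}")
    return ImageRef(protocol=scheme, descriptor=descriptor)


if __name__ == "__main__":
    # Hypothetical bucket and image names, for illustration only.
    print(parse_image_ref("s3://boot-images/compute.squashfs"))
```

Because the boot service only stores and hands out the URL, swapping the image builder or the root filesystem transport changes the value of the URL, not the service that holds it.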

These loose couplings allow different pieces of the software stack to evolve at different rates without knowledge of the other parts of the system. Requiring that all configuration changes happen through a git server would be a tight coupling. However, there are design choices we can make to ensure that our APIs can be called from an unattended script running in a GitLab runner. I'm fully in favor of that.
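
As a rough sketch of what "callable from an unattended script" implies, the snippet below builds an API request purely from environment variables, so nothing ever prompts for input. The variable names `OCHAMI_API_URL` and `OCHAMI_API_TOKEN`, the endpoint path, and the payload are all assumptions made up for illustration; they are not existing ochami interfaces.

```python
import json
import os
import urllib.request


def build_request(env=os.environ) -> urllib.request.Request:
    """Build an API request from environment variables only (no prompts)."""
    base_url = env["OCHAMI_API_URL"]    # hypothetical variable name
    token = env["OCHAMI_API_TOKEN"]     # hypothetical variable name
    body = json.dumps({"example": "payload"}).encode("utf-8")  # placeholder payload
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/example/endpoint",  # placeholder path
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # A CI job would inject the two variables as masked settings and then send
    # the request with urllib.request.urlopen(); here we only show the target.
    req = build_request({"OCHAMI_API_URL": "https://api.example.test",
                         "OCHAMI_API_TOKEN": "dummy"})
    print(req.get_method(), req.full_url)
```

The same script works from a GitLab runner, a cron job, or a laptop, which is the point: the API does not need to know which one is calling it.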

In my opinion, our tooling should allow sites to make their own choices about how to handle change management. That includes how and when to trigger actions based on events from the system. If our logging contract ensures that ochami can provide a unified, namespaced feed of logs and events, each site can then decide which log/event processing tool it is most comfortable with. The site can also decide how to apply things like pattern recognition or even AI to further aid and automate troubleshooting and remediation.
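
One possible shape for such a feed, sketched under assumptions: each event is a single JSON line carrying a dotted namespace, and consumers filter by namespace prefix. The field names and the `ochami.bss` namespace below are hypothetical, not an agreed contract.

```python
import json
from datetime import datetime, timezone


def make_event(namespace: str, level: str, message: str, **fields) -> str:
    """Serialize one log/event record as a JSON line with a dotted namespace."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "namespace": namespace,  # e.g. "ochami.bss" (hypothetical)
        "level": level,
        "message": message,
        "fields": fields,        # free-form, service-specific details
    }
    return json.dumps(record, sort_keys=True)


def in_namespace(line: str, prefix: str) -> bool:
    """Return True if the event's namespace equals prefix or is nested under it."""
    namespace = json.loads(line)["namespace"]
    return namespace == prefix or namespace.startswith(prefix + ".")


if __name__ == "__main__":
    line = make_event("ochami.bss", "info", "boot parameters updated", node="x1000")
    print(in_namespace(line, "ochami"))
```

A site could route lines like these to whatever log processor it already runs; deciding which events open tickets or trigger remediation stays on the site's side of the contract.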