Azure / az-hop

The Azure HPC On-Demand Platform provides an HPC Cluster Ready solution
https://azure.github.io/az-hop/
MIT License

Log collector #989

Open matt-chan opened 2 years ago

matt-chan commented 2 years ago

In what area(s)?

/area administration /area ansible /area autoscaling /area configuration /area cyclecloud /area documentation /area image /area job-scheduling

/area monitoring

/area ood /area remote-visualization /area user-management

Describe the feature

It would be nice if we could view text logs without having to log into the individual VMs. We are having an issue with the Slurm scheduler and it's quite hard to debug without seeing the logs. It would be really nice if Grafana Loki were configured.

cc @ltalirz

xpillons commented 2 years ago

Thanks for the suggestion, having a log collector is definitely a nice-to-have.

matt-chan commented 2 years ago

If this gets implemented, it would be really nice if it also grabbed the AD logs from the Windows server. It's super hard to debug AD failures, so that would be a really common use case.

ltalirz commented 2 years ago

Hi @xpillons, we are realizing more and more that we need a log collector to run az-hop reliably and to troubleshoot issues effectively as they arise. We may be able to find some time to contribute a prototype implementation.

If you have any specific suggestions / hints / things to keep in mind, please let us know.

xpillons commented 2 years ago

If possible, the goal is to leverage Azure Monitor and Log Analytics. This would avoid having to maintain a separate infrastructure for log analysis.
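To illustrate what the Log Analytics route could look like for the Slurm debugging case raised above, here is a minimal sketch that builds a KQL query against the workspace's `Syslog` table. It assumes that syslog collection is already enabled on the VMs and that `slurmctld` logs through syslog rather than to a file; neither is confirmed in this thread. The helper name `build_syslog_query` and its parameters are hypothetical and not part of az-hop.

```python
from datetime import timedelta


def build_syslog_query(process_name: str, lookback: timedelta, limit: int = 100) -> str:
    """Build a KQL query for recent entries of one process in the Syslog table.

    Hypothetical helper: assumes the Log Analytics workspace already
    receives syslog data from the cluster VMs.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    minutes = int(lookback.total_seconds() // 60)
    if minutes <= 0:
        raise ValueError("lookback must be at least one minute")

    # Escape backslashes and double quotes so the name is a safe KQL string literal.
    safe_name = process_name.replace("\\", "\\\\").replace('"', '\\"')

    return "\n".join(
        [
            "Syslog",
            f"| where TimeGenerated > ago({minutes}m)",
            f'| where ProcessName == "{safe_name}"',
            "| project TimeGenerated, Computer, SeverityLevel, SyslogMessage",
            "| order by TimeGenerated desc",
            f"| take {limit}",
        ]
    )


if __name__ == "__main__":
    # Assumed process name; adjust to whatever the scheduler actually logs as.
    print(build_syslog_query("slurmctld", timedelta(hours=1), limit=50))
```

The resulting text can be pasted into the Logs blade of the workspace or passed to whichever query client is in use, so scheduler logs from all VMs can be read in one place without logging into each machine.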

ltalirz commented 2 years ago

Just mentioning that this is also a "high" severity recommendation by the monitoring advisor:

"Log Analytics agent should be installed on virtual machines"

Defender for Cloud collects data from your Azure virtual machines (VMs) to monitor for security vulnerabilities and threats. Data is collected using the Log Analytics agent, formerly known as the Microsoft Monitoring Agent (MMA), which reads various security-related configurations and event logs from the machine and copies the data to your Log Analytics workspace for analysis. This agent is also required if your VMs are used by an Azure managed service such as Azure Kubernetes Service or Azure Service Fabric. We recommend configuring auto-provisioning to automatically deploy the agent. If you choose not to use auto-provisioning, manually deploy the agent to your VMs using the instructions in the remediation steps.

xpillons commented 2 years ago

Will look into this.