
BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team (This includes work for: 1. Platform Experience, 2. Developer Experience 3. Platform Operations/OCP 3)

Apache License 2.0


Test Loki in CLAB #4646

Open StevenBarre opened 5 months ago

StevenBarre commented 5 months ago

Describe the issue: The current logging stack is being deprecated in future versions of OpenShift. We need to test Loki and the migration process.

What is the Value/Impact? Keeping current with technology

What is the plan? How will this get completed?

Document current IO load on logging volumes in NetApp
Read docs on the migration process
Work with storage to set up S3 access on NetApp
Migrate to Loki
Review IO load from S3
Test Loki
Update any tooling and docs
Make a plan for PROD

Identify any dependencies: Storage team

Definition of done: Loki running in CLAB and a plan to move forward

StevenBarre commented 2 months ago

Capture infra node cpu/memory/network usage before/after
Capture collector pod cpu/memory/network usage before/after

StevenBarre commented 1 week ago

Directions to install Loki while keeping Elastic and Kibana around until all the logs in them age out: https://access.redhat.com/articles/6991632

Installed in CLAB, leveraging a temp S3 endpoint on the NetApp. Storage team would prefer we use ECS when this is operationalized.

Currently running without redundancy as that was causing errors.

Will need to discuss more with Matt on tuning parameters and query formats.

StevenBarre commented 5 days ago

https://grafana.com/docs/loki/latest/query/log_queries/

StevenBarre commented 5 days ago

Case insensitive search

|~ `(?i)mystring`
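
A minimal sketch of building that filter from an arbitrary search term in Python, assuming the term should be matched literally. The helper name `case_insensitive_filter` is hypothetical. It escapes regex metacharacters and wraps the pattern in backticks (LogQL's raw-string quoting), which is why a term containing a backtick is rejected.

```python
import re


def case_insensitive_filter(term: str) -> str:
    """Return a LogQL regex line filter that matches `term` literally, ignoring case."""
    if "`" in term:
        # A backtick would terminate the raw string early.
        raise ValueError("term must not contain a backtick")
    # (?i) turns on case-insensitive matching for the whole pattern.
    return "|~ `(?i)" + re.escape(term) + "`"


print(case_insensitive_filter("mystring"))
```

Append the result to a stream selector, for example `{ log_type="application" }`, to get a complete query.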

StevenBarre commented 5 days ago

How to query the API via HTTP: https://access.redhat.com/solutions/7046397
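
A rough sketch of such a request from Python, standard library only. Assumptions, not taken from the linked solution: the LokiStack gateway serves Loki's `query_range` endpoint under `/api/logs/v1/<tenant>/`, it accepts a bearer token (for example from `oc whoami -t`), and the host name and token below are placeholders. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_request(base_url, tenant, logql, token, limit=100):
    """Build (but do not send) a query_range request for one tenant."""
    params = urllib.parse.urlencode({"query": logql, "limit": limit})
    # Assumed gateway path layout; verify against your cluster's route.
    url = f"{base_url.rstrip('/')}/api/logs/v1/{tenant}/loki/api/v1/query_range?{params}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def run_query(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


req = build_query_request(
    "https://loki-gateway.example.invalid",  # placeholder host
    "application",
    '{ log_type="application" } |~ `(?i)mystring`',
    "<token>",  # placeholder token
)
print(req.full_url)
```

Without `start` and `end` parameters, `query_range` covers the last hour. Call `run_query(req)` once the host and token are real.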

StevenBarre commented 4 days ago

Changed the size from 1x.demo to 1x.extra-small and that fixed replication and the PDB errors. Needed to expand onto the worker nodes as infra didn't have enough capacity while ES is still in place.

StevenBarre commented 4 days ago

Audit Log query to find Deletes by non-system users

{ log_type="audit" } | json requestURI, verb, code="responseStatus.code", user="user.username" | line_format "{{.requestURI}} {{.verb}} {{.code}} {{.user}}" | verb="delete" user!~"system:.+"
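
To make the filter logic explicit, here is a sketch that applies the same conditions to audit events in plain Python. The sample events are made up for illustration. The regex uses `fullmatch` on the assumption that LogQL label regex matchers are anchored at both ends, and a missing username is treated as an empty string.

```python
import re

SYSTEM_USER = re.compile(r"system:.+")


def is_non_system_delete(event: dict) -> bool:
    """Mirror of: verb="delete" user!~"system:.+"."""
    user = event.get("user", {}).get("username", "")
    return event.get("verb") == "delete" and SYSTEM_USER.fullmatch(user) is None


def format_line(event: dict) -> str:
    """Mirror of the line_format template: requestURI verb code user."""
    code = event.get("responseStatus", {}).get("code", "")
    user = event.get("user", {}).get("username", "")
    return f"{event.get('requestURI', '')} {event.get('verb', '')} {code} {user}"


# Hypothetical audit events, not real cluster data.
sample = [
    {"requestURI": "/api/v1/namespaces/demo/pods/web-1", "verb": "delete",
     "responseStatus": {"code": 200}, "user": {"username": "jane@example.com"}},
    {"requestURI": "/api/v1/namespaces/demo/pods/web-2", "verb": "delete",
     "responseStatus": {"code": 200},
     "user": {"username": "system:serviceaccount:kube-system:generic-garbage-collector"}},
    {"requestURI": "/api/v1/namespaces/demo/pods", "verb": "list",
     "responseStatus": {"code": 200}, "user": {"username": "jane@example.com"}},
]

matches = [format_line(e) for e in sample if is_non_system_delete(e)]
print(matches)
```

Only the first event survives: the second is a delete by a `system:` account and the third is not a delete.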

StevenBarre commented 4 days ago

Disk usage before the switch, for 2d of log retention, was 35G x3

StevenBarre commented 3 days ago

Testing logging alerts, but haven't gotten it working yet. Case opened with RH.

Some testing of queries against longer time periods shows a lot of S3 data being read, which could be an issue in PROD.