canonical / data-platform-workflows

Reusable GitHub Actions workflows used by the Data Platform team
Apache License 2.0

Add the service {snap, k8s} app logging to our test run outputs #133

Open phvalguima opened 9 months ago

phvalguima commented 9 months ago

Currently, we collect the juju debug-log, which provides insight into the charm states. We should also collect logs from the actual services: systemd logs for VM charms and pod logs for k8s charms.

That would show, for example, whether units were restarting at a given time, as discussed in this bug.

One example of how to capture systemd logs: https://github.com/canonical/data-platform-workflows/compare/main...add-more-logs

I think we need an equivalent command for k8s as well.
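One possible k8s equivalent, as a minimal sketch rather than a settled design: loop over the pods of a namespace and save each pod's logs to a file. The function name `collect_pod_logs`, its arguments, and the output layout are assumptions made for illustration; only standard `kubectl get` and `kubectl logs` options are used.

```shell
# Sketch: dump the logs of every pod in one namespace into a local directory.
# collect_pod_logs, its arguments, and the file naming are hypothetical.
collect_pod_logs() {
    local namespace="$1" out_dir="$2" pod
    mkdir -p "$out_dir"
    # "-o name" prints one "pod/<name>" entry per line.
    kubectl get pods --namespace "$namespace" -o name | while read -r pod; do
        # Keep going if a pod has no logs yet (e.g. it is still pending).
        kubectl logs --namespace "$namespace" --all-containers=true --timestamps "$pod" \
            > "$out_dir/${pod#pod/}.log" 2>&1 < /dev/null || true
    done
}
```

For example, `collect_pod_logs my-model ./pod-logs` would write one `.log` file per pod, assuming the Juju model name matches the namespace name.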

github-actions[bot] commented 9 months ago

https://warthogs.atlassian.net/browse/DPE-3397

marcoppenheimer commented 9 months ago

On this, we should have a configurable path. Some snaps disable syslog but keep their service logs in a dedicated file.
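A rough sketch of what a configurable path could look like: the caller passes any number of log paths and the existing ones are copied into an output directory. The function name `collect_log_files` and its argument order are assumptions for illustration, not something the workflows provide today.

```shell
# Sketch: copy extra service log files or directories into out_dir.
# collect_log_files and its argument order are hypothetical.
collect_log_files() {
    local out_dir="$1" path
    shift
    mkdir -p "$out_dir"
    for path in "$@"; do
        if [ -e "$path" ]; then
            cp -r "$path" "$out_dir/"
        else
            # A missing path is reported but does not fail the collection.
            echo "skipping missing log path: $path" >&2
        fi
    done
}
```

Each charm's tests would then pass their own paths, for example the dedicated log file of their snap.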

phvalguima commented 9 months ago

@marcoppenheimer, in that case I think we should actually improve sos: https://github.com/sosreport/sos

That tool is a standard for support cases (it even integrates natively with support portals through the --case-id option). Each team should contribute its own plugin: https://github.com/sosreport/sos/tree/main/sos/report/plugins

Right now, we can already enable the systemd and kubernetes plugins and capture their output on the main host:

$ sudo snap install sosreport --channel=latest/stable --classic
$ mkdir -p ./sosreport
$ sudo sos report \
    --only-plugins kubernetes,systemd \
    --enable-plugins kubernetes \
    -k kubernetes.describe=true -k kubernetes.podlogs=true -k kubernetes.all=true \
    --batch \
    --clean \
    --tmp-dir=./sosreport \
    -z gzip

Where each option means:

    --only-plugins kubernetes,systemd
        These are the only plugins to run.
    --enable-plugins kubernetes
        Make sure the kubernetes plugin is enabled; it is disabled by default.
    -k kubernetes.describe=true -k kubernetes.podlogs=true
        Options to capture pod logs and describe the k8s namespaces.
    --batch
        No prompting.
    --clean
        Obfuscate data.
    --tmp-dir=./sosreport
        Where the report output goes.
    -z gzip
        Generates a tar.gz output.

For LXCs, though, we need something else. The LXD plugin captures the container console and the LXD logs, but it does not capture the journal logs inside each container. Therefore, the best approach here would be to run the commands above in each container, capture all the logs, and download them to a local folder on the host.
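As a lighter-weight variant of that idea, the sketch below skips sos inside the containers and reads each container's journal directly through `lxc exec`, writing one file per container on the host. This is a swapped-in simplification, not the full per-container sos run described above, and the function name `collect_lxd_journals` and the file naming are assumptions.

```shell
# Sketch: save the systemd journal of every LXD container to out_dir on the host.
# collect_lxd_journals and the "<name>.journal.log" naming are hypothetical.
collect_lxd_journals() {
    local out_dir="$1" name
    mkdir -p "$out_dir"
    # "-c n --format csv" prints one container name per line.
    lxc list --format csv -c n | while read -r name; do
        # stdin is redirected so lxc exec cannot swallow the list of names;
        # a stopped container only produces an error message in its file.
        lxc exec "$name" -- journalctl --no-pager \
            > "$out_dir/$name.journal.log" 2>&1 < /dev/null || true
    done
}
```

Running sos inside each container instead would mean installing it there and pulling the resulting archive back with `lxc file pull`.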

For kubernetes, you need to have kubectl installed and ~/.kube/config set up correctly before running that command.

There is no problem in running the kubernetes plugin without having k8s at all. It will simply not capture any output.
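The kubectl and kubeconfig requirement could be checked up front, so that an empty kubernetes section in the report is explained in the job output. This is only a sketch: the function name `kubernetes_plugin_ready` is hypothetical, and it assumes `KUBECONFIG`, when set, points at a single file rather than a colon-separated list.

```shell
# Sketch: report whether the kubernetes plugin has what it needs to collect data.
# kubernetes_plugin_ready is a hypothetical helper name.
kubernetes_plugin_ready() {
    # Assumption: KUBECONFIG, if set, names a single file.
    local kubeconfig="${KUBECONFIG:-$HOME/.kube/config}"
    if ! command -v kubectl > /dev/null 2>&1; then
        echo "kubectl not found; the kubernetes plugin will capture nothing" >&2
        return 1
    fi
    if [ ! -r "$kubeconfig" ]; then
        echo "kubeconfig not readable: $kubeconfig" >&2
        return 1
    fi
}
```

Because the sos command runs under sudo, the kubeconfig that matters may be the one visible to root, which this check does not cover.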


In my opinion, @carlcsaposs-canonical should run the commands above by default, at least whenever a test fails. That should come alongside juju-crashdump.
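A minimal sketch of the "whenever a test fails" part, assuming the test command is launched from a shell step: run it, and only on a non-zero exit call a collection hook, while preserving the original exit code. Both function names are hypothetical, and `collect_logs` simply wraps the sos report options from the annotated list above.

```shell
# Hypothetical hook: wraps the sos report invocation shown earlier in this comment.
collect_logs() {
    sudo sos report \
        --only-plugins kubernetes,systemd \
        --enable-plugins kubernetes \
        -k kubernetes.describe=true -k kubernetes.podlogs=true \
        --batch --clean --tmp-dir=./sosreport -z gzip
}

# Sketch: run the given command and collect logs only if it fails.
run_and_collect_on_failure() {
    local status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        # A failed collection must not hide the original test failure.
        collect_logs || echo "log collection failed" >&2
    fi
    return "$status"
}
```

Usage would look like `run_and_collect_on_failure tox run -e integration`, where the tox environment name is only an example.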

Then we can also add more plugins (kafka, opensearch, etc.) as well as extra options. But I'd leave it to each team to extend sosreport and then contribute the matching change to dp-workflows here.

Wdyt?