cloudfoundry / gorouter

CF Router
Apache License 2.0
441 stars 226 forks source link

Replace zap with slog; Replace logging wrappers with slog #435

Closed hoffmaen closed 1 month ago

hoffmaen commented 2 months ago

Summary

gorouter uses a very outdated zap version from 2016 for logging. This Pull-Request introduces required code changes to use slog as the logging frontend, together with a zap handler via zapslog. Slog provides a stable frontend, while we keep using zap with its configuration as the backend.

Both the gorouter logger and the lager logger (used for initializing the UAA Token Fetcher) used zap logger wrappers. This Pull-Request replaces the logger wrapper with direct use of slog.Logger. A thin library is provided to keep it compatible with the code. The library provides the functions CreateLogger and CreateLoggerWithSource to create a logger that subsequently writes data into the data field, and optionally sets the source field in the log's root. The library also has functions to dynamically set the WriteSyncer, which is required for tests, and to dynamically set the logging level and timestamp format.

With the removal of the Logger interface, all occurrences of the FakeLogger have been replaced with an slog.Logger based TestLogger.

Additionally, some test refactoring and code reformatting has been done.

Backward Compatibility

Breaking Change? No

This Pull-Request requires https://github.com/cloudfoundry/routing-release/pull/434 for updating zap dependencies.

The zap handler configuration has been kept, so the log message format does not change. Workarounds for differing behavior of the slog.Logger frontend have been found and implemented. There are quite a lot of tests on log messages in the code, an additional end-to-end test with a deployment of this code did not show any differences compared to the old implementation.

ameowlia commented 1 month ago

I was able to validate the fixes for "tls-listener-started" and "tcp-listener-started".

😞 But I found another discrepancy, with the "endpoint-not-unregistered", "endpoint-registered", and "endpoint-unregistered" logs. All of them had the same capitalization discrepancy.

BEFORE

{
  "log_level": 1,
  "timestamp": "2024-09-25T15:17:24.679442106Z",
  "message": "endpoint-not-unregistered",
  "source": "vcap.gorouter.registry",
  "data": {
    "uri": "api.cove-3156153.z60a1041a.shepherd.lease/routing",
    "route_service_url": "",
    "backend": "10.0.1.5:3001",
    "application_id": "",
    "instance_id": "9b280680-4fcd-4e11-7a48-38b251758191",
    "server_cert_domain_san": "routing-api.service.cf.internal",
    "protocol": "http1",
    "modification_tag": {
      "guid": "",    # <--- note the lack of capitalization here for guid
      "index": 0  # <--- note the lack of capitalization here for index
    },
    "isolation_segment": "-",
    "isTLS": true
  }
}

AFTER

{
  "log_level": 1,
  "timestamp": "2024-09-27T14:36:32.792195379Z",
  "message": "endpoint-not-unregistered",
  "source": "vcap.gorouter.registry",
  "data": {
    "uri": "api.near-3161542.z8dd11304.shepherd.lease/routing",
    "route_service_url": "",
    "backend": "10.0.1.6:3001",
    "application_id": "",
    "instance_id": "4493b7d4-e400-4dd0-7da4-4c607290083d",
    "server_cert_domain_san": "routing-api.service.cf.internal",
    "protocol": "http1",
    "modification_tag": {
      "Guid": "",  # <--- note the capitalization here for Guid
      "Index": 0  # <--- note the capitalization here Index
    },
    "isolation_segment": "-",
    "isTLS": true
  }
}
ameowlia commented 1 month ago

This is how I am doing my comparison by the way...

  1. I run two environments through the same actions:
    • push an app
    • stop nats
    • start nats
    • restart the router vm
    • restart a diego cell vm
    • delete the app
  2. Then I get all of the uniq log messages jq .message LOG_FILE | sort | uniq
  3. Then I use this script to compare the log files for each one of those messages...
#!/bin/bash
messages=(
"Draining with 0 outstanding active connections"
"Draining with 24 outstanding idle connections"
"Successfully-connected-to-nats"
"Successfully-subscribed-to-routing-api-event-stream"
"endpoint-not-unregistered"
"endpoint-registered"
"endpoint-unregistered"
"error-stopping-route-services-server"
"exited"
"failed-closing-health-tls-listener"
"failed-getting-next-event"
"gorouter-draining"
"gorouter.exited"
"gorouter.started"
"gorouter.stopped"
"gorouter.stopping"
"nats-connection-disconnected"
"nats-connection-reconnected"
"nats-connection-still-disconnected"
"prune-suspended"
"retrieved-isolation-segments"
"route-registered"
"route-unregistered"
"setting-up-nats-connection"
"setting-up-routing-api"
"stopping"
"subscriber-started"
"subscriber-starting"
"subscribing-to-routing-api-event-stream"
"syncer-started"
"tcp-listener-started"
"tls-listener-started"
"validation-error"
)

old_file="LOGFILE"
new_file="LOGFILE"
old_log_file="/tmp/old-log"
new_log_file="/tmp/new-log"

for i in "${messages[@]}"
do
    echo
    echo
    echo $i

    ag "${i}" $old_file --silent --nonumbers | head -n 1 | jq . > $old_log_file
    ag "${i}" $new_file --silent --nonumbers | head -n 1 | jq . > $new_log_file
    echo "------------------"

    diff $old_log_file $new_log_file
done

If the logs match it should only be the timestamp/URI/IP that are different like this...

validation-error
------------------
<   "timestamp": "2024-09-25T11:04:49.850291246Z",
---
>   "timestamp": "2024-09-27T14:45:00.824626396Z",

or

route-unregistered
------------------
<   "timestamp": "2024-09-25T15:17:47.874868993Z",
---
>   "timestamp": "2024-09-27T14:36:35.878030926Z",

<     "uri": "uaa.cove.meow.com"
---
>     "uri": "uaa.near.meow.com"