abrensch / brouter

configurable OSM offline router with elevation awareness, Java + Android
MIT License

RFC: Improvement to Logging #702

Open mjaschen opened 4 months ago

mjaschen commented 4 months ago

Following #699 I'd like to discuss some more improvements to the log output of BRouter server.

There are three things I'd like to address:

  1. separation between stdout and stderr outputs
  2. structure of the log messages
  3. introduction of (optional) structured logging

These are just some ideas scribbled down. Before going any deeper, I'd like to hear your thoughts on this.

Separation between stdout and stderr outputs

Currently BRouter emits messages to stdout and stderr.

Sometimes messages related to each other are printed to different streams (e.g. error message printed to stdout, followed by a stacktrace printed to stderr).

A clear separation between regular log messages (i.e. routing requests or profile uploads) and informational/error messages (e.g. startup messages, usage information, errors, stacktraces, debug output if enabled) would be beneficial.
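The convention proposed above could look like the following sketch (a hypothetical helper, not existing BRouter code): access-log records go to stdout, while diagnostics, including the stack trace that belongs to an error message, stay together on stderr.

```java
import java.io.PrintStream;

// Hypothetical helper illustrating the proposed stream separation.
public class LogStreams {

    // Regular request log records (routing requests, profile uploads) -> stdout.
    static void access(String record) {
        System.out.println(record);
    }

    // Startup messages, usage info, errors, debug output -> stderr.
    // The stack trace is printed to the SAME stream as its error message,
    // so related lines are never split across stdout and stderr.
    static void error(String message, Throwable t) {
        PrintStream err = System.err;
        err.println(message);
        if (t != null) t.printStackTrace(err);
    }
}
```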

Structure of the log messages

I've tried to describe the structure of the log messages in the README.

I think there's some room for improvement here, mainly removing redundant information which has to be filtered out in later processing anyway.

I'd like to propose a structure that's very close to the current format, with some small changes:

<timestamp> <session count> <client ip> <processing time> <request method> <request url> <http version> <response status> <response size>
| Field | Description | Changes |
| --- | --- | --- |
| timestamp | the current timestamp in ISO 8601 format | unchanged |
| session count | "new" for a newly detected client, or a counter (positive integer) | no more capping at 999, remove padding |
| client ip | IP address of the client | remove ip= prefix |
| processing time | processing duration in ms | remove ms= prefix |
| request method | HTTP method | |
| request URL | the plain request path + query string | |
| http version | HTTP version from the request header, e.g. HTTP/1.1 | |
| response status | HTTP response status code, e.g. 200, 500 | |
| response size | size of the response in bytes | |

An example of a log message would look like this:

2024-05-14T21:11:26.499+02:00 new 127.0.0.1 189 GET /brouter?lonlats=13.377485,52.516247%7C13.351221,52.515004&profile=trekking&alternativeidx=0&format=geojson HTTP/1.1 200 1234
2024-05-14T21:11:33.229+02:00 1 127.0.0.1 65 GET /brouter?lonlats=13.377485,52.516247%7C13.351221,52.515004&profile=trekking&alternativeidx=0&format=geojson HTTP/1.1 200 1234

All fields are separated by a single space character. The encoding of the query string depends on the client, so we should ensure that the values logged are always in URL-encoded (percent-encoding, RFC3986) form. This way, no additional spaces have to be considered when parsing log messages.
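Because every field is guaranteed to be free of spaces under these rules, a log line can be parsed with a plain split on the space character. A hypothetical parser sketch (class and field names are mine, not BRouter's):

```java
// Hypothetical parser for the proposed 9-field plain-text log line.
// Since the query string is always percent-encoded, no field can contain
// a space, and a simple split(" ") is sufficient.
public class AccessLogLine {
    public final String timestamp, session, clientIp, method, url, httpVersion;
    public final long processingMs, status, responseBytes;

    public AccessLogLine(String line) {
        String[] f = line.split(" ");
        if (f.length != 9) {
            throw new IllegalArgumentException("expected 9 fields, got " + f.length);
        }
        timestamp = f[0];
        session = f[1];                      // "new" or a positive counter
        clientIp = f[2];
        processingMs = Long.parseLong(f[3]); // processing duration in ms
        method = f[4];
        url = f[5];                          // path + percent-encoded query
        httpVersion = f[6];
        status = Long.parseLong(f[7]);
        responseBytes = Long.parseLong(f[8]);
    }
}
```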

Introduction of optional structured logging

For easier processing of log messages, I'd like to propose introducing structured logging, starting with the regular log output at stdout. This would make it easy to parse the log messages and extract any relevant information.

To keep backward compatibility, this should be an optional feature.

The switch to structured logging from the legacy format would be controlled by an environment variable.

Another way would be to introduce a new command line option, but as BRouter's argument parsing is somewhat restricted (exact order is required, new arguments can only be appended), I'd prefer the environment variable approach.
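A minimal sketch of such a switch (the variable name BROUTER_LOG_FORMAT is a placeholder I made up, not a proposed or agreed name):

```java
// Sketch: selecting the log format via an environment variable.
// BROUTER_LOG_FORMAT is a hypothetical name used for illustration only.
public class LogFormatSwitch {
    enum Format { LEGACY, JSON }

    static Format fromEnv(String value) {
        // The legacy format stays the default, so existing setups
        // keep working without any configuration change.
        if (value == null) return Format.LEGACY;
        return value.equalsIgnoreCase("json") ? Format.JSON : Format.LEGACY;
    }

    public static void main(String[] args) {
        Format f = fromEnv(System.getenv("BROUTER_LOG_FORMAT"));
        System.err.println("log format: " + f);
    }
}
```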

JSON is the de facto standard format for structured logging nowadays.

I'd like to propose a format following this example:

{
    "schema": "brouter-1.0",
    "timestamp": "2021-07-07T12:00:00Z",
    "client": {
        "new": true,
        "ip": "127.0.0.1"
    },
    "session_count": 42,
    "processing_time": 23,
    "request": {
        "query": {
            "lonlats": [
                [8.0, 50.0],
                [8.1, 50.1]
            ],
            "format": "geojson",
            "profile": "trekking",
            "alternativeidx": 0,
            "nogos": [
                [8.3, 50.3, 400],
                [8.4, 50.4, 200]
            ]
        }
    }
}

Each log message is a single JSON document. Line breaks are shown only for readability; the server would emit the entire log record on one line.

More fields can be added as needed without breaking existing log processing.

The schema field is used to indicate the version of the structured logging format (helpful if breaking changes need to be introduced).

All other fields are just a different representation of the existing log message structure described above. lonlats and nogos arrays contain the points in the same order as in the request query string.

The exact format would be described in a JSON schema file.
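Since the record is flat, it could be assembled without an external JSON dependency. A minimal sketch, where the query is kept as a raw string instead of the parsed lonlats/nogos object purely for brevity (field names follow the example above):

```java
// Minimal sketch of emitting the proposed record as one line of JSON
// without any external library. Only the query-as-raw-string variant is
// shown; parsing lonlats/nogos into arrays is omitted for brevity.
public class JsonLogRecord {

    // Escape quotes, backslashes and control characters for JSON strings.
    static String esc(String s) {
        StringBuilder b = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') b.append('\\').append(c);
            else if (c < 0x20) b.append(String.format("\\u%04x", (int) c));
            else b.append(c);
        }
        return b.toString();
    }

    static String format(String timestamp, boolean newClient, String ip,
                         int sessionCount, long processingMs, String rawQuery) {
        return "{\"schema\":\"brouter-1.0\""
             + ",\"timestamp\":\"" + esc(timestamp) + "\""
             + ",\"client\":{\"new\":" + newClient + ",\"ip\":\"" + esc(ip) + "\"}"
             + ",\"session_count\":" + sessionCount
             + ",\"processing_time\":" + processingMs
             + ",\"request\":{\"query\":\"" + esc(rawQuery) + "\"}}";
    }
}
```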

msb48 commented 4 months ago

> separation between stdout and stderr outputs

+1

> remove HTTP method (GET), remove HTTP version (HTTP/1.1)

I've seen and appreciated those entries with other servers too, so I'd prefer not to remove them. In fact, it might make sense to add two more standard entries:

> remove path components (/brouter?)

That would be too much of a cleanup IMO. There might very well be different HTTP endpoints served from the same binary in the future (perhaps even today? - haven't checked), e.g. a v2, or additional helper services.

> IP address of client

https://brouter.de/privacypolicy.html (as linked from https://brouter.de/brouter-web) returns 404. For GDPR compliance it should be clearly stated for what reasons and for how long that data is retained, though. See https://bikerouter.de for how it could be done properly.

Before that is fixed, storing the IP should be disabled immediately. IOW, logging the IP should be off by default, or it should be replaced by a static placeholder. Logging the real IP should be an option (config or env), to be enabled only by admins who know what they are doing (see paragraph above).

Is logging the IP even needed, or would a hash be enough?

abrensch commented 4 months ago

Hello Marcus + msb48,

restored the https://brouter.de/privacypolicy.html ...

It's difficult for me to comment on the RFC if you don't say what you plan to do with the logs. As I said, I am just doing some access statistics and quality-of-service assurance for the brouter.de instance. Here, the summer peak is about 200 parallel sessions (so the capping at 999 is theoretical so far) and about 400,000 requests per day.

Yesterday, however, there were 900,000 requests, and looking at the log shows what happened:

18.05.24 11:47 127 ip=147.45.47.133 ms=1 -> GET /brouter/suspects/Austria/Vienna/confirmed/843514405793336318/fixed?ndays=30-1))%20OR%20938=(SELECT%20938%20FROM%20PG_SLEEP(15))-- HTTP/1.1

Lessons learned here: 1) there's more than just standard routing requests in that log, 2) intrusion detection needs unformatted information, 3) the IP address is needed to add it to the blacklist.

However, for context you need to know that there are also the nginx access/error logs, which contain the IP and the URL as well. On brouter.de, they are set up for daily rotate/gzip with a 2-week archive.

The BRouter log, up to now, is rotated manually (which does not work well).

So it depends: if you want a long-term archive of access statistics (and I would like to have one), that should be some hybrid that works in conjunction with the nginx log, keeping the BRouter log free of IPs for GDPR compliance.

Hard for me to believe that you will be happy here with structured JSON when talking about 100 million requests per year. Another aspect: if the JSON comes with a library dependency, then it comes at a price... (up to now we create (Geo)JSON low-level.)

So maybe Marcus you can comment on the intended usage of the log?

regards, Arndt

mjaschen commented 4 months ago

Hi @abrensch,

my motivation for introducing structured logging is to analyze more efficiently. The structured logs can be directly imported into modern log management systems like Elasticsearch and then searched, analyzed, aggregated, or visualized with frontends like Kibana, Graylog, Grafana, etc.

Examples:

All of this is possible with plain text logs too, but it requires more effort to parse the logs and extract the relevant information.

I'm using Elasticsearch, Graylog and Kibana for other projects. Depending on the time of day, hundreds to thousands of messages are ingested every second on some small virtual servers. So performance is definitely not an issue. 100M messages would require only a few gigabytes of storage, which is negligible nowadays.

As the log format is relatively simple, building the messages manually would not be a hard task if no further external dependencies are allowed.

Of course it'd be easily possible to pick up the existing log messages and convert them to structured logs by an external tool or script. Integration into BRouter is not necessarily important to me, but I thought it might also be useful for other server operators.

Re IP addresses/GDPR:

Logging, identifying, and blocking rogue clients are better done at least one level above BRouter, for example, in the reverse proxy, before the request reaches the BRouter server.

This way, the IP address can easily be printed in a hashed form to BRouter's log messages (even the session pool could work with the hashes instead of plain IP addresses).

In my development setup, I've already implemented logging hashes: SHA256(IP address + User-Agent), then I extract the first 10 characters. This way, each client remains unique but is not traceable to an IP address:

2024-05-19T10:46:18.273+02:00 new a9f14018d5 167 GET /?lonlats=13.377485,52.516247%7C13.351221,52.515004&profile=trekking&alternativeidx=0&format=geojson HTTP/1.1
2024-05-19T10:46:27.256+02:00   1 a9f14018d5 90 GET /?lonlats=13.377485,52.516247%7C13.351221,52.515004&profile=trekking&alternativeidx=0&format=geojson HTTP/1.1
2024-05-19T10:46:27.762+02:00   1 a9f14018d5 97 GET /?lonlats=13.377485,52.516247%7C13.351221,52.515004&profile=trekking&alternativeidx=0&format=geojson HTTP/1.1
2024-05-19T10:46:28.157+02:00   1 a9f14018d5 53 GET /?lonlats=13.377485,52.516247%7C13.351221,52.515004&profile=trekking&alternativeidx=0&format=geojson HTTP/1.1
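The hashing scheme described above can be sketched like this (the exact concatenation order is my assumption from the description; note that without a per-deployment salt, the small IPv4 address space makes offline brute-forcing of such hashes feasible, so a salt would be worth considering):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch of the described pseudonymisation: SHA-256 over IP + User-Agent,
// truncated to the first 10 hex characters. Each client stays unique in
// the log, but no plain IP address is stored.
public class ClientHash {
    static String clientId(String ip, String userAgent) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] d = md.digest((ip + userAgent).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : d) hex.append(String.format("%02x", b));
            return hex.substring(0, 10);
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }
}
```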

@msb48, thank you for your remarks regarding the logged HTTP information. It makes total sense. I'll update the table in the first post accordingly.