Graylog2 / graylog2-server


HTTP GELF Input silently drops most messages #13959

Open Ghostbird opened 1 year ago

Ghostbird commented 1 year ago

Expected Behavior

This morning my Graylog was working as intended. It's a docker-compose set-up running behind an Apache TLS reverse proxy. I had to update the FQDN of the server, so I added a new virtual host with the new FQDN and kept the old one. Then I updated my services' log configurations to point to the new FQDN for logging. I expected things to keep working.

Current Behavior

I noticed that suddenly I was receiving very few log messages, but not none. I reverted the changes to the services' configuration and started testing the new FQDN using cURL:

while true
  do curl -i -X POST -H 'Content-Type: application/json' -d '{ "version": "1.1", "host": "redacted.com", "short_message": "A short message", "level": 5, "_some_info": "foo" }' 'https://redacted.example.org/gelf'
  sleep 3
done

I noticed that of the cURL messages, sent every 3 seconds, I received approximately one per minute on average.

I reverted my changes and went back to using the old FQDN. However, the problem persisted.

Currently I can look at the input and see that the number of messages received is 0/minute, while the data-received counters are slowly increasing. When I look in the Apache access logs, I see that it responds with an HTTP 202 for tens of log messages per second, none of which show up in Graylog.

Everything I can see at the HTTP layer works perfectly fine. Each message is acknowledged with an HTTP 202. The Graylog GUI running behind the same proxy, including Apache OIDC (mod authz_oidc) + trusted proxy authentication, works fine. Nevertheless, over the last 24 hours I have received an average of 40 log messages an hour in Graylog, even though the servers actually send around 6,000 an hour.

If I stop the GELF HTTP input, I immediately see the stream of POSTs start logging errors.

Here's the full Apache trace7 output of a proxied request:

[Tue Nov 15 08:41:41.483634 2022] [core:trace5] [pid 2809430:tid 140008686089984] protocol.c(708): [client 52.233.195.70:61163] Request received from client: POST /gelf HTTP/1.1
[Tue Nov 15 08:41:41.483733 2022] [http:trace4] [pid 2809430:tid 140008686089984] http_request.c(436): [client 52.233.195.70:61163] Headers received from client:
[Tue Nov 15 08:41:41.483748 2022] [http:trace4] [pid 2809430:tid 140008686089984] http_request.c(439): [client 52.233.195.70:61163]   Host: REDACTED
[Tue Nov 15 08:41:41.483758 2022] [http:trace4] [pid 2809430:tid 140008686089984] http_request.c(439): [client 52.233.195.70:61163]   Cache-Control: no-cache
[Tue Nov 15 08:41:41.483767 2022] [http:trace4] [pid 2809430:tid 140008686089984] http_request.c(439): [client 52.233.195.70:61163]   Content-Type: application/json; charset=utf-8
[Tue Nov 15 08:41:41.483775 2022] [http:trace4] [pid 2809430:tid 140008686089984] http_request.c(439): [client 52.233.195.70:61163]   Content-Length: 455
[Tue Nov 15 08:41:41.483797 2022] [proxy:trace2] [pid 2809430:tid 140008686089984] mod_proxy.c(687): [client 52.233.195.70:61163] AH03461: attempting to match URI path '/gelf' against prefix '/gelf' for proxying
[Tue Nov 15 08:41:41.483809 2022] [proxy:trace1] [pid 2809430:tid 140008686089984] mod_proxy.c(773): [client 52.233.195.70:61163] AH03464: URI path '/gelf' matches proxy handler 'proxy:http://127.0.0.1:12202/gelf'
[Tue Nov 15 08:41:41.483833 2022] [authz_core:debug] [pid 2809430:tid 140008686089984] mod_authz_core.c(845): [client 52.233.195.70:61163] AH01628: authorization result: granted (no directives)
[Tue Nov 15 08:41:41.483844 2022] [core:trace3] [pid 2809430:tid 140008686089984] request.c(310): [client 52.233.195.70:61163] request authorized without authentication by access_checker_ex hook: /gelf
[Tue Nov 15 08:41:41.483860 2022] [proxy_http:trace1] [pid 2809430:tid 140008686089984] mod_proxy_http.c(62): [client 52.233.195.70:61163] HTTP: canonicalising URL //127.0.0.1:12202/gelf
[Tue Nov 15 08:41:41.483895 2022] [proxy:trace2] [pid 2809430:tid 140008686089984] proxy_util.c(2145): [client 52.233.195.70:61163] http: found worker http://127.0.0.1:12202/gelf for http://127.0.0.1:12202/gelf
[Tue Nov 15 08:41:41.483908 2022] [proxy:debug] [pid 2809430:tid 140008686089984] mod_proxy.c(1254): [client 52.233.195.70:61163] AH01143: Running scheme http handler (attempt 0)
[Tue Nov 15 08:41:41.483944 2022] [proxy_http:trace1] [pid 2809430:tid 140008686089984] mod_proxy_http.c(1985): [client 52.233.195.70:61163] HTTP: serving URL http://127.0.0.1:12202/gelf
[Tue Nov 15 08:41:41.483956 2022] [proxy:debug] [pid 2809430:tid 140008686089984] proxy_util.c(2341): AH00942: HTTP: has acquired connection for (127.0.0.1)
[Tue Nov 15 08:41:41.483982 2022] [proxy:debug] [pid 2809430:tid 140008686089984] proxy_util.c(2395): [client 52.233.195.70:61163] AH00944: connecting http://127.0.0.1:12202/gelf to 127.0.0.1:12202
[Tue Nov 15 08:41:41.483998 2022] [proxy:debug] [pid 2809430:tid 140008686089984] proxy_util.c(2604): [client 52.233.195.70:61163] AH00947: connected /gelf to 127.0.0.1:12202
[Tue Nov 15 08:41:41.484036 2022] [proxy:trace2] [pid 2809430:tid 140008686089984] proxy_util.c(2886): HTTP: reusing backend connection 127.0.0.1:41896<>127.0.0.1:12202
[Tue Nov 15 08:41:41.484051 2022] [core:trace6] [pid 2809430:tid 140008686089984] core_filters.c(519): [remote 127.0.0.1:12202] will flush because of FLUSH bucket
[Tue Nov 15 08:41:41.484776 2022] [proxy_http:trace3] [pid 2809430:tid 140008686089984] mod_proxy_http.c(1361): [client 52.233.195.70:61163] Status from backend: 202
[Tue Nov 15 08:41:41.484801 2022] [proxy_http:trace4] [pid 2809430:tid 140008686089984] mod_proxy_http.c(1016): [client 52.233.195.70:61163] Headers received from backend:
[Tue Nov 15 08:41:41.484813 2022] [proxy_http:trace4] [pid 2809430:tid 140008686089984] mod_proxy_http.c(1039): [client 52.233.195.70:61163] content-length: 0
[Tue Nov 15 08:41:41.484826 2022] [proxy_http:trace4] [pid 2809430:tid 140008686089984] mod_proxy_http.c(1039): [client 52.233.195.70:61163] connection: keep-alive
[Tue Nov 15 08:41:41.484841 2022] [proxy_http:trace3] [pid 2809430:tid 140008686089984] mod_proxy_http.c(1724): [client 52.233.195.70:61163] start body send
[Tue Nov 15 08:41:41.484853 2022] [proxy:debug] [pid 2809430:tid 140008686089984] proxy_util.c(2356): AH00943: http: has released connection for (127.0.0.1)
[Tue Nov 15 08:41:41.484874 2022] [http:trace3] [pid 2809430:tid 140008686089984] http_filters.c(1125): [client 52.233.195.70:61163] Response sent with status 202, headers:
[Tue Nov 15 08:41:41.484885 2022] [http:trace5] [pid 2809430:tid 140008686089984] http_filters.c(1134): [client 52.233.195.70:61163]   Date: Tue, 15 Nov 2022 07:41:41 GMT
[Tue Nov 15 08:41:41.484895 2022] [http:trace5] [pid 2809430:tid 140008686089984] http_filters.c(1137): [client 52.233.195.70:61163]   Server: Apache/2.4.41 (Ubuntu)
[Tue Nov 15 08:41:41.484905 2022] [http:trace4] [pid 2809430:tid 140008686089984] http_filters.c(955): [client 52.233.195.70:61163]   content-length: 0
[Tue Nov 15 08:41:41.484935 2022] [proxy_http:trace2] [pid 2809430:tid 140008686089984] mod_proxy_http.c(1870): [client 52.233.195.70:61163] end body send

Context

Running in docker-compose on Ubuntu VPS


Ghostbird commented 1 year ago

Solution

After completely discarding the current data and setting up a new docker-compose stack with fresh data, the problem still persisted. Therefore it must have been a configuration error in my set-up, completely unrelated to the FQDN change.

In the end, it turns out that the GELF HTTP input pretends to accept messages, but silently drops most of them when the Enable Bulk Receiving option is set. Oddly enough, AFAIK we had enabled that option months ago when we set up an opt-in diagnostic remote logging feature on our mobile clients, which would bulk-log the local SQLite log. The idea was that you could enable diagnostics after a crash, and it would still log the crash retroactively to Graylog.

I don't know why the problem came up now. I even tried running a previous Graylog version (4.3.8), but it had the same issue.

It's also quite bizarre that it sometimes allowed a message through. Even a simple, consistently identical (barring timestamps) curl POST only came through ~2% of the time.

Ghostbird commented 1 year ago

Actually, I'll leave this open. Feel free to close it if you don't consider this an error in the GELF HTTP input; I might just have misunderstood what the Enable Bulk Receiving option does.

danotorrey commented 1 year ago

Thanks for all of the details here @Ghostbird.

Is it correct that you are only seeing the issue when the Enable Bulk Receiving option is enabled? If so, is the issue only happening when attempting to send multiple messages? If so, one potential cause is the format of the multiple messages being sent. This option requires that each message be newline delimited (separated with \n or \r\n). One caveat when using curl is that the -d data parameter must be preceded with a $ for the newlines to be sent correctly. For example:

curl -i -X POST -H 'Content-Type: application/json' -d $'{ "version": "1.1", "host": "redacted.com", "short_message": "A short message", "level": 5, "_some_info": "foo" }\r\n{ "version": "1.1", "host": "redacted.com", "short_message": "A short message", "level": 5, "_some_info": "foo" }\r\n{ "version": "1.1", "host": "redacted.com", "short_message": "A short message", "level": 5, "_some_info": "foo" }' 'localhost:12201/gelf'
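
For reference, the difference the $ makes can be checked directly in the bytes the shell hands to curl (a quick sketch; plain single quotes keep \n as two literal characters, while bash's ANSI-C quoting $'...' expands it into a real newline byte):

# Plain single quotes: '\' and 'n' are passed through literally, so the body stays one long line.
printf '%s' '{"a":1}\n{"b":2}' | od -c

# ANSI-C quoting: bash expands \n into an actual newline byte before curl sees it,
# which is the delimiter the bulk receiving option expects between messages.
printf '%s' $'{"a":1}\n{"b":2}' | od -c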

Please let me know if this helps.

I also saw in the original description that the issue was happening when only sending single messages. I tested, and I was not able to reproduce that issue. But, please let me know if you are still seeing that, and I am happy to dig into it more.

Ghostbird commented 1 year ago

Thank you so much for your investigation. Our Graylog is working well at the moment. The issue did indeed occur with the Enable Bulk Receiving option enabled. The thing that threw me for a loop was that a very small number of messages still seemed to come through correctly.

I used the cURL script shown above, which simply sends the same message every three seconds, and after five minutes around two or three messages would have made it through.

I enabled the option because I thought that one of our connected clients needed it. However, it's working fine without this option enabled. I think that client doesn't use GELF's bulk messaging protocol, but rather periodically sends queued HTTP messages on a single TCP connection.

So for us the issue is currently solved; it turned out to be a misconfiguration on our side.

I thought the option had been enabled for months, but it seems to have been activated accidentally during the FQDN change and the accompanying proxy change. I spent a lot of time figuring out that the problem was not actually caused by those changes.

It might be considered a bug that a non-delimited message is silently (with no log whatsoever) dropped when this option is active. Maybe the UI should make it clear that when you enable Enable Bulk Receiving, the input becomes usable only for messages that correctly implement the bulk receiving protocol (i.e. single messages must still be delimited).
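
If that is the case, a single message with a trailing newline should presumably still be accepted in bulk mode; something like this could be used to check it (untested sketch, reusing the redacted endpoint from my script above):

# Hypothetical check: one message, newline-terminated, against a bulk-mode input.
curl -i -X POST -H 'Content-Type: application/json' \
    -d $'{ "version": "1.1", "host": "redacted.com", "short_message": "single delimited message", "level": 5 }\n' \
    'https://redacted.example.org/gelf'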

Now that I know this, if I ever run into the situation where we really need bulk receiving, I'll just open a second GELF HTTP input on a different port for clients that use the bulk protocol.

Feel free to close this, because it can certainly be considered to work as designed.

danotorrey commented 1 year ago

@Ghostbird Thanks for the additional info. I am glad the input is working now, but I would like to dig into this a bit further. Even with the bulk option enabled, the input should receive single non-delimited GELF HTTP messages successfully. If that is not happening, I agree that could potentially be a bug/issue.

I had tested with the option enabled and sending in single messages (similar to your example), but I was not able to reproduce the dropped message. I know that all is working on your side now, but if you are still able to reproduce the issue where messages seem to be dropped when the option is enabled, please let me know. We would like to try and resolve that if it is still an issue.

Ghostbird commented 1 year ago

Here's a screenshot; on the left is my Graylog filter in live-update (once-per-second) mode, and on the right is the running curl script mentioned earlier.

[screenshot]

Somewhere between 8:50:56 and 8:50:59 the GELF input was switched to bulk mode. After I disabled bulk mode, incoming messages appeared in the live-updated view again.

danotorrey commented 1 year ago

Thanks @Ghostbird. Can you please also provide the curl command you are using? I just want to be certain I test with all of the same inputs.

Ghostbird commented 1 year ago

It's the one at the top of this conversation. Of course, with a non-redacted server URI.

perryflynn commented 10 months ago

We ran into the same issue with the following curl command:

for i in $(seq 1 100)
do 
    curl -X POST -v \
        'https://logger.example.com/gelf' \
        --header 'Content-Type: application/json; charset=utf-8' \
        --data '{"short_message":"Hello there 6", "host":"example.org", "facility":"test", "_foo":"bar", "message": "Hello all there"}\n'; 
done

Environment

# Kubernetes Ingress Manifest
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: graylog-ingress-gelfhttp-os
  namespace: logging
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.org/client-max-body-size: "50m"
spec:
  tls:
    - hosts:
      - logger.example.com
  rules:
  - host: logger.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: graylog-service-os
            port:
              name: gelfhttp

Only the first request of a connection is accepted

I've installed tcpdump into the graylog/graylog Docker image and created a Wireshark capture file.

Of the 100 messages, exactly 16 were accepted by Graylog. All 100 messages show up in tcpdump, so the reverse proxy and load balancer are not blocking anything.

I'm quite sure that only the first request of each TCP connection was accepted.

(HTTP can reuse TCP connections and send multiple requests on one TCP connection)

When I filter Wireshark to show only the first POST /gelf of each TCP connection, exactly 16 entries are shown:

[screenshot]

When I disable bulk receiving, it works just fine.
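
One way to test the one-request-per-connection theory directly would be something like this (hypothetical sketch, assuming the input is reachable on 127.0.0.1:12202 from inside the container; curl reuses a single TCP connection when given several URLs in one invocation, while separate invocations each open a new connection):

# Hypothetical reuse test (port 12202 assumed): three POSTs over one TCP connection.
# %{num_connects} reports how many new connections each transfer had to open.
BODY='{"version":"1.1","host":"example.org","short_message":"reuse test"}'
curl -s -w 'status=%{http_code} new_connects=%{num_connects}\n' \
    -X POST -H 'Content-Type: application/json' -d "$BODY" \
    'http://127.0.0.1:12202/gelf' \
    'http://127.0.0.1:12202/gelf' \
    'http://127.0.0.1:12202/gelf'
# If only one of the three shows up in Graylog while three separate curl
# invocations all show up, connection reuse is the trigger.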

/cc @danotorrey @Ghostbird

Ghostbird commented 10 months ago

Are you sure the data is sent in the correct GELF bulk format? I myself kind of dismissed this, as it turned out our client didn't implement GELF bulk at all. It just queued all the messages and then sent them all individually in a big burst. I thought it was quite a terrible implementation.

As I didn't have a client that actually generated GELF bulk messages, I didn't look into this any more.

perryflynn commented 10 months ago

We don't use bulking at all. My example script is also sending 100 individual HTTP requests instead of bulking. I expect(ed) bulking to be an optional feature, so that it doesn't matter whether I send 1 or 100 messages in one HTTP request.

Ghostbird commented 10 months ago

As I understand it, it's really a protocol switch: an input runs either the normal GELF protocol or the bulk protocol.

So sending with a non-bulk client to a bulk input won't work.

The confusing part is that the input will handle the first message apparently correctly even if you've mismatched protocols.
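
Concretely, as I understand it, the same three messages would be sent like this in each mode (untested sketch; the newline-delimited bulk format is taken from the earlier comment, and the endpoint is the redacted one from my script):

# Non-bulk: one message per POST, three POSTs (what most clients do).
for i in 1 2 3; do
    curl -s -X POST -H 'Content-Type: application/json' \
        -d "{ \"version\": \"1.1\", \"host\": \"redacted.com\", \"short_message\": \"msg $i\" }" \
        'https://redacted.example.org/gelf'
done

# Bulk: three newline-delimited messages in a single POST (what a bulk-mode input expects).
curl -s -X POST -H 'Content-Type: application/json' \
    -d $'{ "version": "1.1", "host": "redacted.com", "short_message": "msg 1" }\n{ "version": "1.1", "host": "redacted.com", "short_message": "msg 2" }\n{ "version": "1.1", "host": "redacted.com", "short_message": "msg 3" }' \
    'https://redacted.example.org/gelf'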


perryflynn commented 10 months ago

But what's the difference between a single message in a bulk request with a \n at the end and multiple messages? I don't see any.