Ghostbird opened this issue 1 year ago
After completely discarding the current data, and setting up a new docker-compose with new data, the problem still persisted. Therefore it must have been a configuration error with my set-up, and completely unrelated to the FQDN change.
In the end, it turns out that the GELF HTTP Input pretends to accept, but silently drops, most packets when the Enable Bulk Receiving option is set. Oddly enough, AFAIK we had enabled that option months ago when we set up an opt-in diagnostic remote logging feature on our mobile clients, which would bulk-log the local SQLite log. The idea was that you could enable diagnostics after a crash, and it would still log the crash retroactively to Graylog.
I don't know why the problem came up now. I even tried running a previous graylog version (4.3.8), but it had the same issue.
It's also quite bizarre that it sometimes allowed a message through. Even a simple, consistently identical (barring timestamps) curl POST only came through ~2% of the time.
Actually, I'll leave this open. Feel free to close this, if that is not an error of the HTTP Gelf Input. I might just have misunderstood what the Enable Bulk Receiving option does.
Thanks for all of the details here @Ghostbird.
Is it correct that you are only seeing the issue when the Enable Bulk Receiving option is enabled? If so, is the issue only happening when attempting to send multiple messages? If so, one potential cause is the format of the multiple messages being sent. This option requires that each message be newline delimited (separated with `\n` or `\r\n`). One caveat when using `curl` is that the `-d` data parameter must be preceded with a `$` for the newlines to be sent correctly. For example:
```shell
curl -i -X POST -H 'Content-Type: application/json' -d $'{ "version": "1.1", "host": "redacted.com", "short_message": "A short message", "level": 5, "_some_info": "foo" }\r\n{ "version": "1.1", "host": "redacted.com", "short_message": "A short message", "level": 5, "_some_info": "foo" }\r\n{ "version": "1.1", "host": "redacted.com", "short_message": "A short message", "level": 5, "_some_info": "foo" }' 'localhost:12201/gelf'
```
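To confirm that the `$'...'` form really emits newline bytes (rather than the literal characters `\r` and `\n`), the payload can be inspected locally before sending. A quick sketch with a shortened hypothetical body:

```shell
# A $'...' body with two \r\n separators contains two real newline bytes.
body=$'{"m":1}\r\n{"m":2}\r\n{"m":3}'
# tr -cd '\n' deletes every byte except newlines; wc -c counts what remains.
printf '%s' "$body" | tr -cd '\n' | wc -c   # 2
```

With a plain single-quoted string the same pipeline would report 0, because `\r\n` would stay as four ordinary characters.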
Please let me know if this helps.
I also saw in the original description that the issue was happening when only sending single messages. I tested, and I was not able to reproduce that issue. But, please let me know if you are still seeing that, and I am happy to dig into it more.
Thank you so much for your investigation. Our Graylog is working well at the moment. The issue indeed occurred with the Enable Bulk Receiving option enabled. The thing that threw me for a loop was that a very small number of messages still came through correctly.
I used the cURL script shown above, which simply sends the same message every three seconds, and after five minutes around two or three messages would have made it through.
I enabled the option because I thought that one of our connected clients needed it. However, it's working fine without this option enabled. I think the client doesn't use GELF's bulk messaging protocol, but rather periodically sends queued HTTP messages on a single TCP connection.
So for us the issue is currently solved; it turned out to be a misconfiguration on our side.
I thought the option had been enabled for months, but it seems to have been activated accidentally during an FQDN change and the accompanying proxy change. I spent a lot of time figuring out that the problem was not actually caused by those changes.
It might be considered a bug that a non-delimited message is silently (no log whatsoever) dropped when this option is active. Maybe the UI should make it clear that when you enable Enable Bulk Receiving, the input becomes usable only for messages that correctly implement the bulk receiving protocol (i.e. single messages must still be delimited).
Now that I know this, if I ever run into the situation where we really need bulk receiving, I'll just open a second GELF HTTP input on a different port for clients that use the bulk protocol.
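For anyone following along: a bulk payload is just individual GELF messages joined by newlines, so a client batching queued messages before a single POST only needs to concatenate them. A minimal sketch with hypothetical messages (not Graylog code):

```shell
# Join two queued GELF messages into one newline-delimited bulk body.
msg1='{"version":"1.1","host":"app","short_message":"first"}'
msg2='{"version":"1.1","host":"app","short_message":"second"}'
bulk=$(printf '%s\n%s' "$msg1" "$msg2")
# Exactly one separator newline sits between the two messages.
printf '%s' "$bulk" | wc -l   # 1
```

The resulting `$bulk` string is what would go into the POST body of a bulk-mode input.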
Feel free to close this, because it can certainly be considered works as designed.
@Ghostbird Thanks for the additional info. I am glad the input is working now, but I would like to dig into this a bit further. Even with the bulk option enabled, the input should receive single non-delimited HTTP GELF messages successfully. If that is not happening, I agree that could potentially be a bug/issue.
I had tested with the option enabled and sending in single messages (similar to your example), but I was not able to reproduce the dropped message. I know that all is working on your side now, but if you are still able to reproduce the issue where messages seem to be dropped when the option is enabled, please let me know. We would like to try and resolve that if it is still an issue.
Here's a screenshot: on the left is my Graylog filter in live-update (once-per-second) mode; on the right is the running `curl` script mentioned earlier.
Somewhere between 8:50:56 and 8:50:59 the GELF Input was switched to bulk mode. After I disable bulk mode, incoming messages appear in the live-updated view again.
Thanks @Ghostbird. Can you please also provide the curl command you are using? I just want to be certain I test with all of the same inputs.
It's the one at the top of this conversation. Of course, with a non-redacted server URI.
We ran into the same issue with the following `curl` command:

```shell
for i in $(seq 1 100)
do
  curl -X POST -v \
    'https://logger.example.com/gelf' \
    --header 'Content-Type: application/json; charset=utf-8' \
    --data '{"short_message":"Hello there 6", "host":"example.org", "facility":"test", "_foo":"bar", "message": "Hello all there"}\n';
done
```
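One detail worth noting about the loop above: inside single quotes, the trailing `\n` is sent as the two literal characters `\` and `n`, not as a newline byte, so the body is not actually newline-terminated. A quick check of the two quoting styles, using a shortened hypothetical body:

```shell
# Single quotes keep \n literal; $'...' turns it into a real trailing newline.
single_quoted='{"x":1}\n'
ansi_quoted=$'{"x":1}\n'
printf '%s' "$single_quoted" | wc -c   # 9 bytes (ends with backslash + n)
printf '%s' "$ansi_quoted" | wc -c     # 8 bytes (ends with a newline byte)
```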
```yaml
# Kubernetes Ingress manifest
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: graylog-ingress-gelfhttp-os
  namespace: logging
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.org/client-max-body-size: "50m"
spec:
  tls:
    - hosts:
        - logger.example.com
  rules:
    - host: logger.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: graylog-service-os
                port:
                  name: gelfhttp
```
I've installed `tcpdump` into the `graylog/graylog` Docker image and captured a Wireshark file.
Of 100 messages, exactly 16 were accepted by Graylog. All 100 messages are shown in tcpdump, so the reverse proxy and load balancer are not blocking anything.
I'm quite sure that only the first request of each TCP connection was accepted.
(HTTP can reuse TCP connections and send multiple requests on one TCP connection.)
When I filter Wireshark to show only the first `POST /gelf` of each TCP connection, exactly 16 entries are shown:
When I disable bulk receiving, it works just fine.
/cc @danotorrey @Ghostbird
Are you sure the data is sent in the correct GELF bulk format? I myself kind of dismissed this, as it turned out our client didn't implement GELF bulk at all. It would just queue all the messages and then send them all individually in a big burst. I thought it was quite a terrible implementation.
As I didn't have a client that actually generated GELF bulk messages, I didn't look into this any more.
We don't use bulking at all. My example script also sends 100 individual HTTP requests instead of bulking. I expect(ed) bulking to be an optional feature, so that it doesn't matter whether I send 1 or 100 messages in one HTTP request.
As I understood it, it's really a protocol switch. Inputs can run either the normal GELF protocol, or the bulk protocol.
So sending with a non-bulk client to a bulk input won't work.
The confusing part is that the input will handle the first message apparently correctly even if you've mismatched protocols.
But what's the difference between a single message in a bulk with a `\n` at the end and multiple messages? I don't see any.
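In terms of delimiting there is indeed none: splitting on newlines yields one message either way. A small sketch with hypothetical payloads:

```shell
# A newline-terminated single message splits into one message; three into three.
one=$'{"m":1}\n'
three=$'{"m":1}\n{"m":2}\n{"m":3}\n'
printf '%s' "$one" | grep -c '{'     # 1
printf '%s' "$three" | grep -c '{'   # 3
```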
Expected Behavior
This morning my Graylog was working as intended. It's a docker-compose set-up running behind an Apache TLS reverse proxy. I had to update the FQDN of the server. I added a new virtual host with the new FQDN, and also kept the old virtual host. Then I updated my services' log configurations to point to the new FQDN for logging. I expected things to keep working.
Current Behavior
I noticed that suddenly I was receiving very few log messages, but not none. I reverted the changes of the services' configuration, and started testing the new FQDN using cURL:
I noticed that of the cURL messages, sent every 3 seconds, I received approximately one per minute on average.
I reverted my changes and went back to using the old FQDN. However, the problem persisted.
Currently I can look at the input and see that the number of messages received is 0/minute while the data-received counters are slowly increasing, and when I look in the Apache access logs I see that it responds with HTTP 202 for tens of log messages per second, none of which show up in Graylog.
Everything I can see at the HTTP layer works perfectly fine. Each message is acknowledged with an HTTP 202. The Graylog GUI running behind the same proxy, including Apache OIDC (mod_auth_openidc) + trusted-proxy authentication, works fine. Nevertheless, over the last 24 hours I have received an average of 40 log messages an hour in Graylog, even though the actual number sent by the servers is around 6,000 an hour.
If I stop the HTTP GELF input, I immediately see the stream of POSTs logging errors.
Here's the full trace7 apache2 output of a proxied request:
Context
Running in docker-compose on Ubuntu VPS
Your Environment