Closed: bogma closed this issue 5 years ago
I'd like to look into some more detailed testing to figure out what's going on. Out of curiosity, how much data are you generating with your logs? Do you have any metrics on the number of messages per second, etc.?
Also, I'm not very familiar with the TCP/UDP data inputs in Splunk, but I'm wondering if, for high-traffic logging, one of those might be a better fit than HEC. HEC does HTTP over TCP, so there is a lot of overhead, as you have found. Setting up a direct TCP connection might be the way to open up the firehose and push the log data into Splunk at full speed (see the sketch below).
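For illustration, here is a minimal sketch of what that could look like, assuming a TCP data input has already been configured on the Splunk side; the host and port are hypothetical:

```csharp
// Sketch: sending newline-delimited events over one long-lived TCP socket
// to a Splunk TCP data input, bypassing HEC and its per-request HTTP
// overhead. Host and port below are placeholders.
using System.Net.Sockets;
using System.Text;

using var client = new TcpClient("splunk.example.com", 5514);
using var stream = client.GetStream();

// One socket for the whole session; each event is a single line.
foreach (var message in new[] { "event 1", "event 2" })
{
    var bytes = Encoding.UTF8.GetBytes(message + "\n");
    stream.Write(bytes, 0, bytes.Length);
}
```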
The number of messages is not that high. There are peaks with a few hundred messages in a few seconds, nothing Splunk should have a problem with. The problem is that our Azure Service Plan limits the number of open sockets to 2000. Each message (HTTP request) opens a new socket, and that socket stays in the TIME_WAIT state for 120 seconds (defined by the operating system). Once this limit of 2000 is reached, no new socket can be opened. Setting `batchSizeCount > 0` helped because it limits the number of sockets to 4 per second (using the default BatchInterval of 250 ms set in the code). But in fact we need only one socket, and that greatly increases performance, as I have seen: with batching and a single socket I can send 15000 messages per second to my local Docker-hosted Splunk instance, while the original setup scored only about 170 messages per second in the same test.
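For reference, a sketch of enabling batching in the trace listener. It uses the `batchSizeCount` and batch-interval settings named in this thread; the exact constructor signature may differ between library versions:

```csharp
// Sketch: batching so many events share one HTTP request instead of each
// message opening its own socket. Parameter names follow the settings
// discussed above and may not match your library version exactly.
using System;
using System.Diagnostics;
using Splunk.Logging;

var trace = new TraceSource("demo");
trace.Listeners.Add(new HttpEventCollectorTraceListener(
    uri: new Uri("https://splunk.example.com:8088"), // hypothetical host
    token: "YOUR-HEC-TOKEN",
    batchInterval: 250,    // flush every 250 ms -> at most 4 requests/s
    batchSizeCount: 100)); // or flush earlier once 100 events are queued

trace.TraceEvent(TraceEventType.Information, 0, "hello from batched HEC");
```

With a 250 ms interval the listener opens at most 4 connections per second, so with a 120-second TIME_WAIT that caps out around 480 lingering sockets, comfortably under the 2000-socket limit.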
Out of curiosity, I tested another (non-C#) client. I chose a Go example, which showed the same behavior: opening a new socket for each message.
This is a workaround to prevent socket exhaustion.
We use this package in our Azure web apps to forward log messages to Splunk. We ended up losing log messages due to socket/port exhaustion, with thousands of connections stuck in the TIME_WAIT state.
Batching the log messages reduced the problem but did not solve it: each batch still opens a fresh connection, so under sustained load the TIME_WAIT sockets keep accumulating.
After many tests, I discovered that the connection is closed after each request. Even setting `httpClient.DefaultRequestHeaders.ConnectionClose` to `false` (which should be the default behavior since HTTP/1.1) was not honored by the HEC endpoint. Only switching back to HTTP/1.0 and setting the keep-alive header explicitly worked.
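A minimal sketch of that approach with a plain `HttpClient`; the endpoint URL and token are illustrative:

```csharp
// Sketch: forcing HTTP/1.0 with an explicit Keep-Alive header so the HEC
// endpoint reuses the underlying connection. URL and token are placeholders.
using System;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

var httpClient = new HttpClient();
// Should keep the connection open by default under HTTP/1.1, but the
// HEC endpoint reportedly ignores it:
httpClient.DefaultRequestHeaders.ConnectionClose = false;

var request = new HttpRequestMessage(HttpMethod.Post,
    "https://splunk.example.com:8088/services/collector")
{
    Version = HttpVersion.Version10, // fall back to HTTP/1.0
    Content = new StringContent("{\"event\":\"hello\"}", Encoding.UTF8, "application/json")
};
request.Headers.Authorization = new AuthenticationHeaderValue("Splunk", "YOUR-HEC-TOKEN");
request.Headers.Connection.Add("Keep-Alive"); // explicit keep-alive for HTTP/1.0

var response = await httpClient.SendAsync(request);
Console.WriteLine(response.StatusCode);
```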
I regard this as a workaround. In my opinion, this has to be fixed in the HEC endpoint itself.