I confirm the same scenario with v2.0.8 and also with a build from the latest HEAD - it leaks handles.
Running on 2 machines: on the first, after 4hrs, I can see 7350+ handles open and growing; on another machine, after 17hrs, 26k and growing.
There are indeed plenty of fluent-bit local TCP sockets open... this is just a snippet from my netstat:
TCP 127.0.0.1:6010 127.0.0.1:6011 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6011 127.0.0.1:6010 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6013 127.0.0.1:6014 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6014 127.0.0.1:6013 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6028 127.0.0.1:6029 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6029 127.0.0.1:6028 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6030 127.0.0.1:6031 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6030 127.0.0.1:6032 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6031 127.0.0.1:6030 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6032 127.0.0.1:6030 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6045 127.0.0.1:6046 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6046 127.0.0.1:6045 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6052 127.0.0.1:6053 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6053 127.0.0.1:6052 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6056 127.0.0.1:6057 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6057 127.0.0.1:6056 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6067 127.0.0.1:6068 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6068 127.0.0.1:6067 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6081 127.0.0.1:6082 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6082 127.0.0.1:6081 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6087 127.0.0.1:6088 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6088 127.0.0.1:6087 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6091 127.0.0.1:6092 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6092 127.0.0.1:6091 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6100 127.0.0.1:6101 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6101 127.0.0.1:6100 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6110 127.0.0.1:6111 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6111 127.0.0.1:6110 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6117 127.0.0.1:6118 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6118 127.0.0.1:6117 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6135 127.0.0.1:6136 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6136 127.0.0.1:6135 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6139 127.0.0.1:6140 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6140 127.0.0.1:6139 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6169 127.0.0.1:6170 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6170 127.0.0.1:6169 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6177 127.0.0.1:6178 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6178 127.0.0.1:6177 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6188 127.0.0.1:6189 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6189 127.0.0.1:6188 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6191 127.0.0.1:6192 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6192 127.0.0.1:6191 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6193 127.0.0.1:6194 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6194 127.0.0.1:6193 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6197 127.0.0.1:6198 ESTABLISHED [fluent-bit.exe]
TCP 127.0.0.1:6198 127.0.0.1:6197 ESTABLISHED
It goes up to port 64322. That is a huge leak.
PS. In my scenario, I use the tail input with the loki output.
A quick PowerShell script to get all the relevant info:
while ($true) {
    $svcName = "fluent-bit";
    $svcProc = (Get-Process -Name $svcName);
    # time since the process started
    $svcUptime = ((Get-Date).Subtract($svcProc.StartTime));
    # number of TCP connections owned by the process
    $svcSockets = (Get-NetTCPConnection -OwningProcess $svcProc.Id).Count;
    Write-Host "[$($svcName)] PID: $($svcProc.Id), Handles: $($svcProc.Handles), Sockets: $($svcSockets), Uptime: $($svcUptime.Days)d $($svcUptime.Hours)h $($svcUptime.Minutes)m $($svcUptime.Seconds)s";
    Start-Sleep -Seconds 5;
}
At the moment I am debugging, trying to track down what exactly is leaking these localhost sockets.
My current stats:
Server_TypeA_01: [fluent-bit] PID: 15776, Handles: 34687, Sockets: 51627, Uptime: 0d 12h 43m 54s
Server_TypeA_02: [fluent-bit] PID: 20520, Handles: 30973, Sockets: 46053, Uptime: 0d 12h 43m 50s
Server_TypeA_03: [fluent-bit] PID: 27512, Handles: 31496, Sockets: 46839, Uptime: 0d 12h 43m 49s
Server_TypeA_04: [fluent-bit] PID: 28928, Handles: 31981, Sockets: 47565, Uptime: 0d 12h 43m 49s
Server_TypeA_05: [fluent-bit] PID: 2892, Handles: 25014, Sockets: 37125, Uptime: 0d 12h 57m 8s
Server_TypeB_01: [fluent-bit] PID: 3256, Handles: 1691, Sockets: 2220, Uptime: 0d 12h 43m 50s
Server_TypeB_02: [fluent-bit] PID: 6420, Handles: 1673, Sockets: 2193, Uptime: 0d 12h 43m 49s
Server_TypeC_01: [fluent-bit] PID: 1348, Handles: 393, Sockets: 276, Uptime: 0d 0h 32m 30s
A short run of samples on Server_TypeA_05 shows the velocity at which this is happening:
[fluent-bit] PID: 2892, Handles: 25363, Sockets: 37635, Uptime: 0d 13h 8m 0s
[fluent-bit] PID: 2892, Handles: 25371, Sockets: 37650, Uptime: 0d 13h 8m 27s
[fluent-bit] PID: 2892, Handles: 25381, Sockets: 37665, Uptime: 0d 13h 8m 54s
[fluent-bit] PID: 2892, Handles: 25397, Sockets: 37686, Uptime: 0d 13h 9m 22s
[fluent-bit] PID: 2892, Handles: 25403, Sockets: 37695, Uptime: 0d 13h 9m 49s
[fluent-bit] PID: 2892, Handles: 25413, Sockets: 37710, Uptime: 0d 13h 10m 17s
Some graphs of the last 36hrs:
Thanks for looking into this!
The candidate is the monkey backend for the flb_log module -- I suspect that somewhere resources are not correctly freed and it just opens another socket pair.
/** Create two new sockets that are connected to each other.
On Unix, this simply calls socketpair(). On Windows, it uses the
loopback network interface on 127.0.0.1, and only
AF_INET,SOCK_STREAM are supported.
(This may fail on some Windows hosts where firewall software has cleverly
decided to keep 127.0.0.1 from talking to itself.)
Parameters and return values are as for socketpair()
*/
EVENT2_EXPORT_SYMBOL
int evutil_socketpair(int d, int type, int protocol, evutil_socket_t sv[2]);
/** Do platform-specific operations as needed to make a socket nonblocking.
    @param sock The socket to make nonblocking
    @return 0 on success, -1 on failure
 */
EVENT2_EXPORT_SYMBOL
int evutil_make_socket_nonblocking(evutil_socket_t sock);
Used as a "backend" for flb_pipe on Windows:
/*
 * Building on Windows means that the Monkey library (lib/monkey) and its
 * core runtime have been built with 'libevent' backend support; that
 * library provides an abstraction to create socketpairs.
 *
 * Creating a pipe in Fluent Bit on Windows means creating a socket pair.
 */
int flb_pipe_create(flb_pipefd_t pipefd[2])
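To make that concrete, here is a minimal sketch (my own illustration under the assumptions above, not the actual fluent-bit source; the demo_* name is hypothetical) of how a "pipe" created through this backend ends up as two connected loopback TCP sockets on Windows, which is why every leaked pipe shows up as two \Device\Afd handles:
/* Illustrative sketch only: a pipe-like channel emulated with
 * evutil_socketpair(). Assumes Winsock has already been initialised
 * (WSAStartup) on Windows. */
#ifdef _WIN32
#include <winsock2.h>
#else
#include <sys/socket.h>
#endif
#include <event2/util.h>

static int demo_pipe_create(evutil_socket_t pipefd[2])
{
#ifdef _WIN32
    /* On Windows, libevent connects two TCP sockets over 127.0.0.1 */
    return evutil_socketpair(AF_INET, SOCK_STREAM, 0, pipefd);
#else
    /* On Unix this is simply socketpair() */
    return evutil_socketpair(AF_UNIX, SOCK_STREAM, 0, pipefd);
#endif
}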
On my end, I use the "info" log level, and yes, lots of handle++ && socket++ happens only when log lines are produced.
This is a sample log output with the counters before and after:
02/02/2023 02:00:46 [fluent-bit] PID: 18112, Handles: 1858, Sockets: 2388, Uptime: 0d 0h 43m 33s
-> Handles += 4, Sockets += 6 (3 pairs?)
02/02/2023 02:00:48 [fluent-bit] PID: 18112, Handles: 1862, Sockets: 2394, Uptime: 0d 0h 43m 35s
[2023/02/02 02:00:45] [ info] flb_dns_ares_socket: socket(af=2, type=2, protocol=0) = fd=8060
[2023/02/02 02:00:45] [ info] flb_dns_ares_connect: connect(fd=8060, addr=1.1.1.1:13568)
[2023/02/02 02:00:45] [ info] flb_dns_ares_close: (fd=8060)
[2023/02/02 02:00:45] [ info] flb_dns_ares_close: socket close(fd=8060)
[2023/02/02 02:00:45] [ info] flb_dns_ares_socket: socket(af=2, type=2, protocol=0) = fd=8060
[2023/02/02 02:00:45] [ info] flb_dns_ares_connect: connect(fd=8060, addr=1.1.1.1:13568)
[2023/02/02 02:00:45] [ info] flb_net_tcp_connect: socket(af=2, type=1, protocol=0) = fd=8064
[2023/02/02 02:00:45] [ info] net_connect_async: connect(async, fd=8064, addr=3.3.3.3:7180)
[2023/02/02 02:00:45] [ info] [net] connection #8064 in process to 2.2.2.2:3100
[2023/02/02 02:00:45] [ info] flb_dns_ares_close: (fd=8060)
[2023/02/02 02:00:45] [ info] flb_dns_ares_close: socket close(fd=8060)
[2023/02/02 02:00:45] [ info] flb_net_tcp_connect: socket(af=2, type=1, protocol=0) = fd=8060
[2023/02/02 02:00:45] [ info] net_connect_async: connect(async, fd=8060, addr=3.3.3.3:7180)
[2023/02/02 02:00:45] [ info] [net] connection #8060 in process to 2.2.2.2:3100
[2023/02/02 02:00:45] [ info] [io] connection OK
[2023/02/02 02:00:45] [ info] [http_client] not using http_proxy for header
[2023/02/02 02:00:45] [ info] [io] connection OK
[2023/02/02 02:00:45] [ info] [http_client] not using http_proxy for header
[2023/02/02 02:00:45] [ info] prepare_destroy_conn: socket close(fd=8060)
[2023/02/02 02:00:45] [ info] prepare_destroy_conn: socket close(fd=8064)
Congratulations on the discovery so far. The Wireshark trace shows two connections - bidirectional - appearing every time something happens. The leakage always appears in increments of two. This is quite a likely candidate indeed.
Ah, flb_pipe is also used for in_tail... that would explain multiple socket pairs being created. I need to track down whether they're properly closed everywhere they are used.
It seems an mk_event used by flb_upstream_conn / flb_net_tcp_connect is allocating a socketpair but not releasing it, because the fds are 0... so there is corruption or misassignment in transit after the job is done...
_mk_event_del: evutil_closesocketpair?? (mk_event_ctx=0000021F22AFB2F0, ev_map->pipe_fds={0,0}
_mk_event_del: evutil_closesocketpair?? (mk_event_ctx=0000021F22AFB2F0, ev_map->pipe_fds={0,0}
_mk_event_del: evutil_closesocketpair?? (mk_event_ctx=0000021F22AFB2F0, ev_map->pipe_fds={0,0}
I have made some breakthrough, but I am not fully convinced of the change, as there is not much documentation of the libevent and monkey mk_event internals. The downside is that I am seeing "write: no error" messages on timed-out event handlers (cb_timeout fired), which is also strange, because these events were supposed to be one-shot (c-ares async DNS queries).
After running for 1hr, it keeps the number of sockets (and handles) at the very same, stable level, and I can see that the whole communication works fine and logs are delivered as they should be. When there is a new request to create a socketpair, it creates it and destroys it when done, so the socket count goes up and down.
02/03/2023 01:01:22 [fluent-bit] PID: 17544, Handles: 142, Sockets: 0, Uptime: 0d 0h 0m 13s
02/03/2023 01:01:22 [fluent-bit] PID: 17544, Handles: 245, Sockets: 51, Uptime: 0d 0h 0m 13s
02/03/2023 01:01:23 [fluent-bit] PID: 17544, Handles: 292, Sockets: 87, Uptime: 0d 0h 0m 14s
02/03/2023 01:01:24 [fluent-bit] PID: 17544, Handles: 323, Sockets: 111, Uptime: 0d 0h 0m 14s
02/03/2023 01:01:24 [fluent-bit] PID: 17544, Handles: 348, Sockets: 129, Uptime: 0d 0h 0m 15s
02/03/2023 01:01:25 [fluent-bit] PID: 17544, Handles: 388, Sockets: 183, Uptime: 0d 0h 0m 16s
02/03/2023 01:01:25 [fluent-bit] PID: 17544, Handles: 417, Sockets: 225, Uptime: 0d 0h 0m 16s
02/03/2023 01:01:26 [fluent-bit] PID: 17544, Handles: 444, Sockets: 267, Uptime: 0d 0h 0m 17s
02/03/2023 01:01:27 [fluent-bit] PID: 17544, Handles: 474, Sockets: 312, Uptime: 0d 0h 0m 18s
02/03/2023 01:01:28 [fluent-bit] PID: 17544, Handles: 492, Sockets: 339, Uptime: 0d 0h 0m 18s
02/03/2023 01:01:28 [fluent-bit] PID: 17544, Handles: 492, Sockets: 339, Uptime: 0d 0h 0m 19s
02/03/2023 01:01:29 [fluent-bit] PID: 17544, Handles: 492, Sockets: 339, Uptime: 0d 0h 0m 20s
02/03/2023 01:01:30 [fluent-bit] PID: 17544, Handles: 492, Sockets: 339, Uptime: 0d 0h 0m 21s
(...)
02/03/2023 02:01:57 [fluent-bit] PID: 17544, Handles: 501, Sockets: 339, Uptime: 0d 1h 0m 47s
02/03/2023 02:01:57 [fluent-bit] PID: 17544, Handles: 501, Sockets: 339, Uptime: 0d 1h 0m 48s
02/03/2023 02:01:58 [fluent-bit] PID: 17544, Handles: 501, Sockets: 339, Uptime: 0d 1h 0m 49s
02/03/2023 02:01:59 [fluent-bit] PID: 17544, Handles: 501, Sockets: 339, Uptime: 0d 1h 0m 50s
diff --git a/lib/monkey/mk_core/mk_event_libevent.c b/lib/monkey/mk_core/mk_event_libevent.c
index dc47f4371..530d40dd8 100644
--- a/lib/monkey/mk_core/mk_event_libevent.c
+++ b/lib/monkey/mk_core/mk_event_libevent.c
@@ -188,6 +188,13 @@ static inline int _mk_event_add(struct mk_event_ctx *ctx, evutil_socket_t fd,
flags |= EV_WRITE;
}
+ if (event->data != NULL) {
+ struct ev_map *in_ev_map = event->data;
+
+ ev_map->pipe[0] = in_ev_map->pipe[0];
+ ev_map->pipe[1] = in_ev_map->pipe[1];
+ }
+
/* Compose context */
event->fd = fd;
event->type = type;
@@ -323,8 +323,9 @@ static inline int _mk_event_timeout_create(struct mk_event_ctx *ctx,
event->fd = fd[0];
event->type = MK_EVENT_NOTIFICATION;
event->mask = MK_EVENT_EMPTY;
+ event->data = ev_map;
- _mk_event_add(ctx, fd[0], MK_EVENT_NOTIFICATION, MK_EVENT_READ, data);
+ _mk_event_add(ctx, fd[0], MK_EVENT_NOTIFICATION, MK_EVENT_READ, event);
event->mask = MK_EVENT_READ;
return fd[0];
I also have a bunch of other buf/handle/mem leak fixes that I will probably add as a PR anyway.
Could somebody from project owners comment on this one?
It seems the c-ares async DNS query (flb_net_getaddrinfo) is never going to destroy the event completely, and thus never destroys the pipe and the event object itself, unless I am missing something... I can provide some extra debug logs added in useful places with object pointers, fds, etc.
Good news, indeed! @danielodievich and I will be happy to do some additional testing with the original scenarios we encountered this with, whenever it's ready.
Actually, I have another, much easier workaround - disable the async DNS query entirely... it also works well and keeps the socket/handle count stable...
diff --git a/src/flb_network.c b/src/flb_network.c
index f3a0fd2a4..afac72171 100644
--- a/src/flb_network.c
+++ b/src/flb_network.c
@@ -1193,7 +1193,7 @@ flb_sockfd_t flb_net_tcp_connect(const char *host, unsigned long port,
struct flb_connection *u_conn)
{
int ret;
- int use_async_dns;
+ int use_async_dns = 0;
char resolver_initial;
flb_sockfd_t fd = -1;
char _port[6];
@@ -1216,7 +1216,9 @@ flb_sockfd_t flb_net_tcp_connect(const char *host, unsigned long port,
/* fomart the TCP port */
snprintf(_port, sizeof(_port), "%lu", port);
+#if 0
use_async_dns = is_async;
+#endif
if (u_conn->net->dns_resolver != NULL) {
resolver_initial = toupper(u_conn->net->dns_resolver[0]);
I think that's a fair workaround until the async DNS / event internals are resolved.
If somebody wants to play with experimental changes -- please find them on my tmp branch https://github.com/MrTomasz/fluent-bit/commits/mr.t/tmp-2.0.9rc
Please note that PR #6782 does not fix the handle/socket leak.
Using the sync-DNS workaround:
02/03/2023 11:28:52 [fluent-bit] PID: 22916, Handles: 499, Sockets: 339, Uptime: 0d 8h 43m 3s
There are no more leaks, and memory usage is at a stable level.
I cloned the https://github.com/MrTomasz/fluent-bit/commits/mr.t/tmp-2.0.9rc branch, compiled it, and ran it on my Windows Server 2016 environment. I confirm that the handles no longer leak.
Happy to hear that. Maybe, as an interim solution, a config option to select async/sync mode for DNS queries could be integrated before the correct fix for async mode is made, @edsiper?
On my end -- it still works very well, and memory consumption + socket/handle count are at a stable level:
That option exists. It's not intended to be widely used, but if you set net.dns.resolver to legacy, fluent-bit will not issue asynchronous DNS lookups.
Another way to verify whether the asynchronous DNS client is part of the issue would be setting net.dns.mode to TCP and verifying that fluent-bit is indeed contacting the DNS server using TCP, because in that case timers are not used.
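For reference, this is roughly how those properties can be set on an output section (illustrative only; everything except the two net.dns.* properties is a placeholder, not taken from the setups above):
[OUTPUT]
    Name              loki
    Match             *
    Host              loki.example.com
    Port              3100
    # use the synchronous (legacy) resolver instead of the async c-ares client
    net.dns.resolver  legacy
    # or keep the async resolver but force DNS over TCP, which avoids the timers
    # net.dns.mode    TCP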
I haven't been able to look into this issue, but what you discovered sounds very interesting. The async DNS client uses a timer when the lookup is performed through UDP in order to be able to enforce timeouts (you can check that in flb_net_getaddrinfo), and those timers (provided by flb_sched_timer_cb_create) use mk_event_timeout_create, which creates a socket pair (the read end is added to the event loop) and a libevent timer which invokes a callback that writes to the socketpair, causing the event loop to detect the activity. But as I was checking the source for reference, I noticed something I didn't see before:
When _mk_event_timeout_create creates a timer, it allocates an ev_map structure and saves both sockets in the pipe entries. It also sets the read side of the socket pair as the event's "file descriptor".
When the timer callback is invoked, it closes its side of the socketpair and releases the ev_map instance.
When _mk_event_del is called, it tries to check the ev_map pipe values to determine if it needs to close them, but that is a different instance and it doesn't have those values.
And there is no code in the scheduler to close the file descriptor (because it should not care about it, since that should be abstracted by mk_event_timeout_destroy).
So I'm thinking that if the timer ticks (timeout detected), the write end gets closed but the read end doesn't, and if the timer does not tick, neither of them is closed.
So, since _mk_event_add adds the ev_map instance to the event here, what we might want to do in _mk_event_timeout_create, right after the event is added, is fill in both pipe values; and I think this will probably mean we will want to remove the code that closes the socket in cb_timeout. But I could be wrong, since this is what I just realized while reading the code with a fresh pair of eyes and not something I verified through debugging.
Please take this with a grain of salt. I hope this helps and please feel free to ping me here or on slack, I think I'll be able to make time for this rather soon.
Thanks a lot @MrTomasz, your insight was really helpful.
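To illustrate the lifecycle described above in isolation, here is a small self-contained sketch (my own code, not fluent-bit/monkey internals; the demo_* names are hypothetical): the timeout "pipe" is a socket pair, the timer callback only ever closes its own write end, so whoever tears the event down must still close the read end, and the write end too if the timer never fired:
/* Sketch of the bookkeeping rule discussed above; not the actual fix. */
#ifdef _WIN32
#include <winsock2.h>
#else
#include <sys/socket.h>
#endif
#include <event2/util.h>

struct demo_timeout {
    evutil_socket_t pipe[2];   /* [0] read end (event loop), [1] write end (timer) */
    int write_end_closed;
};

static int demo_timeout_create(struct demo_timeout *t)
{
    t->write_end_closed = 0;
#ifdef _WIN32
    /* two connected loopback TCP sockets == two \Device\Afd handles */
    return evutil_socketpair(AF_INET, SOCK_STREAM, 0, t->pipe);
#else
    return evutil_socketpair(AF_UNIX, SOCK_STREAM, 0, t->pipe);
#endif
}

static void demo_timeout_fired(struct demo_timeout *t)
{
    /* the callback signals the event loop and closes only its own end */
    send(t->pipe[1], "x", 1, 0);
    evutil_closesocket(t->pipe[1]);
    t->write_end_closed = 1;
}

static void demo_timeout_destroy(struct demo_timeout *t)
{
    /* teardown must close the read end, and the write end if the timer
     * never ticked; skipping this is exactly the leak pattern observed */
    evutil_closesocket(t->pipe[0]);
    if (!t->write_end_closed) {
        evutil_closesocket(t->pipe[1]);
    }
}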
Any updates on this issue's status? Will it be fixed in future versions of fluent-bit, @leonardo-albertovich?
Oh I'm so glad you brought this up, I've tried to look for this issue but couldn't find it.
There are no updates at the moment; the only workaround is setting the DNS resolver to legacy mode. It will be addressed, however, I don't have an ETA, since I am deep into something else and can't afford to switch contexts without risking messing up either of them, and that's the last thing I'd like to do.
Rest assured that this is high on my personal priority list but as I said, sadly I can't get to it until I finish what I'm focusing on.
I am working on submitting a fix for this issue. If any maintainers would like to assign this issue to me feel free.
@braydonk assigned
The fix provided in Monkey should be available in the master and 2.0 branches now, meaning I assume it will be part of 2.1 and 2.0.12 (a maintainer should verify that). I verified my fix using the setup suggested in the first comment on this issue, and it stopped leaking handles.
For those interested in the technical details, I provided a step-by-step of the handle leak in the mentioned Monkey issue.
The fix will be released in the versions @braydonk mentioned.
Is this part of some release distribution?
It's in 2.1, which was released a few hours ago, and it will be in 2.0.12 whenever that gets released.
Bug Report
Describe the bug
Fluent-Bit 2.0.8.0 is leaking \Device\Afd file handles on Windows Server 2016, Windows Server 2019 and Windows 10 when diagnostic logging is sent between threads using local sockets. The rate of leakage is proportional to the level of diagnostic logging and the logging of errors.
We recently decided to test the 2.0.8 version of this agent and discovered that it behaves differently from 1.9.6, with continuously increasing memory consumption. Left unchecked, it leads to a fluent-bit crash and/or general system instability.
Our Fluent-Bit Usage
Observe uses Fluent-Bit to do log shipping as part of our [Host Monitoring App]. The agent.ps1 install script is currently pinned to pull fluent-bit 1.9.6; it also installs Telegraf and OSQuery, although the issue described here is specific to fluent-bit. During a default install, this fluent-bit.conf is used, with the <<observe_host_name>> and <<observe_host_name>> variables replaced with the values provided to the script. We use the http output plugin to send data to our ingest endpoints.
To Reproduce
1. Install Fluent-Bit 2.0.8.0.
2. Replace C:\Program Files\fluent-bit\conf\fluent-bit.conf with this configuration (both the ingest endpoint and the authentication token are intentionally bad; this increases the number of logged failures and the rate of \Device\Afd file handle leakage):
3. Start/restart the fluent-bit service.
4. Start monitoring the \Process(fluent-bit)\Handle Count counter (see the typeperf example below) and observe it steadily climbing as time goes by.
Expected behavior
Fluent-Bit should not leak memory or destabilize the operating system.
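For step 4, one quick way to watch that counter from a command prompt is typeperf, which ships with Windows (the 10-second sampling interval is just an example):
typeperf "\Process(fluent-bit)\Handle Count" -si 10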
Your Environment
Diagnostic Capture Details
I captured starting fluent-bit from scratch and running it for 180 seconds, including these data sets:
Perfview command used:
Script that I used with Sysinternals Handle to snapshot a detailed list of what was outstanding in the process every 10 seconds:
Analysis of Run 1
During Run 1, I had the correct URL and authentication token in the fluent-bit.conf OUTPUT section for sending data to Observe. Results are in GDrive/fluent-bit-handle-leak-Run1.zip.
Perfmon counters: The \Process(fluent-bit)\Handle Count counter in \Run1\OBSERVEWS2016_20230127-000007\Performance Counter.blg climbs from 273 to 317.
Handle list: \Run1\fluent-bit-handles-run-1.xlsx, based on \Run1\handles.csv, shows the growth of those handles by type:
Fluent Log: \Run1\fluent.log shows what was written to its log.
Wireshark Log: \Run1\fluent-bit-trace.pcapng.gz captures local loopback and external adapter traffic.
Trace Analysis: I loaded \Run1\fluent-bit-trace.etl.zip into Windows Performance Analyzer.
In the Memory\Handles\Outstanding Count by Process report, the view was set up with:
- Closing Process column to the left of the gold line
- Object Name column to the left of the gold line
- Create Time and Close Time columns
- fluent-bit.exe
- Handle Type to File
- Closing Process to Unknown, to indicate handles that are still not closed by the time the trace ends
- Object Name to \Device\Afd, to focus on the intra-process socket communication
In the Communications\TcpIp Events\Events by Count report:
- Process Name column to the left of the gold line as first
- fluent-bit.exe
You can see the steady growth of outstanding \Device\Afd handles starting at trace time 6.48:
Zooming in to trace time 11.55-12.00, I see some sort of TCP conversation; I chose the 211 in the…
Correlating this to the time in Wireshark, I can see that this socket had this conversation:
The payload in packet 522 shows that this is clearly a diagnostic log:
These correspond to lines 39-54 in fluent.log, although some other plugins interleave:
although some other plugins interleave:Analysis of Run 2 On Run 2, I had intentionally broke the URL and provided bad authentication token in the fluent-bit.conf OUTPUT section for sending data to Observe. This made fluent-bit error out more and do more diagnostic logging of errors and retries.
Results in GDrive/fluent-bit-handle-leak-Run2.zip
Perfmon counters The
\Process(fluent-bit)\Handle Count
counter in\Run2\OBSERVEWS2016_20230127-000008\Performance Counter.blg
climbs from 273 to 428. This is steeper of a climb than Run 2Handle list The
\Run2\fluent-bit-handles-run-1.xlsx
based on\Run2\handles.csv
shows growth of those handles by type, again at higher rate:Fluent Log
\Run2\fluent.log
shows what was written to its log.Wireshark Log
\Run2\fluent-bit-trace.pcapng.gz
captures local loopback and external adapter traffic.Trace Analysis I loaded
\Run2\fluent-bit-trace.etl.zip
into Windows Performance Analyzer.Same configuration as in Run 2 shows high rate of
\Device\Afd
handle creation without disposing. There is also some counts of Tcp Errors, I think thoseSpeculation on Root Cause I think something in the diagnostic logging that cross-posts messages between various threads and components via the local loopback socket is allocating the handle but is forgetting to dispose them. But it only does it for debug style and/or error messages. Higher observed rate of leakage when more errors are logging support this hypothesis.