Closed: andrewkroh closed this issue 2 years ago.
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
It looks like my theory was wrong. Deferring the call until after the init phase did allow the process to progress, but the same call hangs in other places, leading to other blocked goroutines.

The call that is hanging is `NetUserGetInfo("NT AUTHORITY", "SYSTEM", 10, buffer)`. The service runs as the system account. It's odd to see the server name parameter of the call populated with "NT AUTHORITY"; it would seem that this could lead to an attempt to make an RPC call.
The code in `lookupFullName` seems to assume that `lookupFullNameServer` will fail fast if the suspected domain name ("NT AUTHORITY") is not a real server. In this particular case the computer is not a member of a domain, but it is a member of a workgroup.
We found two other mentions of similar issues when using `NetUserGetInfo`:

> My issue wasn't in a multi-domain environment, but rather an environment that has no NetBIOS resolution available. This caused the initial timeout for `NetUserGetInfo`.

> ... session gets timed out after 60 seconds waiting for Delve to start up ... on a domain-joined computer that is currently not connected to the domain.
We have a reliable means of reproducing. If the Windows host has been hardened by disabling the "TCP/IP NetBIOS Helper" service, then the lookup takes up to 60 seconds to time out:

```powershell
Stop-Service lmhosts
Set-Service -name lmhosts -startupType disabled
```
We have a tester that only calls `NetUserGetInfo`. Source is available at https://gist.github.com/andrewkroh/851a9db304401068d2ba121d5b39e3c9#file-netusergetinfo-go.

When the NetBIOS cache is cleared, this reliably takes 60 seconds before the call fails. This is why https://pkg.go.dev/os/user#Current takes up to 60 seconds the first time it is invoked when running as a Windows service account. Any future calls to `user.Current()` receive a cached response from the Go package (not from the OS).
```
PS C:\> nbtstat.exe -R
Successful purge and preload of the NBT Remote Cache Name Table.

PS C:\> .\netusergetinfo.exe -server-name "NT AUTHORITY" -username "SYSTEM"
2022/06/03 18:21:10 Calling NetUserGetInfo("NT AUTHORITY", "SYSTEM", 10, <buf>)
2022/06/03 18:22:10 Exited NetUserGetInfo
2022/06/03 18:22:10 Error:The RPC server is unavailable.
```
The draft patch in https://github.com/elastic/beats/pull/31823 (commit 491bd97d0e30a7bd5ab1f1b25fa1f6ee2e289ae4) was tested and it does fix the service timeout issue on Windows machines where "TCP/IP NetBIOS Helper" is disabled.
The next steps are to prepare a patch for the `main` branch, get it reviewed, then backport the fix into the currently maintained branches. For `main`, we'll need to open a patch against elastic-agent-libs because npipe was moved out of Beats.
Remove unnecessary calls to `user.Current()` from the critical path of Windows service initialization:

- The dependency on https://github.com/elastic/glog was removed. It was calling `user.Current()` from an `init()` function.
- libbeat's main initialization code was changed to make the `user.Current()` call asynchronously from the main initialization goroutine.
- libbeat's npipe (named pipe) listener was modified to not require `user.Current()` because it only needed the current SID.
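The asynchronous approach above can be sketched as follows. The names (`lookupUser`, `startUserLookup`) are hypothetical and the actual libbeat change differs in detail; the idea is simply to kick off the lookup in a goroutine during startup and only block when the result is actually needed:

```go
package main

import (
	"fmt"
	"time"
)

// lookupUser stands in for user.Current(), which can block for up to
// 60 seconds on Windows when the NetBIOS name lookup times out.
func lookupUser() string {
	time.Sleep(100 * time.Millisecond)
	return "SYSTEM"
}

// startUserLookup starts the lookup without blocking the caller and
// returns a function that waits for (and then caches) the result.
func startUserLookup() func() string {
	ch := make(chan string, 1)
	go func() { ch <- lookupUser() }()
	var cached string
	var done bool
	return func() string {
		if !done {
			cached = <-ch // blocks only if the lookup has not finished yet
			done = true
		}
		return cached
	}
}

func main() {
	getUser := startUserLookup() // returns immediately; init continues
	fmt.Println("service initialization proceeds without blocking")
	fmt.Println("user:", getUser())
}
```

Note the returned closure is not safe for concurrent use; a production version would guard it (e.g. with `sync.Once`), but the simplification keeps the control flow visible.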
These are updated ~daily (assuming there are no failures). So it may take a day before the builds include these changes. You can check the date or the commit returned by the API to see if they are updated.
Users report that Beats (including at least Filebeat and Winlogbeat) sometimes time out when started as a Windows service.

We captured a core dump of the process when it timed out. It looks like the `github.com/golang/glog` package is making syscalls during the init phase that block the application from initializing. My theory is that because the calls happen so early in the process lifecycle, something is not yet initialized, and this leads to the problem.
Evidence
Multiple core dumps were taken while the service was attempting to start by using `procdump.exe -ma -n 14 -s 5 winlogbeat` (every 5s over 70s). The same goroutine 1 stack was found in the first dump and in one taken ~50 seconds later.

Solutions
Let's see if we can stop this call from happening during the init phase.
So what requires this package? The libbeat persistent queue requires dgraph, and github.com/dgraph-io/ristretto uses glog.
This is not our first problem with glog. Previously, in https://github.com/elastic/beats/pull/27351, we forked it to fix a problem with it polluting the global FlagSet in the stdlib `flag` package. Another problem is that it starts a goroutine during init that cannot be stopped and therefore consumes some resources.

I propose we fork github.com/dgraph-io/ristretto/z and replace glog with the stdlib log. It will be a drop-in replacement given that the package only uses `Fatal` and `Fatalf` (which is a terrible idea for a library because it calls `os.Exit()`).

What is the k8s.io/klog/v2 in the goroutine dump?
k8s.io/klog/v2 is a fork of github.com/golang/glog adopted by k8s libraries to fix some of the previously mentioned problems (global flags, removal of the user lookup syscalls from init). It still starts a goroutine, though.
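The glog replacement proposed above could look like this minimal shim. This is a sketch, not the actual patch; `exitFunc` is introduced here only so the exit behavior can be exercised without terminating the process:

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// exitFunc is swappable so Fatal's exit behavior can be intercepted in
// tests without killing the process. Production code leaves it as os.Exit.
var exitFunc = os.Exit

// Fatal and Fatalf mirror the only two glog functions that
// github.com/dgraph-io/ristretto/z uses, backed by the stdlib logger.
func Fatal(args ...interface{}) {
	log.Print(args...)
	exitFunc(1)
}

func Fatalf(format string, args ...interface{}) {
	log.Printf(format, args...)
	exitFunc(1)
}

func main() {
	code := -1
	exitFunc = func(c int) { code = c } // intercept exit for demonstration
	Fatalf("buffer too small: %d bytes", 16)
	fmt.Println("would have exited with code", code)
}
```

Because the shim pulls in only the stdlib, it starts no goroutines and makes no syscalls at init, which is the whole point of dropping glog.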