elastic / beats


[libbeat] Windows service timeouts #31810

Closed andrewkroh closed 2 years ago

andrewkroh commented 2 years ago

Users report that Beats (including at least Filebeat and Winlogbeat) sometimes time out when started as a Windows service.

We captured a core dump of the process when it timed out. It looks like the github.com/golang/glog package is making syscalls during the init phase that block the application from initializing.

My theory is that, because the calls happen so early in the process lifecycle, something is not yet initialized, and this leads to the problem.
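For illustration, this is the general shape of the problem (a hypothetical package, not Beats code): any dependency that performs a user lookup in init forces the blocking syscall to run before main, and before the service control handler can report that startup is in progress.

```go
// Hypothetical illustration of the pattern: a package init that performs a
// user lookup, so the blocking syscall runs before main.
package blockinginit

import "os/user"

// currentUser is resolved at package init time.
var currentUser *user.User

func init() {
	// On an affected host this lookup can block for up to ~60s, which is well
	// past the default Windows service start timeout (~30s), so the service
	// control manager gives up before main ever runs.
	currentUser, _ = user.Current()
}
```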

Evidence

Multiple core dumps were taken while the service was attempting to start, using procdump.exe -ma -n 14 -s 5 winlogbeat (one dump every 5s over 70s). The same goroutine 1 stack was found in the first dump and in one taken ~50 seconds later.

PS C:\Users\andrew_kroh\Downloads> ~\go\bin\dlv core .\winlogbeat-8.2.0-windows-x86_64\winlogbeat.exe .\winlogbeat.exe_220602_040444.dmp
Type 'help' for list of commands.
(dlv) grs
* Goroutine 1 - User: /usr/local/go/src/runtime/syscall_windows.go:497 syscall.Syscall6 (0x113789a) (thread 3684) [chan receive]
  Goroutine 2 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0x110bf56) [force gc (idle) 459058h40m25.7713644s]
  Goroutine 3 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0x110bf56) [GC sweep wait]
  Goroutine 4 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0x110bf56) [GC scavenge wait]
  Goroutine 5 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0x110bf56) [finalizer wait 459058h40m25.7879775s]
  Goroutine 8 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0x110bf56) [GC worker (idle)]
  Goroutine 11 - User: /go/pkg/mod/github.com/klauspost/compress@v1.13.6/zstd/blockdec.go:212 github.com/klauspost/compress/zstd.(*blockDec).startDecoder (0x1b8b474) [chan receive 459058h40m25.8026144s]
  Goroutine 12 - User: /go/pkg/mod/github.com/klauspost/compress@v1.13.6/zstd/blockdec.go:212 github.com/klauspost/compress/zstd.(*blockDec).startDecoder (0x1b8b474) [chan receive 459058h40m25.8124078s]
  Goroutine 20 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0x110bf56) [GC worker (idle)]
  Goroutine 24 - User: /go/pkg/mod/k8s.io/klog/v2@v2.30.0/klog.go:1181 k8s.io/klog/v2.(*loggingT).flushDaemon (0x1d695aa) [chan receive]
  Goroutine 26 - User: /go/pkg/mod/github.com/elastic/glog@v1.0.1-0.20210831205241-7d8b5c89dfc4/glog.go:890 github.com/golang/glog.(*loggingT).flushDaemon (0x28fdb0a) [chan receive]
[11 goroutines]
(dlv)
(dlv) gr 1
Switched from 1 to 1 (thread 3684)
(dlv) stack
 0  0x00007fff4eec6724 in ???
    at ?:-1
 1  0x0000000001139120 in runtime.systemstack_switch
    at /usr/local/go/src/runtime/asm_amd64.s:350
 2  0x00000000010d430c in runtime.cgocall
    at /usr/local/go/src/runtime/cgocall.go:166
 3  0x000000000113789a in syscall.Syscall6
    at /usr/local/go/src/runtime/syscall_windows.go:497
 4  0x000000000113dcba in syscall.Syscall6
    at :0
 5  0x00000000011581dc in syscall.NetUserGetInfo
    at /usr/local/go/src/syscall/zsyscall_windows.go:1211
 6  0x00000000012c63f6 in os/user.lookupFullNameServer
    at /usr/local/go/src/os/user/lookup_windows.go:41
 7  0x00000000012c6666 in os/user.lookupFullName
    at /usr/local/go/src/os/user/lookup_windows.go:58
 8  0x00000000012c6b67 in os/user.newUser
    at /usr/local/go/src/os/user/lookup_windows.go:181
 9  0x00000000012c6f07 in os/user.current
    at /usr/local/go/src/os/user/lookup_windows.go:225
10  0x00000000012c79d7 in os/user.Current.func1
    at /usr/local/go/src/os/user/lookup.go:15
11  0x0000000001148912 in sync.(*Once).doSlow
    at /usr/local/go/src/sync/once.go:68
12  0x00000000012c6037 in sync.(*Once).Do
    at /usr/local/go/src/sync/once.go:59
13  0x00000000012c6037 in os/user.Current
    at /usr/local/go/src/os/user/lookup.go:15
14  0x00000000028fdeb2 in github.com/golang/glog.init.1
    at /go/pkg/mod/github.com/elastic/glog@v1.0.1-0.20210831205241-7d8b5c89dfc4/glog_file.go:63
15  0x0000000001118a2a in runtime.doInit
    at /usr/local/go/src/runtime/proc.go:6498
16  0x0000000001118971 in runtime.doInit
    at /usr/local/go/src/runtime/proc.go:6475
17  0x0000000001118971 in runtime.doInit
    at /usr/local/go/src/runtime/proc.go:6475
18  0x0000000001118971 in runtime.doInit
    at /usr/local/go/src/runtime/proc.go:6475
19  0x0000000001118971 in runtime.doInit
    at /usr/local/go/src/runtime/proc.go:6475
20  0x0000000001118971 in runtime.doInit
    at /usr/local/go/src/runtime/proc.go:6475
21  0x0000000001118971 in runtime.doInit
    at /usr/local/go/src/runtime/proc.go:6475
22  0x0000000001118971 in runtime.doInit
    at /usr/local/go/src/runtime/proc.go:6475
23  0x0000000001118971 in runtime.doInit
    at /usr/local/go/src/runtime/proc.go:6475
24  0x0000000001118971 in runtime.doInit
    at /usr/local/go/src/runtime/proc.go:6475
25  0x000000000110bb78 in runtime.main
    at /usr/local/go/src/runtime/proc.go:238
26  0x000000000113b4a1 in runtime.goexit
    at /usr/local/go/src/runtime/asm_amd64.s:1581

Solutions

Let's see if we can stop this call from happening during the init phase.

So what requires this package? The libbeat persistent cache depends on github.com/dgraph-io/badger, and badger's github.com/dgraph-io/ristretto dependency uses glog.

% go mod why github.com/golang/glog
# github.com/golang/glog
github.com/elastic/beats/v7/x-pack/libbeat/persistentcache
github.com/dgraph-io/badger/v3
github.com/dgraph-io/ristretto/z
github.com/golang/glog

This is not our first problem with glog. Previously, in https://github.com/elastic/beats/pull/27351, we forked it to fix a problem with it polluting the global FlagSet in the stdlib flag package. Another problem is that it starts a goroutine during init that cannot be stopped and therefore consumes some resources.

I propose we fork github.com/dgraph-io/ristretto/z and replace glog with the stdlib log package. It would be a drop-in replacement given that the package only uses Fatal and Fatalf (which is a terrible idea for a library because they call os.Exit()).
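A minimal sketch of what the replacement could look like inside such a fork (hypothetical wrapper names; the real fork would adjust whatever call sites exist):

```go
// Hypothetical shim for a fork of github.com/dgraph-io/ristretto/z: glog call
// sites are redirected to the stdlib logger, removing glog (and its init-time
// user lookup and background goroutine) from the dependency graph.
package z

import "log"

// Fatal replaces glog.Fatal at existing call sites. Like glog.Fatal, the
// stdlib log.Fatal terminates the process via os.Exit.
func Fatal(args ...interface{}) {
	log.Fatal(args...)
}

// Fatalf replaces glog.Fatalf at existing call sites.
func Fatalf(format string, args ...interface{}) {
	log.Fatalf(format, args...)
}
```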

What is the k8s.io/klog/v2 in the goroutine dump?

k8s.io/klog/v2 is a fork of github.com/golang/glog adopted by the k8s libraries to fix some of the previously mentioned problems (global flags, removal of the user lookup syscalls from init). It still starts a goroutine during init, though.

elasticmachine commented 2 years ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

andrewkroh commented 2 years ago

It looks like my theory was wrong. Deferring the call to after the init phase did allow the process to progress, but the same call hangs in other places, leading to other blocked goroutines.
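For context, the deferral experiment looked conceptually like this (a sketch with hypothetical names, not the actual patch): the username used by the logger is resolved lazily on first use rather than during package init.

```go
// Sketch of deferring the user lookup out of init (hypothetical names). The
// lookup can still block, but no longer while package init runs.
package logfile

import (
	"os/user"
	"sync"
)

var (
	userNameOnce sync.Once
	userName     = "unknownuser"
)

// logUserName returns the name embedded in log file names, resolving it the
// first time it is needed.
func logUserName() string {
	userNameOnce.Do(func() {
		if u, err := user.Current(); err == nil && u.Username != "" {
			userName = u.Username
		}
	})
	return userName
}
```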

andrewkroh commented 2 years ago

The call that is hanging is NetUserGetInfo("NT AUTHORITY", "SYSTEM", 10, buffer). The service runs as the system account. It's odd to see the server name parameter of the call populated with "NT AUTHORITY"; it seems like that could lead to an attempt to make an RPC call.

The code in lookupFullName seems to assume that lookupFullNameServer will fail fast if the suspected domain name ("NT AUTHORITY") is not a real server.

In this particular case the computer is not a member of a domain, but it is a member of a workgroup.

andrewkroh commented 2 years ago

We found two other mentions of similar issues when using NetUserGetInfo.

My issue wasn't in a multi-domain environment, but rather an environment that has no NetBIOS resolution available. This caused the initial timeout for NetUserGetInfo

... session gets timed out after 60 seconds waiting for Deleve to start up ... on a domain-joined computer that is currently not connected to the domain.

andrewkroh commented 2 years ago

We have a reliable means of reproducing this. If the Windows host has been hardened by disabling the "TCP/IP NetBIOS Helper" service, then the lookup takes up to 60s to time out.

Stop-Service lmhosts
Set-Service -name lmhosts -startupType disabled

We have a test program that only makes the NetUserGetInfo call. Source is available at https://gist.github.com/andrewkroh/851a9db304401068d2ba121d5b39e3c9#file-netusergetinfo-go.
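A reduced sketch of such a tester is below (Windows-only; not necessarily identical to the linked gist).

```go
// Minimal NetUserGetInfo tester (Windows-only). A sketch, not necessarily
// identical to the linked gist.
package main

import (
	"flag"
	"log"
	"syscall"
	"time"
)

func main() {
	server := flag.String("server-name", "NT AUTHORITY", "server name argument")
	username := flag.String("username", "SYSTEM", "user name argument")
	flag.Parse()

	serverPtr, err := syscall.UTF16PtrFromString(*server)
	if err != nil {
		log.Fatal(err)
	}
	userPtr, err := syscall.UTF16PtrFromString(*username)
	if err != nil {
		log.Fatal(err)
	}

	var buf *byte
	log.Printf("Calling NetUserGetInfo(%q, %q, 10, <buf>)", *server, *username)
	start := time.Now()
	err = syscall.NetUserGetInfo(serverPtr, userPtr, 10, &buf)
	log.Printf("Exited NetUserGetInfo after %s", time.Since(start))
	if buf != nil {
		syscall.NetApiBufferFree(buf)
	}
	if err != nil {
		log.Print("Error: ", err)
	}
}
```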

When the NetBIOS cache is cleared, this reliably takes 60 seconds before the call fails. This is why https://pkg.go.dev/os/user#Current takes up to 60 seconds the first time it is invoked when running as a Windows service account. Any future calls to user.Current() receive a cached response from the Go package (not from the OS).

PS C:\> nbtstat.exe -R
    Successful purge and preload of the NBT Remote Cache Name Table.

PS C:\> .\netusergetinfo.exe -server-name "NT AUTHORITY" -username "SYSTEM"
2022/06/03 18:21:10 Calling NetUserGetInfo("NT AUTHORITY", "SYSTEM", 10, <buf>)
2022/06/03 18:22:10 Exited NetUserGetInfo
2022/06/03 18:22:10 Error:The RPC server is unavailable.
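
The caching claim is easy to check with a small self-contained program (any platform; on an affected Windows host only the first call should pay the ~60 second cost):

```go
// Demonstrates that only the first user.Current call pays the lookup cost;
// later calls return the value cached inside os/user.
package main

import (
	"fmt"
	"os/user"
	"time"
)

func timeCall(label string) {
	start := time.Now()
	u, err := user.Current()
	elapsed := time.Since(start)
	if err != nil {
		fmt.Printf("%s: error=%v took=%s\n", label, err, elapsed)
		return
	}
	fmt.Printf("%s: username=%s took=%s\n", label, u.Username, elapsed)
}

func main() {
	timeCall("first call")  // may block for up to ~60s on an affected host
	timeCall("second call") // served from the cache almost instantly
}
```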
andrewkroh commented 2 years ago

The draft patch in https://github.com/elastic/beats/pull/31823 (commit 491bd97d0e30a7bd5ab1f1b25fa1f6ee2e289ae4) was tested and it does fix the service timeout issue on Windows machines where "TCP/IP NetBIOS Helper" is disabled.

The next steps are to prepare a patch for the main branch, get it reviewed, and then backport the fix into the currently maintained branches. For main, we'll need to open a patch against elastic-agent-libs because npipe was moved out of Beats.
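Wherever only the account's SID is actually needed, one alternative to user.Current (a sketch using golang.org/x/sys/windows; not necessarily how the actual patch is implemented) is to read the SID from the process token, which avoids the NetUserGetInfo / NetBIOS lookup entirely:

```go
// Sketch: obtain the current account's SID from the process token instead of
// calling user.Current. Windows-only; not necessarily what the linked PRs do.
package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/windows"
)

func currentUserSID() (*windows.SID, error) {
	// GetCurrentProcessToken returns a pseudo-handle that does not need closing.
	tok := windows.GetCurrentProcessToken()
	tu, err := tok.GetTokenUser()
	if err != nil {
		return nil, fmt.Errorf("GetTokenUser: %w", err)
	}
	return tu.User.Sid, nil
}

func main() {
	sid, err := currentUserSID()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("current user SID:", sid)
}
```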

andrewkroh commented 2 years ago

Summary of resolution

Remove unnecessary calls to user.Current() from the critical path to Windows service initialization.

PRs

main branch

8.3 branch

8.2 branch

Fixed versions

Snapshots builds for testing

These are updated ~daily (assuming there are no failures). So it may take a day before the builds include these changes. You can check the date or the commit returned by the API to see if they are updated.