elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.
Other
121 stars 131 forks source link

[Windows] Service startup failing on CI with otel dependencies linked #4976

Open michalpristas opened 2 months ago

michalpristas commented 2 months ago

At the moment, Agent is unable to run in OTel mode on Windows — support for Windows was explicitly removed by adding build constraints as part of https://github.com/elastic/elastic-agent/pull/4908. If we remove these constraints we see failures in CI like so: failed to start service (Elastic Agent): The service did not respond to the start or control request in a timely fashion.

One of the reason may be late reaction to the control events from a service manager and invocation of service.ProcessWindowsControlEvents as described in #4971

Failures are present even if we move this to init function of alphabetically very early dependencies in main. Timing here is also tricky, as dependencies get initialized before package linking them. so calling something in init does not mean it's being called first.

Event Log for elastic agent seems empty. Occasionally in linked draft above we see 24304 Jun 21 05:05 Error Elastic Agent 1 The description for Event ID '1' in Source 'Ela... that could uncover some failures, but this is not present everywhere. But as i had this timeboxed i stopped here

https://techcommunity.microsoft.com/t5/iis-support-blog/the-description-for-event-id-1-from-source-getaccess-iis-runtime/ba-p/3744637

elasticmachine commented 2 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

michalpristas commented 2 months ago

@leehinman after I spent all my timboxed time and talked to Julien and Pierre I'll let you to take over

elasticmachine commented 2 months ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

leehinman commented 2 months ago

Note to self https://pkg.go.dev/runtime the inittrace=1 option should give insight into where time is being spent.

leehinman commented 1 month ago

So I'm not entirely sure it is init that is causing the problem. I set GODEBUG=inittrace=1. And did elastic-agent run under Windows 2016.

Version before 4908

This took 406ms to get through init

tail -n 1 before_k8_4ccdd0.log
init github.com/elastic/elastic-agent/internal/pkg/agent/cmd @406 ms, 0 ms clock, 18208 bytes, 123 allocs

And one package took up over half that time, the k8s.io/api/extensions/v1beta1 package

awk '$5 > 0 {print $5, $2}' before_k8_4ccdd0.log | sort -rn
244 k8s.io/api/extensions/v1beta1
18 go.elastic.co/apm
12 github.com/aws/aws-sdk-go/aws/endpoints
10 github.com/jedib0t/go-pretty/v6/text
8.6 github.com/shirou/gopsutil/process
8.2 github.com/prometheus/client_golang/prometheus
8.0 k8s.io/client-go/tools/clientcmd
7.2 github.com/elastic/elastic-agent/internal/pkg/agent/application
5.3 github.com/shirou/gopsutil/cpu
4.1 go.elastic.co/apm/v2
4.0 k8s.io/client-go/kubernetes/scheme
3.8 github.com/elastic/elastic-agent-libs/match
2.0 go.elastic.co/apm/stacktrace
1.9 k8s.io/apimachinery/pkg/util/validation
1.9 github.com/open-telemetry/opentelemetry-collector-contrib/pkg/ottl
1.9 github.com/hashicorp/go-version
1.7 internal/poll
1.0 k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json
1.0 gopkg.in/yaml%2ev3
1.0 go.opentelemetry.io/collector/service/internal/zpages
1.0 go.opencensus.io/trace/tracestate
1.0 google.golang.org/protobuf/reflect/protodesc
1.0 google.golang.org/grpc/codes
1.0 github.com/prometheus/common/expfmt
1.0 github.com/open-telemetry/opentelemetry-collector-contrib/processor/resourcedetectionprocessor/internal/gcp
1.0 github.com/open-telemetry/opentelemetry-collector-contrib/processor/resourcedetectionprocessor/internal/env
1.0 github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/parser/container
1.0 github.com/google/go-cmp/cmp
1.0 github.com/elastic/go-sysinfo/providers/linux
1.0 github.com/elastic/go-elasticsearch/v7/estransport
1.0 github.com/elastic/elastic-agent/version
1.0 github.com/elastic/elastic-agent/internal/pkg/agent/transpiler
1.0 github.com/distribution/reference
1.0 crypto/tls
0.99 sigs.k8s.io/structured-merge-diff/v4/typed
0.99 k8s.io/client-go/tools/clientcmd/api/latest
0.90 runtime/metrics
0.71 github.com/shirou/gopsutil/v3/cpu
0.62 golang.org/x/net/trace
0.23 crypto/x509

Version with 4908 but remove the windows build contraints

This only took 206ms in init

tail -n 1 after_k8_with_windows.log
init github.com/elastic/elastic-agent/internal/pkg/agent/cmd @206 ms, 0 ms clock, 18208 bytes, 123 allocs

And this time it was github.com/leoluk/perflib_exporter/perflib that took up most of the time.

151 github.com/leoluk/perflib_exporter/perflib
9.0 go.elastic.co/apm
5.9 github.com/jedib0t/go-pretty/v6/text
3.9 k8s.io/client-go/kubernetes/scheme
2.9 github.com/aws/aws-sdk-go/aws/endpoints
1.1 go.elastic.co/apm/v2
1.0 time
1.0 k8s.io/klog/v2
1.0 k8s.io/client-go/tools/clientcmd
1.0 k8s.io/api/flowcontrol/v1beta3
1.0 k8s.io/api/core/v1
1.0 go.opencensus.io/resource
1.0 go.elastic.co/apm/v2/stacktrace
1.0 github.com/shirou/gopsutil/process
1.0 github.com/open-telemetry/opentelemetry-collector-contrib/processor/resourcedetectionprocessor/internal/env
1.0 github.com/open-telemetry/opentelemetry-collector-contrib/pkg/ottl
1.0 github.com/fatih/color
1.0 github.com/elastic/go-structform/gotype
1.0 github.com/elastic/elastic-agent/internal/pkg/agent/application/paths
1.0 github.com/distribution/reference
1.0 github.com/aws/aws-sdk-go/aws/request
1.0 github.com/alecthomas/participle/v2/lexer
0.99 k8s.io/apimachinery/pkg/runtime
0.99 internal/poll
0.99 google.golang.org/protobuf/internal/detrand
0.98 github.com/elastic/go-sysinfo/providers/linux
0.59 gopkg.in/yaml%2ev2

Version 4908 with windows constraints

This went through init much faster, only 48ms.

tail -n 1 after_k8_no_windows.log
init github.com/elastic/elastic-agent/internal/pkg/agent/cmd @48 ms, 0 ms clock, 18224 bytes, 123 allocs

With go.elastic.co/apm taking up most of the time

14 go.elastic.co/apm
4.9 github.com/jedib0t/go-pretty/v6/text
2.0 go.elastic.co/apm/v2
1.9 k8s.io/client-go/kubernetes/scheme
1.0 sigs.k8s.io/structured-merge-diff/v4/typed
1.0 k8s.io/api/policy/v1
1.0 k8s.io/api/autoscaling/v2
1.0 k8s.io/api/apidiscovery/v2beta1
1.0 internal/poll
1.0 github.com/elastic/go-sysinfo/providers/linux
1.0 github.com/elastic/elastic-agent-libs/logp
1.0 github.com/elastic/elastic-agent-client/v7/pkg/proto
1.0 github.com/distribution/reference
0.99 google.golang.org/protobuf/reflect/protodesc
0.99 github.com/kardianos/service
0.99 github.com/hectane/go-acl/api
0.99 github.com/elastic/go-structform/gotype
0.98 github.com/shirou/gopsutil/process

So #4908 actually made init faster than it was before (even without the build constraints) so it seems unlikely that increasing init time was the root cause.

leehinman commented 1 month ago

Next directions to look at

https://github.com/golang/go/issues/23479, specifically https://go-review.googlesource.com/c/sys/+/246317/3/windows/svc/sys_amd64.s and https://github.com/shirou/gopsutil/issues/570. These seem to suggest that DLL initialization may be an issue for us.

michalpristas commented 1 month ago

this is a really nice finding. not sure what we can do about that at this point though. i don't think we can optimize for DLL load time, especially in dependencies.