Open michalpristas opened 5 months ago
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
@leehinman after I spent all my timboxed time and talked to Julien and Pierre I'll let you to take over
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
Note to self https://pkg.go.dev/runtime the inittrace=1
option should give insight into where time is being spent.
So I'm not entirely sure it is init
that is causing the problem. I set GODEBUG=inittrace=1. And did elastic-agent run
under Windows 2016.
This took 406ms to get through init
tail -n 1 before_k8_4ccdd0.log
init github.com/elastic/elastic-agent/internal/pkg/agent/cmd @406 ms, 0 ms clock, 18208 bytes, 123 allocs
And one package took up over half that time, the k8s.io/api/extensions/v1beta1
package
awk '$5 > 0 {print $5, $2}' before_k8_4ccdd0.log | sort -rn
244 k8s.io/api/extensions/v1beta1
18 go.elastic.co/apm
12 github.com/aws/aws-sdk-go/aws/endpoints
10 github.com/jedib0t/go-pretty/v6/text
8.6 github.com/shirou/gopsutil/process
8.2 github.com/prometheus/client_golang/prometheus
8.0 k8s.io/client-go/tools/clientcmd
7.2 github.com/elastic/elastic-agent/internal/pkg/agent/application
5.3 github.com/shirou/gopsutil/cpu
4.1 go.elastic.co/apm/v2
4.0 k8s.io/client-go/kubernetes/scheme
3.8 github.com/elastic/elastic-agent-libs/match
2.0 go.elastic.co/apm/stacktrace
1.9 k8s.io/apimachinery/pkg/util/validation
1.9 github.com/open-telemetry/opentelemetry-collector-contrib/pkg/ottl
1.9 github.com/hashicorp/go-version
1.7 internal/poll
1.0 k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json
1.0 gopkg.in/yaml%2ev3
1.0 go.opentelemetry.io/collector/service/internal/zpages
1.0 go.opencensus.io/trace/tracestate
1.0 google.golang.org/protobuf/reflect/protodesc
1.0 google.golang.org/grpc/codes
1.0 github.com/prometheus/common/expfmt
1.0 github.com/open-telemetry/opentelemetry-collector-contrib/processor/resourcedetectionprocessor/internal/gcp
1.0 github.com/open-telemetry/opentelemetry-collector-contrib/processor/resourcedetectionprocessor/internal/env
1.0 github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/parser/container
1.0 github.com/google/go-cmp/cmp
1.0 github.com/elastic/go-sysinfo/providers/linux
1.0 github.com/elastic/go-elasticsearch/v7/estransport
1.0 github.com/elastic/elastic-agent/version
1.0 github.com/elastic/elastic-agent/internal/pkg/agent/transpiler
1.0 github.com/distribution/reference
1.0 crypto/tls
0.99 sigs.k8s.io/structured-merge-diff/v4/typed
0.99 k8s.io/client-go/tools/clientcmd/api/latest
0.90 runtime/metrics
0.71 github.com/shirou/gopsutil/v3/cpu
0.62 golang.org/x/net/trace
0.23 crypto/x509
This only took 206ms in init
tail -n 1 after_k8_with_windows.log
init github.com/elastic/elastic-agent/internal/pkg/agent/cmd @206 ms, 0 ms clock, 18208 bytes, 123 allocs
And this time it was github.com/leoluk/perflib_exporter/perflib
that took up most of the time.
151 github.com/leoluk/perflib_exporter/perflib
9.0 go.elastic.co/apm
5.9 github.com/jedib0t/go-pretty/v6/text
3.9 k8s.io/client-go/kubernetes/scheme
2.9 github.com/aws/aws-sdk-go/aws/endpoints
1.1 go.elastic.co/apm/v2
1.0 time
1.0 k8s.io/klog/v2
1.0 k8s.io/client-go/tools/clientcmd
1.0 k8s.io/api/flowcontrol/v1beta3
1.0 k8s.io/api/core/v1
1.0 go.opencensus.io/resource
1.0 go.elastic.co/apm/v2/stacktrace
1.0 github.com/shirou/gopsutil/process
1.0 github.com/open-telemetry/opentelemetry-collector-contrib/processor/resourcedetectionprocessor/internal/env
1.0 github.com/open-telemetry/opentelemetry-collector-contrib/pkg/ottl
1.0 github.com/fatih/color
1.0 github.com/elastic/go-structform/gotype
1.0 github.com/elastic/elastic-agent/internal/pkg/agent/application/paths
1.0 github.com/distribution/reference
1.0 github.com/aws/aws-sdk-go/aws/request
1.0 github.com/alecthomas/participle/v2/lexer
0.99 k8s.io/apimachinery/pkg/runtime
0.99 internal/poll
0.99 google.golang.org/protobuf/internal/detrand
0.98 github.com/elastic/go-sysinfo/providers/linux
0.59 gopkg.in/yaml%2ev2
This went through init much faster, only 48ms.
tail -n 1 after_k8_no_windows.log
init github.com/elastic/elastic-agent/internal/pkg/agent/cmd @48 ms, 0 ms clock, 18224 bytes, 123 allocs
With go.elastic.co/apm
taking up most of the time
14 go.elastic.co/apm
4.9 github.com/jedib0t/go-pretty/v6/text
2.0 go.elastic.co/apm/v2
1.9 k8s.io/client-go/kubernetes/scheme
1.0 sigs.k8s.io/structured-merge-diff/v4/typed
1.0 k8s.io/api/policy/v1
1.0 k8s.io/api/autoscaling/v2
1.0 k8s.io/api/apidiscovery/v2beta1
1.0 internal/poll
1.0 github.com/elastic/go-sysinfo/providers/linux
1.0 github.com/elastic/elastic-agent-libs/logp
1.0 github.com/elastic/elastic-agent-client/v7/pkg/proto
1.0 github.com/distribution/reference
0.99 google.golang.org/protobuf/reflect/protodesc
0.99 github.com/kardianos/service
0.99 github.com/hectane/go-acl/api
0.99 github.com/elastic/go-structform/gotype
0.98 github.com/shirou/gopsutil/process
So #4908 actually made init
faster than it was before (even without the build constraints) so it seems unlikely that increasing init
time was the root cause.
Next directions to look at
https://github.com/golang/go/issues/23479, specifically https://go-review.googlesource.com/c/sys/+/246317/3/windows/svc/sys_amd64.s and https://github.com/shirou/gopsutil/issues/570. These seem to suggest that DLL initialization may be an issue for us.
this is a really nice finding. not sure what we can do about that at this point though. i don't think we can optimize for DLL load time, especially in dependencies.
At the moment, Agent is unable to run in OTel mode on Windows — support for Windows was explicitly removed by adding build constraints as part of https://github.com/elastic/elastic-agent/pull/4908. If we remove these constraints we see failures in CI like so:
failed to start service (Elastic Agent): The service did not respond to the start or control request in a timely fashion.
One of the reason may be late reaction to the control events from a service manager and invocation of
service.ProcessWindowsControlEvents
as described in #4971Failures are present even if we move this to init function of alphabetically very early dependencies in main. Timing here is also tricky, as dependencies get initialized before package linking them. so calling something in init does not mean it's being called first.
Event Log for elastic agent seems empty. Occasionally in linked draft above we see
24304 Jun 21 05:05 Error Elastic Agent 1 The description for Event ID '1' in Source 'Ela...
that could uncover some failures, but this is not present everywhere. But as i had this timeboxed i stopped herehttps://techcommunity.microsoft.com/t5/iis-support-blog/the-description-for-event-id-1-from-source-getaccess-iis-runtime/ba-p/3744637