elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Building the Docker images can take 10+ minutes #2426

Open cmacknz opened 1 year ago

cmacknz commented 1 year ago

On my M1 macOS Monterey system, building the agent docker target takes 10+ minutes. The command to do this is:

DEV=true EXTERNAL=true SNAPSHOT=true PLATFORMS=linux/arm64 PACKAGES=docker mage -v package

This builds at least three docker images: elastic-agent, elastic-agent-cloud, and elastic-agent-complete. Usually only one of these is actually needed, so an option to control which ones are built would help.

gsantoro commented 1 year ago

I was about to write a similar issue today, since I managed to reduce the build time from 8+ minutes to 3+ minutes by hacking the code a bit. I think we could reduce the build time further, to 1 minute, with a few more modifications. Spoiler alert: there is also an elastic-agent.ubi distribution now.

I managed these performance improvements with these steps:

  1. Avoid building for distributions that I am not interested in (if like me you need to build elastic-agent to run it locally)
  2. Avoid fetching beats that I don't need

Additionally, most of the time was spent downloading the beats one by one. We could add this step to further reduce the build time:

  3. We could download the beats in parallel. I have tried to implement this, but the function that we use to download the binary is not fully parallelizable.

I will analyze each step one by one:

  1. There is a file packages.yml that we use to define the distributions to build for each combination of platform (os and architecture) and package.

For example, one of those rules for the ubi distribution is:

- os: linux
  arch: arm64
  types: [docker]
  spec:
    <<: *agent_docker_arm_spec
    <<: *docker_arm_ubi_spec
    <<: *elastic_docker_spec
    <<: *elastic_license_for_binaries
    files:
      '{{.BeatName}}{{.BinaryExt}}':
        source: ./build/golang-crossbuild/{{.BeatName}}-{{.GOOS}}-{{.Platform.Arch}}{{.BinaryExt}}

In order to only build for elastic-agent distribution I ended up commenting out any rules that mentioned:

I am not suggesting that commenting out parts of a file is a long-term solution, but we should think about distribution profiles that let you specify from the command line which distribution to build, similar to what we do now for platforms or architectures.
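The distribution-profile idea could be sketched as a filter applied to the package specs before anything is built. This is only a minimal sketch under stated assumptions, not the magefile's actual code: the `DOCKER_VARIANTS` variable, the `packageSpec` type, the variant names, and the `filterByVariant` helper are all hypothetical and chosen for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// packageSpec is a hypothetical, simplified stand-in for one rule in packages.yml.
type packageSpec struct {
	OS      string
	Arch    string
	Variant string // illustrative names such as "basic", "ubi", "cloud", "complete"
}

// parseList splits a comma-separated value into trimmed, non-empty items.
func parseList(raw string) []string {
	var out []string
	for _, item := range strings.Split(raw, ",") {
		if item = strings.TrimSpace(item); item != "" {
			out = append(out, item)
		}
	}
	return out
}

// filterByVariant keeps only the specs whose variant appears in wanted.
// An empty wanted list means "no filter": every spec is kept.
func filterByVariant(specs []packageSpec, wanted []string) []packageSpec {
	if len(wanted) == 0 {
		return specs
	}
	allowed := make(map[string]bool, len(wanted))
	for _, w := range wanted {
		allowed[w] = true
	}
	var kept []packageSpec
	for _, s := range specs {
		if allowed[s.Variant] {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	specs := []packageSpec{
		{"linux", "arm64", "basic"},
		{"linux", "arm64", "ubi"},
		{"linux", "arm64", "cloud"},
		{"linux", "arm64", "complete"},
	}
	// DOCKER_VARIANTS is a hypothetical variable, not an existing build option.
	wanted := parseList(os.Getenv("DOCKER_VARIANTS"))
	for _, s := range filterByVariant(specs, wanted) {
		fmt.Printf("building %s/%s %s\n", s.OS, s.Arch, s.Variant)
	}
}
```

Leaving the variable unset keeps today's behavior of building everything, so a filter like this would be opt-in.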

  2. Avoid downloading unnecessary beats. The code at https://github.com/elastic/elastic-agent/blob/123579115ad7a330779c2f9b5b9e719f98282dc4/magefile.go#L795 downloads this full list of beats when EXTERNAL=true:
"auditbeat", "filebeat", "heartbeat", "metricbeat", "osquerybeat", "packetbeat",
// "cloudbeat", // TODO: add once working
"cloud-defend",
"elastic-agent-shipper",
"apm-server",
"endpoint-security",
"fleet-server",
"pf-elastic-collector",
"pf-elastic-symbolizer",
"pf-host-agent"

whereas when EXTERNAL=false it only rebuilds the following beats:

"filebeat", "heartbeat", "metricbeat", "osquerybeat"

I am not very familiar with this code but I don't think those two if-branches are doing the same thing. This is because some of those "beats" like apm-server and elastic-agent-shipper have their own separate GitHub repo.

Since I wanted to avoid building the beats from source (it takes 28 minutes on my laptop) and I didn't need those extra beats, I only built for "filebeat", "heartbeat", "metricbeat", "osquerybeat". I am sure I could have made the build even faster by fetching only a smaller subset of those beats.

Similarly to step 1, we could think of a beats profile where you provide a list of beats to download or build from source.

  3. With EXTERNAL=true we fetch each of those beats one by one from the network. The code is at https://github.com/elastic/elastic-agent/blob/123579115ad7a330779c2f9b5b9e719f98282dc4/magefile.go#L810. We could instead parallelize the fetch phase so that downloads run as fast as your network allows.

I started implementing the parallelization of this code with goroutines, but unfortunately the function at https://github.com/elastic/elastic-agent/blob/123579115ad7a330779c2f9b5b9e719f98282dc4/magefile.go#L816 doesn't seem to be parallelizable. I was able to synchronize the goroutines with a Go WaitGroup, but I think that function is not correctly flushing to disk: I can't see the full file names of those beats after they are downloaded, and by the time the build moves to the next step it is too late. The entire build then fails.

I have a feature branch with the previous changes that work for me at https://github.com/gsantoro/elastic-agent/tree/feature/dev-tools-k8s.

cmacknz commented 1 year ago

We definitely need a nicer way to filter down the list of docker images that get built.

We can probably just add an environment variable that lets you specify exactly what gets downloaded, like EXTERNAL_BINS=filebeat,metricbeat to only download filebeat and metricbeat. This should be straightforward to do.
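A minimal sketch of how such a variable might be parsed follows. Only the `EXTERNAL_BINS` name and the binary list come from this thread; the `selectBinaries` helper, the choice to fall back to the full list when the variable is unset, and the choice to reject unknown names are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// defaultBinaries mirrors the list quoted earlier in this thread for EXTERNAL=true.
var defaultBinaries = []string{
	"auditbeat", "filebeat", "heartbeat", "metricbeat", "osquerybeat", "packetbeat",
	"cloud-defend", "elastic-agent-shipper", "apm-server", "endpoint-security",
	"fleet-server", "pf-elastic-collector", "pf-elastic-symbolizer", "pf-host-agent",
}

// selectBinaries returns the binaries to download.
// A blank raw value keeps the full known list. Otherwise raw is treated as a
// comma-separated subset, and any name not in known is reported as an error.
func selectBinaries(raw string, known []string) ([]string, error) {
	if strings.TrimSpace(raw) == "" {
		return known, nil
	}
	valid := make(map[string]bool, len(known))
	for _, k := range known {
		valid[k] = true
	}
	var selected []string
	for _, name := range strings.Split(raw, ",") {
		name = strings.TrimSpace(name)
		if name == "" {
			continue // tolerate stray commas
		}
		if !valid[name] {
			return nil, fmt.Errorf("unknown binary %q in EXTERNAL_BINS", name)
		}
		selected = append(selected, name)
	}
	return selected, nil
}

func main() {
	bins, err := selectBinaries(os.Getenv("EXTERNAL_BINS"), defaultBinaries)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("would download:", bins)
}
```

Rejecting unknown names up front is one possible design; it surfaces typos before any network traffic happens rather than after a failed download.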

> I am not very familiar with this code but I don't think those two if-branches are doing the same thing. This is because some of those "beats" like apm-server and elastic-agent-shipper have their own separate GitHub repo.

Yes, when EXTERNAL=false it looks for the beats repo at the same level as the elastic-agent repository and either takes the existing beats packages from it or runs mage package in Beats to build them. The two targets don't do the same thing. TBH I'm not sure how valuable this really is.

It would probably be more valuable to allow EXTERNAL_BINS, or whatever we decide to name it, to take file paths, URLs, or just binary names, giving you complete control over where the dependencies come from.
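One way to tell the three kinds of entry apart is sketched below. The `classify` function, the `sourceKind` type, and the rules it applies (an http or https scheme means URL; a path separator or leading dot means file path; anything else is a bare binary name) are assumptions for illustration, not an agreed design.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// sourceKind says where a single dependency entry should be taken from.
type sourceKind int

const (
	kindName sourceKind = iota // bare binary name, resolved the default way
	kindPath                   // local file path
	kindURL                    // http(s) download location
)

// classify decides how one entry should be interpreted.
// The rules here are illustrative assumptions, not existing behavior.
func classify(entry string) sourceKind {
	u, err := url.Parse(entry)
	if err == nil && (u.Scheme == "http" || u.Scheme == "https") && u.Host != "" {
		return kindURL
	}
	if strings.ContainsAny(entry, `/\`) || strings.HasPrefix(entry, ".") {
		return kindPath
	}
	return kindName
}

func main() {
	labels := map[sourceKind]string{kindName: "name", kindPath: "path", kindURL: "url"}
	entries := []string{
		"filebeat",
		"./build/distributions/metricbeat.tar.gz",     // hypothetical local path
		"https://example.com/downloads/apm-server.tgz", // hypothetical URL
	}
	for _, e := range entries {
		fmt.Printf("%s -> %s\n", e, labels[classify(e)])
	}
}
```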

There is also the DROP_PATH variable that allows you to manually construct the directory binaries will be sourced from, but when constantly changing branches and versions this is a pain to maintain. It's not quite automatic enough.

ycombinator commented 2 months ago

@blakerouse Would your recent improvements via https://github.com/elastic/elastic-agent/pull/5338 resolve this issue?