elastic / apm-agent-nodejs

https://www.elastic.co/guide/en/apm/agent/nodejs/current/index.html
BSD 2-Clause "Simplified" License

[main branch] unexplained long hang/pauses in CI runners causing build timeouts #2624

Open cachedout opened 2 years ago

cachedout commented 2 years ago

We detected what appears to be a hang in the test suite when running .ci/scripts/test.sh -b "release" -t "apollo-server-express" "8":

https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406

Relevant log entries:

<snip>

[2022-03-28T07:11:49.510Z] node_tests_1  | -- required packages ["graphql@^0.12.0","apollo-server-express@2.11.0"]
[2022-03-28T07:11:49.510Z] node_tests_1  | -- installing ["graphql@^0.12.0","apollo-server-express@2.11.0"]
[2022-03-28T07:11:59.500Z] node_tests_1  | -- running test "node test/instrumentation/modules/apollo-server-express.test.js" with apollo-server-express
[2022-03-28T07:12:01.408Z] node_tests_1  | -- running preinstall "npm uninstall express-graphql" for apollo-server-express
[2022-03-28T07:12:11.426Z] node_tests_1  | npm WARN notsup Unsupported engine for apollo-server-express@3.6.6: wanted: {"node":">=12.0"} (current: {"node":"8.17.0","npm":"6.13.4"})
[2022-03-28T07:12:11.427Z] node_tests_1  | npm WARN notsup Not compatible with your version of node/npm: apollo-server-express@3.6.6
[2022-03-28T07:12:11.427Z] node_tests_1  | npm WARN notsup Unsupported engine for graphql@15.8.0: wanted: {"node":">= 10.x"} (current: {"node":"8.17.0","npm":"6.13.4"})
[2022-03-28T07:12:11.427Z] node_tests_1  | npm WARN notsup Not compatible with your version of node/npm: graphql@15.8.0
[2022-03-28T07:12:11.427Z] node_tests_1  | npm WARN optional SKIPPING OPTIONAL DEPENDENCY: fsevents@~2.3.2 (node_modules/chokidar/node_modules/fsevents):
[2022-03-28T07:12:11.427Z] node_tests_1  | npm WARN notsup SKIPPING OPTIONAL DEPENDENCY: Unsupported platform for fsevents@2.3.2: wanted {"os":"darwin","arch":"any"} (current: {"os":"linux","arch":"x64"})
[2022-03-28T07:12:11.427Z] node_tests_1  | 
[2022-03-28T07:12:11.427Z] node_tests_1  | -- required packages ["graphql@^0.12.0","apollo-server-express@2.10.1"]
[2022-03-28T07:12:11.427Z] node_tests_1  | -- installing ["graphql@^0.12.0","apollo-server-express@2.10.1"]
[2022-03-28T07:12:23.643Z] node_tests_1  | -- running test "node test/instrumentation/modules/apollo-server-express.test.js" with apollo-server-express
[2022-03-28T07:12:24.213Z] node_tests_1  | -- running preinstall "npm uninstall express-graphql" for apollo-server-express
[2022-03-28T07:12:34.199Z] node_tests_1  | npm WARN notsup Unsupported engine for apollo-server-express@3.6.6: wanted: {"node":">=12.0"} (current: {"node":"8.17.0","npm":"6.13.4"})
[2022-03-28T07:12:34.199Z] node_tests_1  | npm WARN notsup Not compatible with your version of node/npm: apollo-server-express@3.6.6
[2022-03-28T07:12:34.199Z] node_tests_1  | npm WARN notsup Unsupported engine for graphql@15.8.0: wanted: {"node":">= 10.x"} (current: {"node":"8.17.0","npm":"6.13.4"})
[2022-03-28T07:12:34.199Z] node_tests_1  | npm WARN notsup Not compatible with your version of node/npm: graphql@15.8.0
[2022-03-28T07:12:34.199Z] node_tests_1  | npm WARN optional SKIPPING OPTIONAL DEPENDENCY: fsevents@~2.3.2 (node_modules/chokidar/node_modules/fsevents):
[2022-03-28T07:12:34.199Z] node_tests_1  | npm WARN notsup SKIPPING OPTIONAL DEPENDENCY: Unsupported platform for fsevents@2.3.2: wanted {"os":"darwin","arch":"any"} (current: {"os":"linux","arch":"x64"})
[2022-03-28T07:12:34.199Z] node_tests_1  | 
[2022-03-28T07:12:34.199Z] node_tests_1  | -- required packages ["graphql@^0.12.0","apollo-server-express@2.9.16"]
[2022-03-28T07:12:34.199Z] node_tests_1  | -- installing ["graphql@^0.12.0","apollo-server-express@2.9.16"]
[2022-03-28T07:12:46.409Z] node_tests_1  | -- running test "node test/instrumentation/modules/apollo-server-express.test.js" with apollo-server-express
[2022-03-28T07:12:47.798Z] node_tests_1  | -- ok
[2022-03-28T08:48:52.336Z] Sending interrupt signal to process
[2022-03-28T08:48:58.013Z] Stopping docker_node_tests_1 ... 
[2022-03-28T08:48:58.014Z] Killing docker_node_tests_1  ... 
[2022-03-28T08:48:58.014Z] Killing docker_node_tests_1  ... done
[2022-03-28T08:48:58.014Z] Gracefully stopping... (press Ctrl+C again to force)
[2022-03-28T08:48:58.331Z] Sending interrupt signal to process
[2022-03-28T08:48:58.353Z] script returned exit code 2
trentm commented 2 years ago

tl;dr: I think this is better explained as "unexpected hangs/pauses in CI runner VMs".

The 1h36m hang in the above log looks like a test hang on its own; however, looking at some of the surrounding tests suggests something else is going on.

  1. I cannot repro the hang when running locally:
[14:42:28 trentm@pink:~/el/apm-agent-nodejs (git:main)]
% time .ci/scripts/test.sh -b "release" -t "apollo-server-express" "8"
...
node_tests_1  |
node_tests_1  | -- required packages ["graphql@^0.12.0","apollo-server-express@2.9.16"]
node_tests_1  | -- installing ["graphql@^0.12.0","apollo-server-express@2.9.16"]
node_tests_1  | -- running test "node test/instrumentation/modules/apollo-server-express.test.js" with apollo-server-express
node_tests_1  | -- ok
docker_node_tests_1 exited with code 0
Aborting on container exit...
.ci/scripts/test.sh:268: NODE_VERSION=8
.ci/scripts/test.sh:268: docker-compose --no-ansi --log-level ERROR -f .ci/docker/docker-compose-node-test.yml down -v --remove-orphans
--no-ansi option is deprecated and will be removed in future versions. Use `--ansi never` instead.
Removing docker_node_tests_1 ...
Removing docker_node_tests_1 ... done
.ci/scripts/test.sh -b "release" -t "apollo-server-express" "8"  2.94s user 0.68s system 0% cpu 36:52.35 total
  2. I cannot repro the hang in the Node v8 + apollo-server-express step in CI: https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/96/pipeline/526 completed fine. However, there is a different step in this build 96 that is hanging: the .ci/scripts/test.sh -b "release" -t "bluebird" "12" step: https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/96/pipeline/458

  2. Our "TAV Test" stage is the longest. It typically and reliably has been taking slightly over 1h to complete. However, starting some time after Mar 23rd, that time started getting unexpectedly longer. A screenshot of the current https://apm-ci.elastic.co/job/apm-agent-nodejs/job/apm-agent-nodejs-mbp/job/main/

[Screenshot: main-branch build durations as of 2022-03-28, 2:53 PM]

The preceding build 94 actually passed (the full run completed in 2 hr 53 min, slightly under the 3h timeout set on the build), but it took abnormally long. Using the dev-utils script to list the slowest Jenkins build steps, the outlier was the jade TAV test with Node v12, which took 6104897 ms (~101 minutes):

% ./dev-utils/jenkins-build-slow-steps.sh https://apm-ci.elastic.co/job/apm-agent-nodejs/job/apm-agent-nodejs-mbp/job/main/94/ | tail
1311604 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "knex" "14"
1323594 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "pg" "12"
1327722 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "apollo-server-express" "8"
1330001 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "knex" "10"
1441287 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "graphql" "14"
1465226 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "graphql" "12"
1519466 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "knex" "12"
1836015 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "apollo-server-express" "12"
1901324 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "apollo-server-express" "14"
6104897 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "jade" "12"

From the log for that build step, there was an unexplained 1h40m pause during its npm install:

...
[2022-03-25T06:39:19.419Z] node_tests_1  | npm WARN deprecated resolve-url@0.2.1: https://github.com/lydell/resolve-url#deprecated
[2022-03-25T06:39:19.419Z] node_tests_1  | npm WARN deprecated source-map-url@0.4.1: See https://github.com/lydell/source-map-url#deprecated
[2022-03-25T06:39:19.419Z] node_tests_1  | npm WARN deprecated urix@0.1.0: Please see https://github.com/lydell/urix#deprecated
[2022-03-25T08:20:15.644Z] node_tests_1  |
[2022-03-25T08:20:15.645Z] node_tests_1  | > dtrace-provider@0.8.8 install /app/node_modules/dtrace-provider
[2022-03-25T08:20:15.645Z] node_tests_1  | > node-gyp rebuild || node suppress-error.js
...
trentm commented 2 years ago

@cachedout Any ideas? Do you know if other projects have seen slow/hung/timing-out test runner runs?

trentm commented 2 years ago

That build that I mentioned in point "2." above eventually took ~122 minutes and then did complete. Its log https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/96/pipeline/458 includes:

...
[2022-03-28T20:48:50.584Z] node_tests_1  | npm WARN deprecated resolve-url@0.2.1: https://github.com/lydell/resolve-url#deprecated
[2022-03-28T20:48:50.584Z] node_tests_1  | npm WARN deprecated source-map-url@0.4.1: See https://github.com/lydell/source-map-url#deprecated
[2022-03-28T20:48:50.584Z] node_tests_1  | npm WARN deprecated urix@0.1.0: Please see https://github.com/lydell/urix#deprecated
[2022-03-28T22:29:31.259Z] node_tests_1  | 
[2022-03-28T22:29:31.263Z] node_tests_1  | > dtrace-provider@0.8.8 install /app/node_modules/dtrace-provider
[2022-03-28T22:29:31.263Z] node_tests_1  | > node-gyp rebuild || node suppress-error.js
[2022-03-28T22:29:31.263Z] node_tests_1  |
...

which shows a 1h40m delay during the initial npm install.

That seems suspiciously similar to the "unexplained 1h40m pause during its npm install" from point "3." above! The time differences in ms between the log messages before and after the pause in the two logs are:

> t2 = new Date('2022-03-28T22:29:31.259Z')
2022-03-28T22:29:31.259Z
> t1 = new Date('2022-03-28T20:48:50.584Z')
2022-03-28T20:48:50.584Z
> t2 - t1
6040675
>
>
> t2 = new Date('2022-03-25T08:20:15.644Z')
2022-03-25T08:20:15.644Z
> t1 = new Date('2022-03-25T06:39:19.419Z')
2022-03-25T06:39:19.419Z
> t2 - t1
6056225

Pretty close. They are also close to 100 minutes expressed in ms. That could be an error timeout somewhere... perhaps in the depths of npm?

This doesn't, however, explain the pause/hang in the original log (the Node 8 + apollo-server-express case).
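(Aside: to spot these pauses without scrolling through an entire console log, a small throwaway script can scan a downloaded log for long gaps between consecutive timestamped lines. A minimal sketch, not part of the CI tooling; the console.log file name and the 10-minute threshold are assumptions for illustration only:

// gap-scan.js -- sketch: report long pauses between timestamped Jenkins log lines.
// Usage (hypothetical): node gap-scan.js console.log 10
'use strict'
const fs = require('fs')

const file = process.argv[2] || 'console.log'
const thresholdMin = Number(process.argv[3] || 10)
const tsRe = /^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z)\]/
let prev = null

for (const line of fs.readFileSync(file, 'utf8').split('\n')) {
  const m = tsRe.exec(line)
  if (!m) continue
  const t = new Date(m[1])
  if (prev && t - prev.t > thresholdMin * 60 * 1000) {
    console.log(`gap of ${((t - prev.t) / 60000).toFixed(1)} min between:`)
    console.log(`  ${prev.line}`)
    console.log(`  ${line}`)
  }
  prev = { t, line }
}

This would have flagged the 1h40m gap above without manual inspection.)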

trentm commented 2 years ago

I'm running https://apm-ci.elastic.co/job/apm-agent-nodejs/job/apm-agent-nodejs-mbp/job/main/97/ to try again to see if we get a similar pause somewhere.

cachedout commented 2 years ago

@cachedout Any ideas? Do you know if other projects have seen slow/hung/timing-out test runner runs?

@trentm Nothing immediately comes to mind but I did just return from paternity leave so I'm still getting my head around the current state of affairs.

@v1v Do any changes in the above time frame ring a bell for you?

v1v commented 2 years ago

@v1v Do any changes in the above time frame ring a bell for you?

I wonder if a recent docker upgrade has happened on the CI workers. I recall reading that a recent upgrade affected something somewhere (sorry if I'm not precise, but I don't recall exactly where I read it).

If those tests use docker, maybe that's one of the reasons...

v1v commented 2 years ago

Our "TAV Test" stage is the longest. It typically and reliably has been taking slightly over 1h to complete. However, starting some time after Mar 23rd, that time started getting unexpectedly longer.

As far as I can see, docker has changed recently; see the output below from docker version on one of the CI workers:

Client: Docker Engine - Community
 Version:           20.10.14
 API version:       1.41
 Built:             Thu Mar 24 01:47:57 2022
...

Server: Docker Engine - Community
 Engine:
  Version:          20.10.14
  API version:      1.41 (minimum version 1.12)
  Built:            Thu Mar 24 01:45:46 2022
trentm commented 2 years ago

If those tests use docker,

They do use docker.

cachedout commented 2 years ago

Refs: https://github.com/elastic/infra/issues/35413

trentm commented 2 years ago

For what it is worth, this seems to have "gone away". Looking at recent apm-agent-nodejs build durations: [Screenshot: recent build durations as of 2022-03-30, 9:21 AM]

I'm closing this as "cannot repro". We can re-open if things start failing or looking slow again. I'm not sure how we could improve observability of the build steps from our end.
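One untested idea for making these hangs more visible from our side: wrap the test command in a small watchdog that prints a marker whenever the child process has been silent for a while, so the console log itself shows when the pause started. A rough sketch only; the watchdog.js name, the 5-minute limit, and the invocation are hypothetical, not existing CI tooling:

// watchdog.js -- illustrative sketch, not an agreed-upon change to .ci/scripts.
// Spawns a command and warns whenever it has produced no output for too long.
'use strict'
const { spawn } = require('child_process')

const SILENCE_LIMIT_MS = 5 * 60 * 1000 // assumption: warn after 5 minutes of silence
const cmd = process.argv[2]
const args = process.argv.slice(3)
if (!cmd) {
  console.error('usage: node watchdog.js <command> [args...]')
  process.exit(2)
}

const child = spawn(cmd, args, { stdio: ['ignore', 'pipe', 'pipe'] })
let lastOutput = Date.now()

function onData (chunk) {
  lastOutput = Date.now()
  process.stdout.write(chunk)
}
child.stdout.on('data', onData)
child.stderr.on('data', onData)

const timer = setInterval(() => {
  const silentMs = Date.now() - lastOutput
  if (silentMs > SILENCE_LIMIT_MS) {
    console.error(`[watchdog] no output for ${Math.round(silentMs / 60000)} min`)
  }
}, 60 * 1000)

child.on('exit', (code) => {
  clearInterval(timer)
  process.exit(code === null ? 1 : code)
})

A hypothetical invocation would be something like: node watchdog.js .ci/scripts/test.sh -b "release" -t "jade" "12".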

v1v commented 2 years ago

Interesting. I've just enabled https://github.com/elastic/apm-agent-nodejs/pull/2629 to run a few times with a pre-docker-upgrade image.

I'll keep an eye on it for a day and then provide further details here before closing it.

v1v commented 2 years ago

crossposting -> https://github.com/elastic/apm-agent-nodejs/pull/2629#issuecomment-1085833914

trentm commented 2 years ago

This is happening again: [Screenshot: build durations as of 2022-04-13, 2:14 PM]

For the latest one (build 114) it was the "Test / Node.js-14-async-hooks-false" build step that had an unexplained hang. From the log:

...
[2022-04-13T05:54:51.095Z] node_tests_1     | npm WARN deprecated resolve-url@0.2.1: https://github.com/lydell/resolve-url#deprecated
[2022-04-13T05:54:51.095Z] node_tests_1     | npm WARN deprecated source-map-url@0.4.1: See https://github.com/lydell/source-map-url#deprecated
[2022-04-13T05:54:51.095Z] node_tests_1     | npm WARN deprecated urix@0.1.0: Please see https://github.com/lydell/urix#deprecated
[2022-04-13T07:35:47.210Z] node_tests_1     | 
[2022-04-13T07:35:47.211Z] node_tests_1     | > dtrace-provider@0.8.8 install /app/node_modules/dtrace-provider
[2022-04-13T07:35:47.211Z] node_tests_1     | > node-gyp rebuild || node suppress-error.js
...

That delay was again (see earlier comment https://github.com/elastic/apm-agent-nodejs/issues/2624#issuecomment-1081258866) approximately 100 minutes:

> t2 = new Date('2022-04-13T07:35:47.210Z')
2022-04-13T07:35:47.210Z
> t1 = new Date('2022-04-13T05:54:51.095Z')
2022-04-13T05:54:51.095Z
> t2 - t1
6056115
> (t2 - t1) / 1000 / 60
100.93525

and it occurred in the same part of the npm install. Could it be the binary build step for the 'dtrace-provider' npm package? That consistent ~100 minutes suggests some hardcoded timeout.
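For the record, restating the arithmetic from the three logs quoted in this issue, the observed pauses are within seconds of each other:

// Pauses observed so far, in ms, taken from the logs quoted above:
// build 94 (jade, Node 12), build 96 (bluebird, Node 12), build 114 (Node.js-14-async-hooks-false).
const gapsMs = [6056225, 6040675, 6056115]
gapsMs.forEach((ms) => console.log(`${ms} ms = ${(ms / 60000).toFixed(2)} min`))
// 6056225 ms = 100.94 min
// 6040675 ms = 100.68 min
// 6056115 ms = 100.94 min
// All three fall within ~16 seconds of each other, which is what makes a
// hardcoded ~100-minute timeout look plausible.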