cachedout opened this issue 2 years ago
tl;dr: I think this is better explained as "unexpected hangs/pauses in CI runner VMs".
The 1h36m hang in the above log looks like a test hang on its own; however, looking at some surrounding tests makes it feel like something else is going on. Running the same test target locally, it completed:
[14:42:28 trentm@pink:~/el/apm-agent-nodejs (git:main)]
% time .ci/scripts/test.sh -b "release" -t "apollo-server-express" "8"
...
node_tests_1 |
node_tests_1 | -- required packages ["graphql@^0.12.0","apollo-server-express@2.9.16"]
node_tests_1 | -- installing ["graphql@^0.12.0","apollo-server-express@2.9.16"]
node_tests_1 | -- running test "node test/instrumentation/modules/apollo-server-express.test.js" with apollo-server-express
node_tests_1 | -- ok
docker_node_tests_1 exited with code 0
Aborting on container exit...
.ci/scripts/test.sh:268: NODE_VERSION=8
.ci/scripts/test.sh:268: docker-compose --no-ansi --log-level ERROR -f .ci/docker/docker-compose-node-test.yml down -v --remove-orphans
--no-ansi option is deprecated and will be removed in future versions. Use `--ansi never` instead.
Removing docker_node_tests_1 ...
Removing docker_node_tests_1 ... done
.ci/scripts/test.sh -b "release" -t "apollo-server-express" "8" 2.94s user 0.68s system 0% cpu 36:52.35 total
I cannot repro the hang in the Node v8 + apollo-server-express step in CI: https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/96/pipeline/526 completed. However, there is a different step in build 96 that is hanging: the .ci/scripts/test.sh -b "release" -t "bluebird" "12" step: https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/96/pipeline/458
Our "TAV Test" stage is the longest. It typically and reliably has been taking slightly over 1h to complete. However, starting some time after Mar 23rd, that time started getting unexpectedly longer. A screenshot of the current https://apm-ci.elastic.co/job/apm-agent-nodejs/job/apm-agent-nodejs-mbp/job/main/
The preceding build 94 actually passed (the full build run completed in 2h 53m, slightly under the 3h timeout set on the build), but it was abnormally slow. Using the dev util script to list the slowest Jenkins build steps, it was the jade TAV test with Node v12 that stood out (6104897 ms ≈ 101 minutes):
% ./dev-utils/jenkins-build-slow-steps.sh https://apm-ci.elastic.co/job/apm-agent-nodejs/job/apm-agent-nodejs-mbp/job/main/94/ | tail
1311604 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "knex" "14"
1323594 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "pg" "12"
1327722 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "apollo-server-express" "8"
1330001 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "knex" "10"
1441287 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "graphql" "14"
1465226 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "graphql" "12"
1519466 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "knex" "12"
1836015 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "apollo-server-express" "12"
1901324 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "apollo-server-express" "14"
6104897 FINISHED SUCCESS .ci/scripts/test.sh -b "release" -t "jade" "12"
From the log for that build step, there was an unexplained 1h40m pause during its npm install:
...
[2022-03-25T06:39:19.419Z] node_tests_1 | npm WARN deprecated resolve-url@0.2.1: https://github.com/lydell/resolve-url#deprecated
[2022-03-25T06:39:19.419Z] node_tests_1 | npm WARN deprecated source-map-url@0.4.1: See https://github.com/lydell/source-map-url#deprecated
[2022-03-25T06:39:19.419Z] node_tests_1 | npm WARN deprecated urix@0.1.0: Please see https://github.com/lydell/urix#deprecated
[2022-03-25T08:20:15.644Z] node_tests_1 |
[2022-03-25T08:20:15.645Z] node_tests_1 | > dtrace-provider@0.8.8 install /app/node_modules/dtrace-provider
[2022-03-25T08:20:15.645Z] node_tests_1 | > node-gyp rebuild || node suppress-error.js
...
@cachedout Any ideas? Do you know if other projects have seen slow/hung/timing-out test runner runs?
That build step that I mentioned in point "2." above eventually took ~122 minutes and then did complete. Its log https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/96/pipeline/458 includes:
...
[2022-03-28T20:48:50.584Z] node_tests_1 | npm WARN deprecated resolve-url@0.2.1: https://github.com/lydell/resolve-url#deprecated
[2022-03-28T20:48:50.584Z] node_tests_1 | npm WARN deprecated source-map-url@0.4.1: See https://github.com/lydell/source-map-url#deprecated
[2022-03-28T20:48:50.584Z] node_tests_1 | npm WARN deprecated urix@0.1.0: Please see https://github.com/lydell/urix#deprecated
[2022-03-28T22:29:31.259Z] node_tests_1 |
[2022-03-28T22:29:31.263Z] node_tests_1 | > dtrace-provider@0.8.8 install /app/node_modules/dtrace-provider
[2022-03-28T22:29:31.263Z] node_tests_1 | > node-gyp rebuild || node suppress-error.js
[2022-03-28T22:29:31.263Z] node_tests_1 |
...
which shows a 1h40m delay during the initial npm install.
That seems suspiciously similar to the "there was an unexplained 1h40m pause during its npm install" from point "3." above! The time differences in ms between the log messages before and after the pause in these two logs are:
> t2 = new Date('2022-03-28T22:29:31.259Z')
2022-03-28T22:29:31.259Z
> t1 = new Date('2022-03-28T20:48:50.584Z')
2022-03-28T20:48:50.584Z
> t2 - t1
6040675
>
>
> t2 = new Date('2022-03-25T08:20:15.644Z')
2022-03-25T08:20:15.644Z
> t1 = new Date('2022-03-25T06:39:19.419Z')
2022-03-25T06:39:19.419Z
> t2 - t1
6056225
Pretty close. Both are also close to 100 minutes expressed in milliseconds. That could be an error timeout somewhere... perhaps in the depths of npm?
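If an npm-internal timeout is the culprit, one way to narrow it down (a sketch only, not something the test script currently does) would be to dump npm's effective network/retry settings inside the same container image the failing step uses and see whether any combination of them adds up to ~100 minutes:
# Hypothetical check; "node:12" is an assumption, substitute whatever image
# the docker-compose file selects for the failing step.
docker run --rm node:12 sh -c "npm config ls -l | grep -E 'retr|timeout'"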
This doesn't explain the pause or hang in the original log (for the Node v8 + apollo-server-express case).
I'm running https://apm-ci.elastic.co/job/apm-agent-nodejs/job/apm-agent-nodejs-mbp/job/main/97/ to try again to see if we get a similar pause somewhere.
> @cachedout Any ideas? Do you know if other projects have seen slow/hung/timing-out test runner runs?
@trentm Nothing immediately comes to mind, but I did just return from paternity leave, so I'm still getting my head around the current state of affairs.
@v1v Do any changes in the above time frame ring a bell for you?
> @v1v Do any changes in the above time frame ring a bell for you?
I wonder if a recent docker upgrade has happened on the CI workers. I recall reading that a recent upgrade affected something somewhere (sorry if I'm not precise, but I don't recall exactly where I read it).
If those tests use docker, maybe that's one of the reasons...
Our "TAV Test" stage is the longest. It typically and reliably has been taking slightly over 1h to complete. However, starting some time after Mar 23rd, that time started getting unexpectedly longer.
As far as I can see, docker has changed recently; see the below output from docker version in one of the CI workers:
Client: Docker Engine - Community
Version: 20.10.14
API version: 1.41
Built: Thu Mar 24 01:47:57 2022
...
Server: Docker Engine - Community
Engine:
Version: 20.10.14
API version: 1.41 (minimum version 1.12)
Built: Thu Mar 24 01:45:46 2022
> If those tests use docker,
They do use docker.
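To correlate the slowdown with the engine upgrade, one quick check (hypothetical; the worker hostnames below are placeholders) would be to record the docker server version on each CI worker that ran a slow build and compare it against the Mar 24 build date seen above:
# Hypothetical; replace the hostnames with the actual CI workers.
for host in ci-worker-01 ci-worker-02; do
  echo "== $host"
  ssh "$host" "docker version --format '{{.Server.Version}}'"
done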
For what it is worth, this seems to have "gone away", based on recent apm-agent-nodejs build durations (screenshot not included here).
I'm closing this as "cannot repro". We can re-open if things start failing or looking slow again. I'm not sure how we could improve observability of the build steps from our end.
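One small observability idea (not implemented, just a sketch) would be to also capture container-side timestamps for the test service, so a future pause could be attributed either to work inside the container or to log delivery on the Jenkins side:
# Hypothetical: follow the test service with docker-compose's own timestamps,
# in addition to the Jenkins [2022-...Z] prefixes. The service name is inferred
# from the docker_node_tests_1 container name above.
docker-compose -f .ci/docker/docker-compose-node-test.yml logs --follow --timestamps node_tests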
Interesting. I've just enabled https://github.com/elastic/apm-agent-nodejs/pull/2629 to run a few times with a pre-docker-upgrade image.
I'll keep an eye on it for a day and then provide further details here before closing it.
This is happening again.
For the latest one (build 114), it was the "Test / Node.js-14-async-hooks-false" build step that had an unexplained hang. From the log:
...
[2022-04-13T05:54:51.095Z] node_tests_1 | npm WARN deprecated resolve-url@0.2.1: https://github.com/lydell/resolve-url#deprecated
[2022-04-13T05:54:51.095Z] node_tests_1 | npm WARN deprecated source-map-url@0.4.1: See https://github.com/lydell/source-map-url#deprecated
[2022-04-13T05:54:51.095Z] node_tests_1 | npm WARN deprecated urix@0.1.0: Please see https://github.com/lydell/urix#deprecated
[2022-04-13T07:35:47.210Z] node_tests_1 |
[2022-04-13T07:35:47.211Z] node_tests_1 | > dtrace-provider@0.8.8 install /app/node_modules/dtrace-provider
[2022-04-13T07:35:47.211Z] node_tests_1 | > node-gyp rebuild || node suppress-error.js
...
That delay was again (see earlier comment https://github.com/elastic/apm-agent-nodejs/issues/2624#issuecomment-1081258866) approximately 100 minutes:
> t2 = new Date('2022-04-13T07:35:47.210Z')
2022-04-13T07:35:47.210Z
> t1 = new Date('2022-04-13T05:54:51.095Z')
2022-04-13T05:54:51.095Z
> t2 - t1
6056115
> (t2 - t1) / 1000 / 60
100.93525
and it was in the same part of the npm install. Could it be the binary build step for the 'dtrace-provider' npm package? That consistent ~100 minutes suggests a hardcoded timeout somewhere.
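A possible way to test that hypothesis (a sketch only; the node:14 image tag is an assumption and should match the failing step) is to time just the dtrace-provider install, with verbose npm logging, in the same kind of container the CI step uses:
# Hypothetical repro of only the suspect install step.
docker run --rm node:14 bash -c \
  'mkdir /tmp/repro && cd /tmp/repro && time npm install --loglevel verbose dtrace-provider@0.8.8'
If the stall reproduces there, that would point at node-gyp or the registry fetch for that package rather than anything in our test harness.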
We detected what appears to be a hang in the test suite when running .ci/scripts/test.sh -b "release" -t "apollo-server-express" "8": https://apm-ci.elastic.co/blue/organizations/jenkins/apm-agent-nodejs%2Fapm-agent-nodejs-mbp/detail/main/95/pipeline/1406
Relevant log entries: