getodk / central-backend

Node.js based backend for ODK Central
https://docs.getodk.org/central-intro/
Apache License 2.0

Tests hang intermittently #1208

Closed by matthew-white 1 month ago

matthew-white commented 1 month ago

In CircleCI, make test-ci sometimes hangs, for example, here. It always seems to hang at the same place:

  worker
    runner @slow
      ✔ should return false and do nothing if no event is given
      ✔ should return false and do nothing if no jobs match the event
[2024-09-25T21:26:34.284Z] start processing event test.event::--::undefined (0 jobs)
[2024-09-25T21:26:34.284Z] finish processing event test.event::--::undefined
      ✔ should return true if a job is matched
[2024-09-25T21:26:34.285Z] start processing event test.event::--::undefined (1 jobs)
[2024-09-25T21:26:34.551Z] finish processing event test.event::--::undefined
      ✔ should pass the container and event details to the job (268ms)
make: *** [Makefile:88: test-ci] Terminated

Too long with no output (exceeded 10m0s): context deadline exceeded

I'm also seeing this happen intermittently when I run make test locally.

The following is the last test to pass and the test that follows it. That said, I'm not sure whether the issue is with the test that follows, with some Mocha after-hook, or with something else.

https://github.com/getodk/central-backend/blob/82696a053c87eeb30319c16193223a5fad7b83cf/test/integration/worker/worker.js#L39-L69

I don't see that we changed anything in lib/worker/worker.js in v2024.2: https://github.com/getodk/central-backend/compare/v2024.1.0...v2024.2.0. That makes me suspect that the problem isn't with the worker queue or with the test. However, there is this comment in the test file that might be related:

https://github.com/getodk/central-backend/blob/82696a053c87eeb30319c16193223a5fad7b83cf/test/integration/worker/worker.js#L13-L14

matthew-white commented 1 month ago

I'm noticing a pattern that suggests S3 is involved. When tests hang, I see the following 6 tests fail beforehand. I've seen this both locally and in CircleCI. (You can see these errors in the CircleCI build that I linked to above.)

        ✔ should not allow failure to affect previous or future uploads
Uploading 0 blobs...
[2024-10-02T18:03:18.325Z] Upload completed.
        1) should not attempt to upload an in-progress blob
        with limit
(node:11432) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 notice listeners added to [Client]. MaxListeners is 10. Use emitter.setMaxListeners() to increase limit
          2) should upload requested number of blobs, and ignore others
          3) should not complain if blob count is less than limit
(node:11432) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 notice listeners added to [Client]. MaxListeners is 10. Use emitter.setMaxListeners() to increase limit
          4) should upload all blobs if limit is zero

...

    ✔ should process the given logs (47ms)
::ffff:127.0.0.1 - - [02/Oct/2024:18:03:29 +0000] "POST /v1/sessions HTTP/1.1" 200 228
::ffff:127.0.0.1 - - [02/Oct/2024:18:03:29 +0000] "POST /v1/projects/1/forms?publish=true HTTP/1.1" 200 436
::ffff:127.0.0.1 - - [02/Oct/2024:18:03:29 +0000] "POST /v1/projects/1/submission HTTP/1.1" 201 167
::ffff:127.0.0.1 - - [02/Oct/2024:18:03:29 +0000] "GET /v1/projects/1/forms/audits/submissions/one/attachments HTTP/1.1" 200 36
    5) should process the given logs if already uploaded to s3
::ffff:127.0.0.1 - - [02/Oct/2024:18:03:31 +0000] "POST /v1/sessions HTTP/1.1" 200 228
    6) should not reprocess already-processed logs

In other words, the hang is intermittent, and whenever it happens, it is preceded by these same intermittent test failures.

The first 4 tests are from test/integration/task/s3.js, and the last 2 are from test/integration/worker/submission.attachment.update.js.

This is what the tests look like when they pass. Note that MaxListenersExceededWarning is logged even when the tests pass, so I don't think that's necessarily part of the problem here. One difference I am seeing: when the tests pass, the log shows "Uploading 1 blobs..." followed by "Uploading 0 blobs...", but when they fail, there's just "Uploading 0 blobs..."

        ✔ should not allow failure to affect previous or future uploads
Uploading 1 blobs...
Uploading 0 blobs...
[2024-10-02T18:58:31.898Z] Upload completed.
[2024-10-02T18:58:31.903Z] Upload completed.
        ✔ should not attempt to upload an in-progress blob (210ms)
        with limit
(node:46370) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 notice listeners added to [Client]. MaxListeners is 10. Use emitter.setMaxListeners() to increase limit
          ✔ should upload requested number of blobs, and ignore others
          ✔ should not complain if blob count is less than limit
(node:46370) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 notice listeners added to [Client]. MaxListeners is 10. Use emitter.setMaxListeners() to increase limit
          ✔ should upload all blobs if limit is zero

...

    ✔ should process the given logs (64ms)
::ffff:127.0.0.1 - - [02/Oct/2024:18:58:36 +0000] "POST /v1/sessions HTTP/1.1" 200 228
::ffff:127.0.0.1 - - [02/Oct/2024:18:58:36 +0000] "POST /v1/projects/1/forms?publish=true HTTP/1.1" 200 436
::ffff:127.0.0.1 - - [02/Oct/2024:18:58:36 +0000] "POST /v1/projects/1/submission HTTP/1.1" 201 167
::ffff:127.0.0.1 - - [02/Oct/2024:18:58:36 +0000] "GET /v1/projects/1/forms/audits/submissions/one/attachments HTTP/1.1" 200 36
    ✔ should process the given logs if already uploaded to s3 (52ms)
::ffff:127.0.0.1 - - [02/Oct/2024:18:58:36 +0000] "POST /v1/sessions HTTP/1.1" 200 228
::ffff:127.0.0.1 - - [02/Oct/2024:18:58:36 +0000] "POST /v1/projects/1/forms?publish=true HTTP/1.1" 200 436
::ffff:127.0.0.1 - - [02/Oct/2024:18:58:36 +0000] "POST /v1/projects/1/submission HTTP/1.1" 201 167
::ffff:127.0.0.1 - - [02/Oct/2024:18:58:36 +0000] "GET /v1/projects/1/forms/audits/submissions/one/attachments HTTP/1.1" 200 36
::ffff:127.0.0.1 - - [02/Oct/2024:18:58:36 +0000] "POST /v1/projects/1/submission HTTP/1.1" 201 167
    ✔ should not reprocess already-processed logs (57ms)

@alxndrsn, would you mind taking a look when you get the chance?

alxndrsn commented 1 month ago

I don't see that we changed anything in lib/worker/worker.js in v2024.2: https://github.com/getodk/central-backend/compare/v2024.1.0...v2024.2.0.

The first failing test uses testTaskFullTrx(), which was introduced in that window:

https://github.com/getodk/central-backend/compare/v2024.1.0...v2024.2.0#diff-b1e2d1b1bebb9d6ccc35c71625911ea3a802a4e05363b425d30e396410b06f9dR195-R202

alxndrsn commented 1 month ago

To reproduce locally:

diff --git a/test/integration/task/s3.js b/test/integration/task/s3.js
index 2e7b8789..70be203a 100644
--- a/test/integration/task/s3.js
+++ b/test/integration/task/s3.js
@@ -25,7 +25,7 @@ const assertThrowsAsync = async (fn, expected) => {
   }
 };

-describe('task: s3', () => {
+describe.only('task: s3', () => {
   describe('s3 disabled', () => {
     it('uploadPending() should fail', async () => {
       await assertThrowsAsync(() => uploadPending(), 'S3 blob support is not enabled.');
@@ -179,7 +179,7 @@ describe('task: s3', () => {
           });
           original.apply(global.s3, args);
         };
-        await aBlobExistsWith(container, { status: 'pending' });
+        //await aBlobExistsWith(container, { status: 'pending' });

         // when
         const first = uploadPending();
diff --git a/test/integration/worker/submission.attachment.update.js b/test/integration/worker/submission.attachment.update.js
index fa419ff0..8b425871 100644
--- a/test/integration/worker/submission.attachment.update.js
+++ b/test/integration/worker/submission.attachment.update.js
@@ -5,7 +5,7 @@ const { testService } = require('../setup');
 const testData = require(appRoot + '/test/data/xml.js');
 const worker = require(appRoot + '/lib/worker/submission.attachment.update');

-describe('worker: submission.attachment.update', () => {
+describe.only('worker: submission.attachment.update', () => {
   it('should do nothing if the attachment is not a client audit', testService((service, container) =>
     service.login('alice', (asAlice) =>
       asAlice.post('/v1/projects/1/forms?publish=true')
diff --git a/test/integration/worker/worker.js b/test/integration/worker/worker.js
index 8003c24b..c8c613a8 100644
--- a/test/integration/worker/worker.js
+++ b/test/integration/worker/worker.js
@@ -8,7 +8,7 @@ const { workerQueue } = require(appRoot + '/lib/worker/worker');
 const { Audit } = require(appRoot + '/lib/model/frames');
 const { insert } = require(appRoot + '/lib/util/db');

-describe('worker', () => {
+describe.only('worker', () => {
   describe('runner @slow', () => {
     // we know reschedule is getting called at some point in these flows because
     // these tests would hang otherwise.
ktuite commented 1 month ago

Reopening because the merged PR does not fix the test failure, but does: