Closed: matthew-white closed this 1 month ago
I'm noticing a pattern that suggests something involving S3. Whenever the tests hang, the following 6 tests fail, and I've seen that both locally and in CircleCI. (The errors are visible in the CircleCI build that I linked to above.)
✔ should not allow failure to affect previous or future uploads
Uploading 0 blobs...
[2024-10-02T18:03:18.325Z] Upload completed.
1) should not attempt to upload an in-progress blob
with limit
(node:11432) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 notice listeners added to [Client]. MaxListeners is 10. Use emitter.setMaxListeners() to increase limit
2) should upload requested number of blobs, and ignore others
3) should not complain if blob count is less than limit
(node:11432) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 notice listeners added to [Client]. MaxListeners is 10. Use emitter.setMaxListeners() to increase limit
4) should upload all blobs if limit is zero
...
✔ should process the given logs (47ms)
::ffff:127.0.0.1 - - [02/Oct/2024:18:03:29 +0000] "POST /v1/sessions HTTP/1.1" 200 228
::ffff:127.0.0.1 - - [02/Oct/2024:18:03:29 +0000] "POST /v1/projects/1/forms?publish=true HTTP/1.1" 200 436
::ffff:127.0.0.1 - - [02/Oct/2024:18:03:29 +0000] "POST /v1/projects/1/submission HTTP/1.1" 201 167
::ffff:127.0.0.1 - - [02/Oct/2024:18:03:29 +0000] "GET /v1/projects/1/forms/audits/submissions/one/attachments HTTP/1.1" 200 36
5) should process the given logs if already uploaded to s3
::ffff:127.0.0.1 - - [02/Oct/2024:18:03:31 +0000] "POST /v1/sessions HTTP/1.1" 200 228
6) should not reprocess already-processed logs
In other words, tests hang intermittently, and when that happens, it's preceded by intermittent test failures.
The first 4 tests are from test/integration/task/s3.js, and the last 2 are from test/integration/worker/submission.attachment.update.js.
This is what the tests look like when they pass. Note that MaxListenersExceededWarning is logged even when the tests pass, so I don't think that's necessarily part of the problem here. One difference I am seeing: when the tests pass, there's "Uploading 1 blobs...", but when they fail, there's just "Uploading 0 blobs..."
✔ should not allow failure to affect previous or future uploads
Uploading 1 blobs...
Uploading 0 blobs...
[2024-10-02T18:58:31.898Z] Upload completed.
[2024-10-02T18:58:31.903Z] Upload completed.
✔ should not attempt to upload an in-progress blob (210ms)
with limit
(node:46370) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 notice listeners added to [Client]. MaxListeners is 10. Use emitter.setMaxListeners() to increase limit
✔ should upload requested number of blobs, and ignore others
✔ should not complain if blob count is less than limit
(node:46370) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 notice listeners added to [Client]. MaxListeners is 10. Use emitter.setMaxListeners() to increase limit
✔ should upload all blobs if limit is zero
...
✔ should process the given logs (64ms)
::ffff:127.0.0.1 - - [02/Oct/2024:18:58:36 +0000] "POST /v1/sessions HTTP/1.1" 200 228
::ffff:127.0.0.1 - - [02/Oct/2024:18:58:36 +0000] "POST /v1/projects/1/forms?publish=true HTTP/1.1" 200 436
::ffff:127.0.0.1 - - [02/Oct/2024:18:58:36 +0000] "POST /v1/projects/1/submission HTTP/1.1" 201 167
::ffff:127.0.0.1 - - [02/Oct/2024:18:58:36 +0000] "GET /v1/projects/1/forms/audits/submissions/one/attachments HTTP/1.1" 200 36
✔ should process the given logs if already uploaded to s3 (52ms)
::ffff:127.0.0.1 - - [02/Oct/2024:18:58:36 +0000] "POST /v1/sessions HTTP/1.1" 200 228
::ffff:127.0.0.1 - - [02/Oct/2024:18:58:36 +0000] "POST /v1/projects/1/forms?publish=true HTTP/1.1" 200 436
::ffff:127.0.0.1 - - [02/Oct/2024:18:58:36 +0000] "POST /v1/projects/1/submission HTTP/1.1" 201 167
::ffff:127.0.0.1 - - [02/Oct/2024:18:58:36 +0000] "GET /v1/projects/1/forms/audits/submissions/one/attachments HTTP/1.1" 200 36
::ffff:127.0.0.1 - - [02/Oct/2024:18:58:36 +0000] "POST /v1/projects/1/submission HTTP/1.1" 201 167
✔ should not reprocess already-processed logs (57ms)
@alxndrsn, would you mind taking a look when you get the chance?
I don't see that we changed anything in lib/worker/worker.js in v2024.2: https://github.com/getodk/central-backend/compare/v2024.1.0...v2024.2.0.
The first failing test uses testTaskFullTrx(), which was introduced in that window:
To reproduce locally:
diff --git a/test/integration/task/s3.js b/test/integration/task/s3.js
index 2e7b8789..70be203a 100644
--- a/test/integration/task/s3.js
+++ b/test/integration/task/s3.js
@@ -25,7 +25,7 @@ const assertThrowsAsync = async (fn, expected) => {
}
};
-describe('task: s3', () => {
+describe.only('task: s3', () => {
describe('s3 disabled', () => {
it('uploadPending() should fail', async () => {
await assertThrowsAsync(() => uploadPending(), 'S3 blob support is not enabled.');
@@ -179,7 +179,7 @@ describe('task: s3', () => {
});
original.apply(global.s3, args);
};
- await aBlobExistsWith(container, { status: 'pending' });
+ //await aBlobExistsWith(container, { status: 'pending' });
// when
const first = uploadPending();
diff --git a/test/integration/worker/submission.attachment.update.js b/test/integration/worker/submission.attachment.update.js
index fa419ff0..8b425871 100644
--- a/test/integration/worker/submission.attachment.update.js
+++ b/test/integration/worker/submission.attachment.update.js
@@ -5,7 +5,7 @@ const { testService } = require('../setup');
const testData = require(appRoot + '/test/data/xml.js');
const worker = require(appRoot + '/lib/worker/submission.attachment.update');
-describe('worker: submission.attachment.update', () => {
+describe.only('worker: submission.attachment.update', () => {
it('should do nothing if the attachment is not a client audit', testService((service, container) =>
service.login('alice', (asAlice) =>
asAlice.post('/v1/projects/1/forms?publish=true')
diff --git a/test/integration/worker/worker.js b/test/integration/worker/worker.js
index 8003c24b..c8c613a8 100644
--- a/test/integration/worker/worker.js
+++ b/test/integration/worker/worker.js
@@ -8,7 +8,7 @@ const { workerQueue } = require(appRoot + '/lib/worker/worker');
const { Audit } = require(appRoot + '/lib/model/frames');
const { insert } = require(appRoot + '/lib/util/db');
-describe('worker', () => {
+describe.only('worker', () => {
describe('runner @slow', () => {
// we know reschedule is getting called at some point in these flows because
// these tests would hang otherwise.
Reopening because the merged PR does not fix the test failure, but does:
In CircleCI, make test-ci sometimes hangs, for example, here. It always seems to hang at the same place. I'm also seeing this happen intermittently when I run make test locally. The following is the last test to pass and the test that follows it. That said, I'm not sure whether the issue is with the test that follows, with some Mocha after-hook, or with something else.
https://github.com/getodk/central-backend/blob/82696a053c87eeb30319c16193223a5fad7b83cf/test/integration/worker/worker.js#L39-L69
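The after-hook suspicion comes down to Node's event loop: the process (and so the Mocha run) stays alive as long as any ref'd handle (a timer, socket, or open DB connection) remains. A minimal sketch, unrelated to the actual suite, showing how a handle keeps the loop alive and how unref() releases it:

```javascript
// A pending timer keeps the Node process alive; a test suite hangs the
// same way when an after-hook leaves a DB connection or worker loop
// open. unref() marks this handle as not keeping the loop alive, so
// the process can exit once real work is done.
const handle = setInterval(() => {}, 60 * 1000);
handle.unref();

// hasRef() reports whether the handle would keep the loop alive.
console.log(handle.hasRef()); // false
```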
I don't see that we changed anything in lib/worker/worker.js in v2024.2: https://github.com/getodk/central-backend/compare/v2024.1.0...v2024.2.0. That makes me suspect that the problem isn't with the worker queue or with the test. However, there is this comment in the test file that might be related:
https://github.com/getodk/central-backend/blob/82696a053c87eeb30319c16193223a5fad7b83cf/test/integration/worker/worker.js#L13-L14