Well, certainly we can try bumping up the versions we're running in our next release and see if that fixes the issue, but without a consistent way to reproduce the problem there's not a whole lot else we can do with any kind of intelligence.
Upgrading versions sounds good.
In the meantime, I could test against my server (ODK Central master as of 2h ago) and see whether I can drive the number of open db connections up significantly. That could lead us to the source of the stale db connections.
Is there a way for me to see the knex logs on the running server?
Server logs:
Mobile devices auto-checking for new forms (ca. 50 devices, checking every 15 mins):
<IP> - - [23/Nov/2020:05:06:06 +0000] "GET /v1/key/<key>/projects/1/formList HTTP/1.1" 200 3376 "-" "org.odk.collect.android/v1.28.4 Dalvik/2.1.0 (Linux; U; Android 10; SM-T290 Build/QP1A.190711.020)"
ruODK package tests: 92 HTTP requests within about a minute (times 15 environments when run through CI)
Thanks as always for the detailed report, @florianm, and sorry you're running into this. @issa-tseng and I spent quite a bit of time hunting down a similar issue over the summer. @issa-tseng ended up writing a knex patch for v0.21.3: https://github.com/knex/knex/pull/3900. We currently use v0.21.4 as of Central v1.0.0. So as you say, you are likely still vulnerable to this issue.
Though I see that knex has seen a few updates since then, unfortunately none seem related to this.
Do you know how long the server had been up before this occurred?
No trouble, @ln! I remember seeing that the ODK Central instance with the clogged pool had been up for four days. Our servers get auto-patched and restarted on Wednesdays, so that would explain the uptime.
Usage since reboot:
Is any of the above likely to cause the congestion?
Our servers get auto-patched and restarted on Wednesdays
👍
Is any of the above likely to cause the congestion?
It's likely some kind of special condition that happens rarely and you're hitting it because the level of load on your server just provides more opportunities. It's most likely to be related to actions that involve downloading data (ruODK tests, ETL, colleague access). The last knex bug seemed to affect Central when the client connection was closed while requesting submissions. For example, that could be if a user requested a submission table view and closed the browser window before the data fully reached the browser or if a user accessed the OData feed from an unreliable connection. It would have to happen ~30 times for the connection pool to fill. I don't think it's likely that the cause is identical but maybe this gives you a feel for how difficult it might be to hunt down.
Would you say the server has been receiving this level of load for some time?
I doubt there will be useful logs to look at but there may be some database queries that could give us insights. We will discuss and get back to you.
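Off the top of my head, something along these lines run directly against your external Postgres might show whether connections are piling up and who is holding them (just a sketch, assuming stock Postgres and direct query access):

```sql
-- Summarise open connections by role and state; a clogged pool shows up as
-- many long-lived connections held by the Central backend's database user.
SELECT usename,
       state,
       count(*)                  AS connections,
       max(now() - state_change) AS longest_in_state
FROM pg_stat_activity
GROUP BY usename, state
ORDER BY connections DESC;
```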
In the meantime, I think it would be somewhat insightful to see whether it happens again within about a week, if you're up for leaving the server running and being ready to restart it in case of issues. Alternatively, if you can't afford any downtime, your best bet is to schedule another service restart around Saturday. You may also consider separating the ruODK unit tests from production, and perhaps we could talk about a getodk box for that.
Thanks for the context @lognaturel!
My ETL scripts download all submissions from the server via OData, then media attachments via REST, and the R call (using package httr) has retry set to three attempts.
The ruODK unit tests seem to run without hiccups but still default to three retries.
I'm OK to monitor the server and restart when needed. (Reminds me of that custom Sentry host setting I wanted to PR so I could error log to our own Sentry instance.)
Let me know if there's a debug process or SQL log I could attempt from my end.
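For instance, would it help if I ran something like this against pg_stat_activity to list connections that look stuck and what they last ran (a rough sketch, assuming I can query the external Postgres directly)?

```sql
-- Non-idle connections that have been on the same statement for a while,
-- plus the last statement they executed, as a hint at which requests leak.
SELECT pid,
       usename,
       state,
       now() - query_start  AS query_age,
       now() - state_change AS time_in_state,
       left(query, 120)     AS last_query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND query_start < now() - interval '5 minutes'
ORDER BY query_start;
```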
Interesting side effect: we set our ODK Collect devices to auto-submit and then delete forms. With the db congestion on the ODK Central side, the tablets were uploading about one submission per hour; they must have been auto-retrying all the time. We now see "stuck" forms, i.e. Collect's internal db is out of sync with the file system. My local coordinator reports e.g.:
Tablet 1 – when you plug it in and look in the instances folder, it’s empty, yet the tablet says there are 4 stuck forms.
Tablet 2 – when you plug it in, there are 3 folders within the instances folder and I cannot copy them individually or together. The tablet says there are 8 stuck forms.
KnexTimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call? at Client_MySQL.acquireConnection (/usr/local/rand/api/node_modules/knex/lib/client.js:347:26) {sql: undefined, bindings: undefined}
@dadaocongsanvietnam I think you're either looking for the knex GitHub project or for Stack Overflow; Central does not use MySQL.
@florianm Does your silence on this mean it did not happen again despite a busy week of data collection? That would certainly be good news.
At some point it would be great to learn more about your external database. What version of postgres? How is it hosted and configured? In particular, is it something like RDS or another managed system? I learned from @issa-tseng that some of these are "Postgres-compatible" and may not actually have identical implementations. In particular, it seems like RDS has differences in how transactions are implemented.
We now see "stuck" forms i.e. the internal db is out of sync with the file system
As far as you can tell, was there any data lost? Or was it that some things that should have been cleaned up weren't?
@ln sorry for the lag, just catching up now. We haven't had the problem again.
Our db is a dedicated Azure PostgreSQL 11.6 instance; I'm unsure of its size/specs. So it's the real deal, not a compatible derivative.
Re incomplete submissions, I can't tell from my remote diagnostics whether data was lost. We do get some "ghost submissions" sometimes and have to reset Collect via its admin settings to clear the counters in the top-level menu.
At this point, I can't offer reproducible conditions and haven't encountered the issue again. I'll chase up details of the hosted db and report back here.
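If I can get query access to the instance, some of the basics should be easy to pull directly, e.g. (sketch):

```sql
-- Exact server build plus a couple of capacity-related settings.
SELECT version();
SHOW max_connections;
SHOW shared_buffers;
```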
I haven't run into the problem since and will close this issue as resolved. The context and explanations from @ln will be very valuable for others possibly getting stuck with the same problem.
That's great to hear! And the server has been under the same load?
No one reported any problems, so far so good! The server is one of two production servers, currently running:
versions:
647569c54f6bbf26ea356eca0d14f7e5d1a89c6b
cddb691e40e84aabff87b9d427e22a50282d6f99 client (v1.1.2)
a33bc6fb3c34fe38894b0e9d0bb404f81da325e6 server (v1.1.1)
Our ETL runs near daily and scrapes
<ODKC Turtle Data> accessed on 2021-03-08 20:12:20
Areas: 15
Sites: 121
Survey start points: 1315
Survey end points: 1275
Marine Wildlife Incidents (rescues, strandings): 144
Live sightings: 2
Turtle Tracks or Nests: 37556
Turtle Track Tallies: 2
with all attachments (downloads only if new). There are a handful of other projects with far fewer submissions and access traffic as well. ruODK unit tests run daily. I would expect fewer users to log in day to day now, as all data pipelines are automated.
Problem
Users can't log into ODK Central 1.0.1; login times out with a diplomatic "the server received an invalid error" message.
docker-compose logs --tail=100 -f
shows, on login attempts, what looks like too many open db connections timing out.
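A quick check on the database side (a sketch, assuming direct query access to the external Postgres) is to compare the number of open connections with the server's ceiling:

```sql
-- How many connections are currently open versus the configured limit.
SELECT count(*)                           AS open_connections,
       current_setting('max_connections') AS max_connections
FROM pg_stat_activity;
```

Though the limit that actually bites here may be the knex pool on the Central side rather than Postgres's max_connections, so this mostly rules one side out.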
Environment
ODK Central 1.0.1 running via docker-compose, using a custom db and mail server. The server is used for ruODK unit tests from GH Actions and AppVeyor. This means that 15 ruODK instances send the same 461 unit tests at 12:00 daily; the requests from the instances are staggered by their differing build times. The server is also used for production campaigns, receiving a few hundred records daily within a few hours in the late morning.
Solution
I didn't have much time to debug this, so I upgraded and restarted ODK Central, which fixed the issue by resetting the db connection pool. I am not sure whether the root cause of the error is addressed, though.
Working versions:
Error search
I'm using an external Postgres instance (internal policy; it's backed up and has plenty of storage and grunt). The config has one extra parameter:
"ssl": {"rejectUnauthorized": false}
The ODK Central backend uses knex 0.21.
The same issue has been reported by others; a fix was reported after upgrading knex to 0.21.1 and pg to 8.0.3, as described here.
The only other mention of the knex connection pool is at https://github.com/getodk/central-backend/issues/255#issuecomment-606228073.
Is this info enough to triage?