Well, certainly we can try bumping up the versions we're running in our next release and see if that fixes the issue, but without a consistent way to reproduce the problem there's not a whole lot else we can do with any kind of intelligence.
Upgrading versions sounds good.
In the meantime, I could test against my server (ODK Central master as of 2h ago) and see whether I can drive the number of open db connections up significantly. That could lead us to the source of the stale db connections.
Is there a way for me to see the knex logs on the running server?
Server logs:
Mobile devices auto-checking for new forms (ca. 50 devices, checking every 15 mins):
<IP> - - [23/Nov/2020:05:06:06 +0000] "GET /v1/key/<key>/projects/1/formList HTTP/1.1" 200 3376 "-" "org.odk.collect.android/v1.28.4 Dalvik/2.1.0 (Linux; U; Android 10; SM-T290 Build/QP1A.190711.020)"
ruODK package tests: 92 HTTP requests within about a minute (times 15 environments when run through CI)
Thanks as always for the detailed report, @florianm, and sorry you're running into this. @issa-tseng and I spent quite a bit of time hunting down a similar issue over the summer. @issa-tseng ended up writing a knex patch for v0.21.3: https://github.com/knex/knex/pull/3900. We currently use v0.21.4 as of Central v1.0.0. So as you say, you are likely still vulnerable to this issue.
Though I see that knex has seen a few updates since then, unfortunately none seem related to this.
Do you know how long the server had been up before this occurred?
No trouble, @ln! I remember seeing that the ODK Central instance with the clogged pool had been up for four days. Our servers get auto-patched and restarted on Wednesdays, so that would explain the uptime.
Usage since reboot:
Is any of the above likely to cause the congestion?
Our servers get auto-patched and restarted on Wednesdays
👍
Is any of the above likely to cause the congestion?
It's likely some kind of special condition that happens rarely and you're hitting it because the level of load on your server just provides more opportunities. It's most likely to be related to actions that involve downloading data (ruODK tests, ETL, colleague access). The last knex bug seemed to affect Central when the client connection was closed while requesting submissions. For example, that could be if a user requested a submission table view and closed the browser window before the data fully reached the browser or if a user accessed the OData feed from an unreliable connection. It would have to happen ~30 times for the connection pool to fill. I don't think it's likely that the cause is identical but maybe this gives you a feel for how difficult it might be to hunt down.
Would you say the server has been receiving this level of load for some time?
I doubt there will be useful logs to look at but there may be some database queries that could give us insights. We will discuss and get back to you.
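Off the top of my head, something along these lines run directly against your external Postgres might show whether connections are piling up and who is holding them (just a sketch, assuming stock Postgres and direct query access):

```sql
-- Summarise open connections by role and state; a clogged pool shows up as
-- many long-lived connections held by the Central backend's database user.
SELECT usename,
       state,
       count(*)                  AS connections,
       max(now() - state_change) AS longest_in_state
FROM pg_stat_activity
GROUP BY usename, state
ORDER BY connections DESC;
```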
In the meantime, I think it would be somewhat insightful to see whether it happens again within about a week, if you're up for leaving the server running and being ready to restart it in case of issues. Alternatively, if you can't afford any downtime, your best bet is to schedule another service restart around Saturday. You may also consider separating the ruODK unit tests from production, and perhaps we could talk about a getodk box for that.
Thanks for the context @lognaturel!
My ETL scripts download all submissions from the server via OData, then media attachments via REST, and the R call (using package httr) has retry set to three attempts.
The ruODK unit tests seem to run without hiccups but still default to three retries.
I'm OK to monitor the server and restart when needed. (Reminds me of that custom Sentry host setting I wanted to PR so I could error log to our own Sentry instance.)
Let me know if there's a debug process or SQL log I could attempt from my end.
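For instance, would it help if I ran something like this against pg_stat_activity to list connections that look stuck and what they last ran (a rough sketch, assuming I can query the external Postgres directly)?

```sql
-- Non-idle connections that have been on the same statement for a while,
-- plus the last statement they executed, as a hint at which requests leak.
SELECT pid,
       usename,
       state,
       now() - query_start  AS query_age,
       now() - state_change AS time_in_state,
       left(query, 120)     AS last_query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND query_start < now() - interval '5 minutes'
ORDER BY query_start;
```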
Interesting side effect: we set our ODK Collect devices to auto-submit and then delete forms. With the db congestion on the ODK Central side, the tablets were uploading about one submission per hour; they must have been auto-retrying all the time. We now see "stuck" forms, i.e. Collect's internal db is out of sync with the file system. My local coordinator reports e.g.:
Tablet 1 – when you plug it in and look in the instances folder, it’s empty, yet the tablet says there are 4 stuck forms.
Tablet 2 – when you plug it in, there are 3 folders within the instances folder and I cannot copy them individually or together. The tablet says there are 8 stuck forms.
KnexTimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call? at Client_MySQL.acquireConnection (/usr/local/rand/api/node_modules/knex/lib/client.js:347:26) {sql: undefined, bindings: undefined}
@dadaocongsanvietnam I think you're either looking for the knex GitHub project or for Stack Overflow; Central does not use MySQL.
@florianm Does your silence on this mean it did not happen again despite a busy week of data collection? That would certainly be good news.
At some point it would be great to learn more about your external database. What version of postgres? How is it hosted and configured? In particular, is it something like RDS or another managed system? I learned from @issa-tseng that some of these are "Postgres-compatible" and may not actually have identical implementations. In particular, it seems like RDS has differences in how transactions are implemented.
We now see "stuck" forms i.e. the internal db is out of sync with the file system
As far as you can tell, was there any data lost? Or was it that some things that should have been cleaned up weren't?
@ln sorry for the lag, just catching up now. We haven't had the problem again.
Our db is a dedicated Azure PostgreSQL 11.6 instance; I'm unsure of its size/specs. So it's the real deal, not a compatible derivative.
Re incomplete submissions, I can't tell from my remote diagnostics whether data was lost. We do get some "ghost submissions" sometimes and have to reset Collect via its admin settings to clear the counters in the top-level menu.
At this point, I can't offer reproducible conditions and haven't encountered the issue again. I'll chase up details of the hosted db and report back here.
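If I can get query access to the instance, some of the basics should be easy to pull directly, e.g. (sketch):

```sql
-- Exact server build plus a couple of capacity-related settings.
SELECT version();
SHOW max_connections;
SHOW shared_buffers;
```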
I haven't run into the problem since and will close this issue as resolved. The context and explanations from @ln will be very valuable for others possibly getting stuck with the same problem.
That's great to hear! And the server has been under the same load?
No one reported any problems, so far so good! The server is one of two production servers, currently running:
versions:
647569c54f6bbf26ea356eca0d14f7e5d1a89c6b
cddb691e40e84aabff87b9d427e22a50282d6f99 client (v1.1.2)
a33bc6fb3c34fe38894b0e9d0bb404f81da325e6 server (v1.1.1)
Our ETL runs near daily and scrapes
<ODKC Turtle Data> accessed on 2021-03-08 20:12:20
Areas: 15
Sites: 121
Survey start points: 1315
Survey end points: 1275
Marine Wildlife Incidents (rescues, strandings): 144
Live sightings: 2
Turtle Tracks or Nests: 37556
Turtle Track Tallies: 2
with all attachments (downloads only if new). There are a handful of other projects with far fewer submissions and access traffic as well. ruODK unit tests run daily. I would expect fewer users to log in day to day now, as all data pipelines are automated.
Problem
Users can't log into ODK Central 1.0.1; login times out with a diplomatic "the server received an invalid error" message.
docker-compose logs --tail=100 -f
shows, on login attempts, what looks like too many open db connections timing out.
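A quick check on the database side (a sketch, assuming direct query access to the external Postgres) is to compare the number of open connections with the server's ceiling:

```sql
-- How many connections are currently open versus the configured limit.
SELECT count(*)                           AS open_connections,
       current_setting('max_connections') AS max_connections
FROM pg_stat_activity;
```

Though the limit that actually bites here may be the knex pool on the Central side rather than Postgres's max_connections, so this mostly rules one side out.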
Environment
ODK Central 1.0.1 running via docker-compose, using a custom db and mail server. The server is used for ruODK unit tests from GH Actions and AppVeyor. This means that 15 ruODK instances send the same 461 unit tests at 12:00 daily; the requests from the instances are staggered by their differing build times. The server is also used for production campaigns, receiving a few hundred records daily within a few hours in the late morning.
Solution
I didn't have much time to debug this, so I upgraded and restarted ODK Central, which fixed the issue by resetting the db connection pool. I am not sure whether the root cause of the error is addressed, though.
Working versions:
Error search
I'm using an external Postgres instance (internal policy; it's backed up and has plenty of storage and grunt). The config has one extra parameter:
"ssl": {"rejectUnauthorized": false}
The ODK Central backend uses knex 0.21.
The same issue has been reported by others; a fix was reported after upgrading knex to 0.21.1 and pg to 8.0.3, as described here.
The only other mention of the knex connection pool is at https://github.com/getodk/central-backend/issues/255#issuecomment-606228073.
Is this info enough to triage?