botpress / v12

Botpress OSS – v12
https://v12.botpress.com
GNU Affero General Public License v3.0
76 stars 87 forks source link

[BUG] Botpress is not able to connect to redis because the server is restarting #1435

Closed poojapatil12 closed 3 years ago

poojapatil12 commented 3 years ago

Describe the bug

After BP tried to execute an SQL Query and the manager (Patroni) was not able to obtain a lock on the underlying Postgres database, this triggered the manager to restart the database. At that time, we were still trying to acquire a connection, but it failed because the DB was not ready. And that caused the connection timeout

  1. Have the setup: 10 gb ram , 4 cpu , one language server – English , locally run language server, NLU, and duckling, Database is PostgreSQL (1 PostgreSQL r/w with no failover, 1 Botpress kube setup) and BFS_STORAGE is database
  2. Load testing and seeing DB connection timing out:

Error: connect ECONNREFUSED 10.233.11.255:5432 LogDbPersister Error persisting messages [TimeoutError, Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?]

Environment (please complete the following information):

Additional context

PostgreSQL DB LOGS:

2021-08-23 06:39:18.123 UTC [2495] HINT: In a moment you should be able to reconnect to the database and repeat your command. 2021-08-23 06:39:18.124 UTC [15223] LOG: archiver process (PID 15901) exited with exit code 1 2021-08-23 06:39:18.601 UTC [15223] LOG: database system is shut down 2021-08-23 06:39:18,643 INFO: demoted self because failed to update leader lock in DCS 2021-08-23 06:39:18,651 INFO: closed patroni connection to the postgresql cluster 2021-08-23 06:39:18,876 INFO: postmaster pid=32350 /tmp:5432 - no response 2021-08-23 06:39:18.917 UTC [32350] LOG: pgaudit extension initialized 2021-08-23 06:39:18.918 UTC [32350] LOG: starting PostgreSQL 12.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit 2021-08-23 06:39:18.918 UTC [32350] LOG: listening on IPv4 address "0.0.0.0", port 5432 2021-08-23 06:39:18.954 UTC [32350] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432" 2021-08-23 06:39:18.978 UTC [32350] LOG: listening on Unix socket "/crunchyadm/.s.PGSQL.5432" 2021-08-23 06:39:19.705 UTC [32352] LOG: database system was interrupted; last known up at 2021-08-23 06:35:34 UTC 2021-08-23 06:39:19.708 UTC [32354] FATAL: the database system is starting up 2021-08-23 06:39:19.708 UTC [32353] FATAL: the database system is starting up 2021-08-23 06:39:19.722 UTC [32355] FATAL: the database system is starting up 2021-08-23 06:39:19.726 UTC [32356] FATAL: the database system is starting up 2021-08-23 06:39:19.831 UTC [32357] FATAL: the database system is starting up 2021-08-23 06:39:19.890 UTC [32359] FATAL: the database system is starting up /tmp:5432 - rejecting connections 2021-08-23 06:39:19.896 UTC [32361] FATAL: the database system is starting up /tmp:5432 - rejecting connections 2021-08-23 06:39:20.236 UTC [32362] FATAL: the database system is starting up 2021-08-23 06:39:20.249 UTC [32363] FATAL: the database system is starting up 2021-08-23 06:39:20.252 UTC [32364] FATAL: the database system is starting up 2021-08-23 06:39:20.904 UTC [32366] FATAL: the database system is starting up /tmp:5432 - rejecting connections 2021-08-23 06:39:21.287 UTC [32367] FATAL: the database system is starting up 2021-08-23 06:39:21.287 UTC [32368] FATAL: the database system is starting up 2021-08-23 06:39:21.296 UTC [32369] FATAL: the database system is starting up 2021-08-23 06:39:21.301 UTC [32370] FATAL: the database system is starting up 2021-08-23 06:39:21.378 UTC [32371] FATAL: the database system is starting up 2021-08-23 06:39:21.388 UTC [32372] FATAL: the database system is starting up 2021-08-23 06:39:21.394 UTC [32373] FATAL: the database system is starting up 2021-08-23 06:39:21.398 UTC [32374] FATAL: the database system is starting up 2021-08-23 06:39:21.414 UTC [32375] FATAL: the database system is starting up 2021-08-23 06:39:21.819 UTC [32376] FATAL: the database system is starting up 2021-08-23 06:39:21.887 UTC [32377] FATAL: the database system is starting up 2021-08-23 06:39:21.903 UTC [32378] FATAL: the database system is starting up 2021-08-23 06:39:21.906 UTC [32379] FATAL: the database system is starting up 2021-08-23 06:39:21.912 UTC [32381] FATAL: the database system is starting up /tmp:5432 - rejecting connections 2021-08-23 06:39:22.242 UTC [32382] FATAL: the database system is starting up /tmp:5432 - rejecting connections 2021-08-23 06:39:22.921 UTC [32384] FATAL: the database system is starting up 2021-08-23 06:39:23.242 UTC [32385] FATAL: the database system is starting up 2021-08-23 06:39:23.334 UTC [32386] FATAL: the database system is starting up 2021-08-23 06:39:23.817 UTC [32387] FATAL: the database system is starting up 2021-08-23 06:39:23.929 UTC [32389] FATAL: the database system is starting up /tmp:5432 - rejecting connections 2021-08-23 06:39:24,463 INFO: Lock owner: pg-db-cluster-55f97f9dc8-ctqxk; I am pg-db-cluster-55f97f9dc8-ctqxk 2021-08-23 06:39:24,527 INFO: updated leader lock during starting after demotion 2021-08-23 06:39:24.937 UTC [32391] FATAL: the database system is starting up /tmp:5432 - rejecting connections 2021-08-23 06:39:25.269 UTC [32407] FATAL: the database system is starting up 2021-08-23 06:39:25.946 UTC [32409] FATAL: the database system is starting up /tmp:5432 - rejecting connections 2021-08-23 06:39:26.954 UTC [32411] FATAL: the database system is starting up /tmp:5432 - rejecting connections 2021-08-23 06:39:27.962 UTC [32413] FATAL: the database system is starting up /tmp:5432 - rejecting connections 2021-08-23 06:39:28.971 UTC [32415] FATAL: the database system is starting up /tmp:5432 - rejecting connections /tmp:5432 - rejecting connections 2021-08-23 06:39:29.979 UTC [32417] FATAL: the database system is starting up 2021-08-23 06:39:30.987 UTC [32419] FATAL: the database system is starting up /tmp:5432 - rejecting connections 2021-08-23 06:39:31.996 UTC [32421] FATAL: the database system is starting up /tmp:5432 - rejecting connections 2021-08-23 06:39:32.077 UTC [32444] FATAL: the database system is starting up /tmp:5432 - rejecting connections 2021-08-23 06:39:33.005 UTC [32446] FATAL: the database system is starting up 2021-08-23 06:39:33.610 UTC [32447] FATAL: the database system is starting up 2021-08-23 06:39:33,610 WARNING: Retry got exception: 'connection problems' 2021-08-23 06:39:34.013 UTC [32471] FATAL: the database system is starting up /tmp:5432 - rejecting connections

Botpress Logs:

STACK TRACE VError: [DB Driver] Error checking if file exists "data/bots/cpadv-3761/bot.config.json": connect ECONNREFUSED 10.233.11.255:5432 at DBStorageDriver.fileExists (/snapshot/build-linux/out/bp/core/bpfs/drivers/db-driver.js) 08/22/2021 07:21:45.416 BotService Bot configuration file not found for bot "cpadv-3761" [VError, [DB Driver] Error checking if file exists "data/bots/cpadv-3761/bot.config.json": connect ECONNREFUSED 10.233.11.255:5432] STACK TRACE VError: [DB Driver] Error checking if file exists "data/bots/cpadv-3761/bot.config.json": connect ECONNREFUSED 10.233.11.255:5432 at DBStorageDriver.fileExists (/snapshot/build-linux/out/bp/core/bpfs/drivers/db-driver.js) 08/22/2021 07:21:48.487 LogsJanitor connect ECONNREFUSED 10.233.11.255:5432 08/22/2021 07:21:49.804 DialogJanitor Skipping the current run, previous operation still running 08/22/2021 07:21:49.831 LogDbPersister Error persisting messages [Error, connect ECONNREFUSED 10.233.11.255:5432] STACK TRACE Error: connect ECONNREFUSED 10.233.11.255:5432 at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1141:16) 08/22/2021 07:21:52.199 DialogJanitor connect ECONNREFUSED 10.233.11.255:5432 08/22/2021 07:21:56.697 LogDbPersister Error persisting messages [TimeoutError, Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?] STACK TRACE TimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call? at /snapshot/build-linux/node_modules/knex/lib/client.js:345:17 at tryCatcher (/snapshot/build-linux/node_modules/knex/node_modules/bluebird/js/release/util.js:16:23) at /snapshot/build-linux/node_modules/knex/node_modules/bluebird/js/release/catch_filter.js:17:41 at tryCatcher (/snapshot/build-linux/node_modules/knex/node_modules/bluebird/js/release/util.js:16:23) at Promise._settlePromiseFromHandler (/snapshot/build-linux/node_modules/knex/node_modules/bluebird/js/release/promise.js:547:31) at Promise._settlePromise (/snapshot/build-linux/node_modules/knex/node_modules/bluebird/js/release/promise.js:604:18) at Promise._settlePromise0 (/snapshot/build-linux/node_modules/knex/node_modules/bluebird/js/release/promise.js:649:10) at Promise._settlePromises (/snapshot/build-linux/node_modules/knex/node_modules/bluebird/js/release/promise.js:725:18) at _drainQueueStep (/snapshot/build-linux/node_modules/knex/node_modules/bluebird/js/release/async.js:93:12) at _drainQueue (/snapshot/build-linux/node_modules/knex/node_modules/bluebird/js/release/async.js:86:9) at Async._drainQueues (/snapshot/build-linux/node_modules/knex/node_modules/bluebird/js/release/async.js:102:5) at Immediate._onImmediate (/snapshot/build-linux/node_modules/knex/node_modules/bluebird/js/release/async.js:15:14) at processImmediate (internal/timers.js:456:21) at process.topLevelDomainCallback (domain.js:137:15) 08/22/2021 07:22:00.839 LogDbPersister Error persisting messages [Error, connect ECONNREFUSED 10.233.11.255:5432] STACK TRACE Error: connect ECONNREFUSED 10.233.11.255:5432 at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1141:16) 08/22/2021 07:22:02.184 BotService Bot configuration file not found for bot "bot-1-load" [VError, [DB Driver] Error checking if file exists "data/bots/bot-1-load/bot.config.json": connect ECONNREFUSED 10.233.11.255:5432] STACK TRACE VError: [DB Driver] Error checking if file exists "data/bots/bot-1-load/bot.config.json": connect ECONNREFUSED 10.233.11.255:5432 at DBStorageDriver.fileExists (/snapshot/build-linux/out/bp/core/bpfs/drivers/db-driver.js) 08/22/2021 07:22:03.209 BotService Bot configuration file not found for bot "bot-1-load" [VError, [DB Driver] Error checking if file exists "data/bots/bot-1-load/bot.config.json": connect ECONNREFUSED 10.233.11.255:5432]

EFF commented 3 years ago

My thoughts :

J-FMartin commented 3 years ago

@poojapatil12 can you confirm that we can close this issue please

poojapatil12 commented 3 years ago

Waiting for the fix for issue 55 to merge

laurentlp commented 3 years ago

@poojapatil12 It was fixed with this PR https://github.com/botpress/studio/pull/105 and available in studio v0.0.37