cameri / nostream

A Nostr Relay written in TypeScript
MIT License

[BUG] Timeout on payment website #346

Closed pgerstbach closed 6 months ago

pgerstbach commented 6 months ago

Describe the bug

Every now and then my relay gets into a state where every request to the payment website runs into a permanent timeout. After some time (minutes, hours) it becomes available again, until it starts timing out again.

During those timeout periods the relay itself is still reachable via websockets. Events can be stored and read. Only the payment website is affected.

To Reproduce

I can't reproduce it with any specific steps, but it happens daily. After a restart it works again, until the timeout occurs again.

Expected behavior

The payment webpage stays available.

System

Logs

There is no single specific log entry, but the following lines appear very often, though not always close in time to the failure:

nostream          | Unable to connect to Redis. Error: read ECONNRESET
nostream          |     at TCP.onStreamRead (node:internal/stream_base_commons:217:20) {
nostream          |   errno: -104,
nostream          |   code: 'ECONNRESET',
nostream          |   syscall: 'read'
nostream          | }
nostream          | Unable to connect to Redis. SocketClosedUnexpectedlyError: Socket closed unexpectedly
nostream          |     at Socket.<anonymous> (/app/node_modules/@redis/client/dist/lib/client/socket.js:181:1>
nostream          |     at Object.onceWrapper (node:events:628:26)
nostream          |     at Socket.emit (node:events:513:28)
nostream          |     at TCP.<anonymous> (node:net:322:12)
nostream          | web-socket-adapter: unable to handle message: SocketClosedUnexpectedlyError: Socket closed>
nostream          |     at Socket.<anonymous> (/app/node_modules/@redis/client/dist/lib/client/socket.js:181:1>
nostream          |     at Object.onceWrapper (node:events:628:26)
nostream          |     at Socket.emit (node:events:513:28)
nostream          |     at TCP.<anonymous> (node:net:322:12)

(the last 5 lines are repeated 100s of times within 1 second)

Additional context

Since I have monitoring in place, I can tell at what time (within 5 minutes) the site becomes unavailable. Is there any specific debugging setting I can switch on?

pgerstbach commented 6 months ago

Okay, so far, I see the following:

docker logs -f nostream-cache
1:C 29 Dec 2023 10:20:16.148 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 29 Dec 2023 10:20:16.148 # Redis version=7.0.5, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 29 Dec 2023 10:20:16.148 # Configuration loaded
1:M 29 Dec 2023 10:20:16.156 # Server initialized
1:C 29 Dec 2023 10:48:02.237 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 29 Dec 2023 10:48:02.237 # Redis version=7.0.5, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 29 Dec 2023 10:48:02.237 # Configuration loaded
1:M 29 Dec 2023 10:48:02.244 # Server initialized
1:C 29 Dec 2023 12:15:14.797 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 29 Dec 2023 12:15:14.797 # Redis version=7.0.5, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 29 Dec 2023 12:15:14.797 # Configuration loaded
1:M 29 Dec 2023 12:15:14.800 # Server initialized

Redis seems to restart every now and then because of an out-of-memory error:

Dec 29 10:32:36 at kernel: [62794.438636]  out_of_memory+0x106/0x2e0
Dec 29 10:32:36 at kernel: [62794.438930] Tasks state (memory values in pages):
Dec 29 10:32:36 at kernel: [62794.439403] Out of memory: Killed process 1142634 (redis-server) total-vm:27478268kB, anon-rss:11599084kB, file-rss:0kB, shmem-rss:0kB, UID:999 pgtables:47208kB oom_score_adj:0
Dec 29 10:32:40 at start[1117079]: nostream          | uncaught error: [ErrorReply: LOADING Redis is loading the dataset in memory]
Dec 29 10:32:41 at start[1117079]: nostream          | web-socket-adapter: unable to handle message: [ErrorReply: LOADING Redis is loading the dataset in memory]
Dec 29 10:32:42 at start[1117079]: nostream          | web-socket-adauncaught error: [ErrorReply: LOADING Redis is loading the dataset in memory]
Dec 29 10:32:42 at start[1117079]: nostream          |     at TCP.uncaught error: [ErrorReply: LOADING Redis is loading the dataset in memory]
Dec 29 10:32:42 at start[1117079]: nostream          |     at TCP.onStreamRead (node:internal/stream_base_commons:web-socket-adapter: unable to handle message: [ErrorReply: LOADING Redis is loading the dataset in memory]
Dec 29 10:32:42 at start[1117079]: nostream          | web-socket-adapter: unable to handle message: [ErrorReply: LOADING Redis is loading the dataset in memory]

Memory is fine most of the time; these are typical values:

               total        used        free      shared  buff/cache   available
Mem:           15989        1283       13958         154         747       14264
Swap:          16383         797       15586

But sometimes it looks like this:

               total        used        free      shared  buff/cache   available
Mem:           15989       15499         159         174         330          68
Swap:          16383        2054       14329

This is the output of top, sorted by memory, taken shortly after the reading above:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND       
1263844 lxd       20   0   18.7g  12.6g   1064 R  95.7  80.5   5:08.09 redis-server  
1252985 xxx       20   0 1466820   1.1g   3924 S  13.0   6.8  13:21.33 node          
1279048 lxd       20   0 4416636 138436 126904 S   0.0   0.8   0:01.27 postgres      

So it seems Redis sometimes allocates so much memory that the process gets killed. How can I limit its memory use? Redis has a maxmemory setting, but how do I set it in the nostream Docker environment? And would this actually help?
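
One option I am considering is capping Redis memory via a compose override, something like this (just a sketch: the service name cache and the values are assumptions on my part and would need to match the actual docker-compose.yml):

# docker-compose.override.yml (illustrative only)
services:
  cache:
    # start redis-server with a hard memory cap and evict least-recently-used
    # keys instead of growing until the kernel OOM-killer steps in
    command: redis-server --maxmemory 1gb --maxmemory-policy allkeys-lru

With allkeys-lru, Redis would drop old keys instead of being killed, though I am not sure how that interacts with the rate-limit data nostream stores.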

cameri commented 6 months ago

@pgerstbach We had a conversation off Github where we addressed the OOM issue with Redis as well as a connection issue due to NGINX. Do you mind updating here what was the solution?

pgerstbach commented 6 months ago

> @pgerstbach We had a conversation off Github where we addressed the OOM issue with Redis as well as a connection issue due to NGINX. Do you mind updating here what was the solution?

Certainly! I just waited long enough to be sure the issue was gone.

So, here is a write-up of how my issue was resolved. This would not have been possible without @Cameri, so thanks a lot for your help!

Solution Part One: Limit memory use

In my syslog I found multiple out-of-memory errors like this:

out_of_memory+0x106/0x2e0
Tasks state (memory values in pages):
Out of memory: Killed process 640637 (redis-server) total-vm:8235260kB, anon-rss:7209804kB, file-rss:0kB, shmem-rss:0kB, UID:999 pgtables:14252kB oom_score_adj:0

So Redis reserved more memory than was available; even an additional 16 GB swap file was not enough in some cases. The solution to this memory problem is to reduce the rate limiting configured in settings.yaml. The Redis cache is currently used only to implement those rate limits. Rate limits on invoices, connections, events and messages are possible (and enabled by default).

The longer the period used, the larger the memory consumption. I still had some rate limits with periods of days in my config. After removing them, memory consumption went down to a normal level. If needed, rate limits with periods of one or more hours can be removed as well.
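
For illustration, this is roughly the kind of entry I removed versus what I kept in settings.yaml (a sketch only: the exact keys may differ between nostream versions, and I am assuming the period is given in milliseconds):

limits:
  event:
    rateLimits:
      # a day-long window like this keeps rate-limit data in Redis for the
      # whole period; entries like this are what I removed
      - period: 86400000   # 1 day in milliseconds
        rate: 86400
      # a one-minute window is much cheaper to track and stayed in
      - period: 60000      # 1 minute in milliseconds
        rate: 120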

After removing the rate limits with the longest periods, my out-of-memory errors were gone. But I still had outages, so I found one more configuration issue:

Solution Part Two: Configuring nginx accordingly

I realized that my nginx instance was not configured properly: it was running into its file-descriptor and connection limits. In my nginx.conf I added the following lines:

worker_rlimit_nofile 8192;        # raise the limit on open file descriptors per worker
events {
        worker_connections 4096;  # maximum simultaneous connections per worker process
}

From the nginx docs: https://nginx.org/en/docs/ngx_core_module.html

Of course, the values need to be aligned with your server resources and requirements; YMMV.

So my issue was not really related to nostream itself, but maybe you have similar issues, and hopefully my solution will also help others! 🤙