KSP-SpaceDock / SpaceDock

Current Codebase (Python / Flask)
https://spacedock.info

Find the cause for the nightly latency spikes #344

Closed · DasSkelett closed this issue 3 years ago

DasSkelett commented 3 years ago

Description (What went wrong?):

Every day/night between 23:00 UTC and 05:00 UTC, SpaceDock's server-side processing times spike from around 500 ms to 2-4 s, sometimes even up to 8 seconds. This is measured from a server with a very good connection to SpaceDock's, against the /kerbal-space-program/browse/top endpoint, using the Prometheus Blackbox Exporter.

The measurement is split up into the different request steps (see the first picture).

Here's a second graph of the processing time only, divided into buckets and on a linear scale.

This does not only affect /kerbal-space-program/browse/top; if you visit any other page on SpaceDock during these times, you'll experience equally slow response times / long loading times.
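For context, here is a minimal sketch of how such per-phase timings can be approximated from the client side. This is not the Blackbox Exporter's implementation; pycurl and the phase arithmetic below are assumptions used purely for illustration.

```python
# Rough client-side approximation of the request phases (connect / tls /
# processing / transfer) against the monitored endpoint.
# Assumption: pycurl is installed; this mimics, but is not, the Blackbox Exporter.
import io
import pycurl

buf = io.BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, "https://spacedock.info/kerbal-space-program/browse/top")
c.setopt(pycurl.WRITEDATA, buf)
c.perform()

connect = c.getinfo(pycurl.CONNECT_TIME) - c.getinfo(pycurl.NAMELOOKUP_TIME)
tls = c.getinfo(pycurl.APPCONNECT_TIME) - c.getinfo(pycurl.CONNECT_TIME)
# "processing" ~ time the server spends before the first response byte arrives
processing = c.getinfo(pycurl.STARTTRANSFER_TIME) - c.getinfo(pycurl.PRETRANSFER_TIME)
transfer = c.getinfo(pycurl.TOTAL_TIME) - c.getinfo(pycurl.STARTTRANSFER_TIME)
c.close()

print(f"connect={connect:.3f}s tls={tls:.3f}s "
      f"processing={processing:.3f}s transfer={transfer:.3f}s")
```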

Fixing approach

The latency spikes happen during the evening in the US, likely when the request load on SpaceDock is highest. There are one or more bottlenecks we need to find and fix to avoid such high processing times.

We can probably rule out network bandwidth, otherwise we should see the spikes in "transfer", "tls" and "connect", not "processing". Also, alpha is totally fine during these times. Candidate causes include:

* The database being overloaded, lock contention or similar.
* Template rendering taking that long.
* gunicorn having too few or too many workers (or too many gunicorn instances per se; we're still at 6 instances à 8 workers).
* Memory pressure / exhaustion (maybe coupled with too many gunicorn workers).
* Some expensive code path.
* ...

Profiling

To get us closer to the cause, it would be nice if we could do some profiling of SpaceDock's performance.

There are basically three different ways to do this:

Profile individual requests on local development server

This can (and did) reveal duplicated database calls, repeatedly called expensive functions that could/should be cached, and some other things. The data we can get out of this is very limited though, and very far from real production performance data.
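As an illustration, a single request can be profiled locally with cProfile and the Flask test client. The `create_app` import below is a hypothetical stand-in for however the development app is actually constructed:

```python
# Profile one request against the local development app with cProfile.
# NOTE: "create_app" is a hypothetical stand-in for the real app factory/module.
import cProfile
import pstats

from myapp import create_app  # hypothetical import

app = create_app()
client = app.test_client()

profiler = cProfile.Profile()
profiler.enable()
client.get("/kerbal-space-program/browse/top")
profiler.disable()

# Show the 25 most expensive calls by cumulative time; duplicated DB calls
# and repeatedly called expensive helpers tend to stand out here.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(25)
```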

Profile alpha/beta using load tests

We could enable a profiler in alpha/beta, hit it hard, and try to find the problem. This could give some hints about where it fails under load, and it has a lower risk of affecting production. However, just hammering a single endpoint won't give us accurate real-world data, and trying to simulate real-world traffic will be hard to impossible (at least without knowing what the real-world traffic looks like). Alpha and beta can be accessed by the two primary code maintainers of SpaceDock (me and @HebaruSan), so getting data in and out of there would be very easy and can be done without having to coordinate too much with @V1TA5.
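A trivially simple load generator along these lines might look as follows; the alpha hostname and the concurrency numbers are made up, and as said above this won't resemble real-world traffic:

```python
# Hammer a single endpoint on alpha/beta with N concurrent workers and
# report response times. Purely illustrative; not a realistic traffic model.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://alpha.spacedock.info/kerbal-space-program/browse/top"  # assumed host

def timed_get(_):
    start = time.monotonic()
    r = requests.get(URL, timeout=60)
    return r.status_code, time.monotonic() - start

with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(timed_get, range(200)))

durations = sorted(d for _, d in results)
print(f"median={durations[len(durations) // 2]:.2f}s max={durations[-1]:.2f}s")
```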

Profile production by sampling x real-world requests per time period (or a set %)

This would give us the most accurate data, matching what the production server actually experiences. Tracing every request would cause performance to drop even more, so we'd need to restrict the profiling to only a few requests every now and then (how often can be discussed once we've found a way to do it). We could also leave this running in the background indefinitely. Since only @V1TA5 (and Darklight) have access to production, changing profiling settings and getting profile data out of it will be difficult. This option would probably require us to make the profiling data available directly as part of SpaceDock, e.g. by running a web visualizer or by making it downloadable somehow. It should be admin-locked.
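One possible shape for this (a sketch only; the sampling rate, paths and wiring are placeholders, not necessarily what we'd ship) is a pair of request hooks that wrap a random fraction of requests in cProfile and dump the results to disk:

```python
# Sample ~1% of requests with cProfile and dump .prof files for later
# visualization. Sketch only; rate, paths and app wiring are placeholders.
import cProfile
import os
import random
import time

from flask import Flask, g, request

app = Flask(__name__)
PROFILE_DIR = "/tmp/spacedock-profiles"  # assumed location
SAMPLE_RATE = 0.01

@app.before_request
def maybe_start_profiling():
    if random.random() < SAMPLE_RATE:
        g.profiler = cProfile.Profile()
        g.profiler.enable()

@app.after_request
def maybe_stop_profiling(response):
    profiler = g.pop("profiler", None)
    if profiler is not None:
        profiler.disable()
        os.makedirs(PROFILE_DIR, exist_ok=True)
        fname = f"{int(time.time())}_{request.endpoint}.prof"
        profiler.dump_stats(os.path.join(PROFILE_DIR, fname))
    return response
```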

In the end we probably get the most out of it if we have all three possibilities (where local profiling is basically already possible, but we could make it simpler to set up).

Profiling Tools

We already found the following:

* flask-profiler: https://github.com/muatik/flask-profiler#sampling
* snakeviz: https://jiffyclub.github.io/snakeviz/

HebaruSan commented 3 years ago

I like the idea of making the visualization pages admin-only, since then we could put the new routes at /admin/profiling and /admin/profiling/<int:id>, which would help to keep them out of the way of the normal site routes.

It might even make sense to make profiling the default admin page, because the list of users isn't really that interesting.
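Roughly what I have in mind for those routes (a sketch with a placeholder admin guard and template names, not SpaceDock's actual helpers):

```python
# Hypothetical admin-locked routes for browsing collected .prof files.
# The admin check and templates are placeholders, not SpaceDock's real helpers.
import os
from functools import wraps

from flask import Blueprint, abort, render_template, session

profiling = Blueprint("profiling", __name__)
PROFILE_DIR = "/tmp/spacedock-profiles"  # assumed location

def admin_required(f):  # placeholder for the site's real admin guard
    @wraps(f)
    def wrapper(*args, **kwargs):
        if not session.get("is_admin"):
            abort(403)
        return f(*args, **kwargs)
    return wrapper

@profiling.route("/admin/profiling")
@admin_required
def list_profiles():
    files = sorted(os.listdir(PROFILE_DIR)) if os.path.isdir(PROFILE_DIR) else []
    return render_template("admin-profiling.html", files=files)

@profiling.route("/admin/profiling/<int:id>")
@admin_required
def show_profile(id: int):
    files = sorted(os.listdir(PROFILE_DIR)) if os.path.isdir(PROFILE_DIR) else []
    if id >= len(files):
        abort(404)
    return render_template("admin-profile.html", filename=files[id])
```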

HebaruSan commented 3 years ago

On closer inspection, I am not sure that flask-profiler and snakeviz are compatible.

flask-profiler seems to only collect HTTP headers, start time, and elapsed time, which could be useful to a project that wishes to find out which of their routes are slow, but it does not provide the kind of stack trace sampling that would be needed for flame/icicle graphs.

snakeviz says that it parses data files generated by cProfile, which is not what flask-profiler seems to be doing. Maybe there is another plugin that can do it...

Also looked at Flask-Profile / flask.ext.profile, which seems about the same; it just collects route run times rather than stack trace samples.

HebaruSan commented 3 years ago

This one mentions both Flask and cProfile: https://werkzeug.palletsprojects.com/en/1.0.x/middleware/profiler/. Possibly we could dump data into a directory and then get a listing of those files for the visualization page. But so far I don't see how to turn it on and off for different requests; it seems that it would profile everything once you enable it.
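Enabling it would look roughly like this; once wrapped, it profiles every request and writes a cProfile .prof file per request into profile_dir (the directory path is an assumption):

```python
# Wrap the WSGI app in Werkzeug's ProfilerMiddleware: every request gets
# profiled with cProfile and a .prof file is written to profile_dir.
from flask import Flask
from werkzeug.middleware.profiler import ProfilerMiddleware

app = Flask(__name__)
app.wsgi_app = ProfilerMiddleware(
    app.wsgi_app,
    profile_dir="/tmp/spacedock-profiles",  # assumed output directory
)
```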

HebaruSan commented 3 years ago

snakeviz seems to depend on IPython / Jupyter, so it may not be able to generate visualizations in a flask process (even though it says it is "a browser based graphical viewer").

HebaruSan commented 3 years ago

More chances to get my hopes up:

This one turns a .prof file from cProfile into an .svg; maybe we could run that in a scheduled task to pre-render all the data we've accumulated and then list the .svg files in the profiling admin route. It even has a mode where it outputs SVGs directly into the profiling directory!
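A rough sketch of that scheduled-task idea; since the converter isn't named here, flameprof is used below purely as an assumed stand-in, invoked via its CLI:

```python
# Periodically convert any new .prof files into .svg flame graphs so the
# admin route only has to list static files. flameprof is an assumed
# converter here; swap in whichever tool we actually pick.
import pathlib
import subprocess

PROFILE_DIR = pathlib.Path("/tmp/spacedock-profiles")  # assumed location

def render_new_profiles() -> None:
    for prof in PROFILE_DIR.glob("*.prof"):
        svg = prof.with_suffix(".svg")
        if svg.exists():
            continue  # already rendered
        with svg.open("wb") as out:
            subprocess.run(["flameprof", str(prof)], stdout=out, check=True)

if __name__ == "__main__":
    render_new_profiles()
```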

HebaruSan commented 3 years ago

https://github.com/HebaruSan/SpaceDock/tree/feature/profiling

(Screenshots: listing, flamegraph)

DasSkelett commented 3 years ago

For the record on GitHub: profiling works great, and we've already done some huge optimizations, with more still in the pipeline (#345, #370). However, the nightly slowdowns are not solved; instead they seem to be even worse.

They very much look like they're caused by the storage for the mod zips, backgrounds and thumbnails misbehaving (an SMB/CIFS mount on the host, then bind-mounted into the container, I think). This results in very high server load, slowing down other applications, but it also freezes disk access itself. (Note that this request took 14.5 seconds to complete.)

DasSkelett commented 3 years ago

Small summary and next steps:

We still have the issue with the nightly slowdowns, which has gotten worse and worse over the last few weeks. Earlier today I discovered that we aren't serving user content (zips, background images, thumbnails) directly from the Apache Web Server [AWS] (which is inside the SpaceDock container, between ATS and gunicorn), but instead pass every request through to gunicorn, which gets overwhelmed trying to serve all these big files: it really isn't made for that, it blocks workers for a long time, and it seems to load the whole file into memory before responding to the request instead of streaming it in chunks.
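To illustrate the difference (this is not SpaceDock's actual code path, just a sketch of whole-file reads vs. streaming from a worker; paths and routes are made up):

```python
# Illustration only: returning a whole file vs. streaming it in chunks from
# a Flask/gunicorn worker. Paths and route names are hypothetical.
from flask import Flask, Response, send_file

app = Flask(__name__)

@app.route("/content-bad/<path:name>")
def content_bad(name):
    # Reads the entire (potentially multi-hundred-MB) zip into memory and
    # ties up the worker for the whole transfer.
    with open(f"/storage/sd-alpha/{name}", "rb") as f:
        return Response(f.read(), mimetype="application/zip")

@app.route("/content-better/<path:name>")
def content_better(name):
    # Streams the file in chunks, but the worker is still occupied for the
    # duration of the download; serving straight from Apache avoids that.
    return send_file(f"/storage/sd-alpha/{name}", mimetype="application/zip")
```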

We have basically two ways to make AWS load the files from disk itself: 1) Prohibit proxy-passing the requests to gunicorn, and alias the /content/ route to the storage directory:

   ProxyPass /content/ !
   Alias /content/ /storage/sd-alpha/

2) Use the XSendFile feature. This basically makes gunicorn respond with an HTTP header (X-SendFile) containing a file path, which tells AWS to read the file from disk and serve it itself. I think there's some kernel magic involved as well: in kernel terms, sendfile means copying a file from one file descriptor to another in kernel space instead of moving it through user space, saving a few CPU cycles (see https://linux.die.net/man/2/sendfile), and AWS makes use of this feature, I think. To use XSendFile, the cdn-domain config option needs to be unset due to how our code works right now. SpaceDock would then also no longer redirect to a /content URL, so caching from ATS would be disabled (maybe, I don't know how it's configured).
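For reference, Flask can emit that header itself when send_file() is used, via the USE_X_SENDFILE setting; Apache then needs mod_xsendfile enabled with an XSendFilePath covering the storage directory. A sketch under those assumptions, not our current code:

```python
# Sketch: let Flask emit an X-Sendfile header so Apache serves the file.
# Requires mod_xsendfile on the Apache side ("XSendFile on" plus an
# "XSendFilePath /storage/..." whitelist); paths here are assumptions.
from flask import Flask, send_file

app = Flask(__name__)
app.config["USE_X_SENDFILE"] = True  # send_file() now only sets the header

@app.route("/mod/<int:mod_id>/download")
def download(mod_id):
    path = f"/storage/sd-alpha/mods/{mod_id}.zip"  # hypothetical layout
    # The response body stays empty; Apache reads the file at "path" itself.
    return send_file(path, as_attachment=True)
```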

So we now tried 1) in another attempt to combat the nightly slowdowns, but for whatever reason it serves some files correctly while others return either a 502 or time out after exactly 30s, with cURL reporting:

curl: (92) HTTP/2 stream 0 was closed cleanly, but before getting  all response header fields, treated as error

The cause is unknown, might or might not be a bug in AWS, or a config error.

HebaruSan commented 3 years ago

(Easier to get back to this if it's open...)

DasSkelett commented 3 years ago

This sounds related:

apache reads wrong data over cifs filesystems served by samba https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900821

I've seen these weird kernel stack traces in alpha's syslog as well (expand):

```
Jun 23 14:27:28 sd-alpha kernel: [1113982.280769] Call Trace:
Jun 23 14:27:28 sd-alpha kernel: [1113982.281730] pagecache_get_page+0xff/0x2e0
Jun 23 14:27:28 sd-alpha kernel: [1113982.282812] Code: Bad RIP value.
Jun 23 14:27:28 sd-alpha kernel: [1113982.286344] [ 11974] 0 11974 9508 32 65536 88 0 master
Jun 23 14:27:28 sd-alpha kernel: [1113982.287381] [ 24532] 33 24532 469445 54876 3878912 393876 0 gunicorn
Jun 23 14:27:28 sd-alpha kernel: [1113982.288443] [ 22956] 33 22956 54373 32213 532480 1474 0 gunicorn
Jun 23 14:27:28 sd-alpha kernel: [1113982.291749] [ 16111] 33 16111 56617 20435 548864 15527 0 gunicorn
Jun 23 14:27:28 sd-alpha kernel: [1113982.295979] [ 11638] 33 11638 35034 1425 307200 12891 0 celery
```

```
Jun 23 01:20:04 sd-alpha kernel: [1066737.502715] CIFS VFS: Close unmatched open
Jun 23 01:28:32 sd-alpha kernel: [1067245.511085] CPU: 9 PID: 22260 Comm: apache2 Tainted: P O 5.4.119-1-pve #1
Jun 23 01:28:32 sd-alpha kernel: [1067245.511870] try_charge+0x76b/0x7e0
Jun 23 01:28:32 sd-alpha kernel: [1067245.512791] ? filemap_map_pages+0x28d/0x3b0
Jun 23 01:28:32 sd-alpha kernel: [1067245.513699] RAX: 0000000000000001 RBX: 00007f965c247398 RCX: 00007f965c77212b
Jun 23 01:32:25 sd-alpha kernel: [1067478.244693] Call Trace:
Jun 23 01:32:25 sd-alpha kernel: [1067478.245776] ? unlock_page_memcg+0x12/0x20
Jun 23 01:32:25 sd-alpha kernel: [1067478.246919] RSP: 002b:00007f965bfc2df8 EFLAGS: 00010246
Jun 23 01:32:25 sd-alpha kernel: [1067478.251938] [ 14162] 33 14162 187028 164910 1589248 1377 0 gunicorn
Jun 23 01:32:25 sd-alpha kernel: [1067478.257485] [ 11609] 0 11609 58159 113 90112 94 0 accounts-daemon
Jun 23 01:32:25 sd-alpha kernel: [1067478.259886] [ 11626] 112 11626 157310 0 143360 232 0 prometheus-apac
Jun 23 01:34:09 sd-alpha kernel: [1067583.003212] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Jun 23 01:34:09 sd-alpha kernel: [1067583.009798] [ 31938] 33 31938 58478 4721 544768 33082 0 gunicorn
Jun 23 01:34:09 sd-alpha kernel: [1067583.010122] [ 17399] 33 17399 50885 25731 479232 4498 0 gunicorn
Jun 23 01:34:09 sd-alpha kernel: [1067583.011602] [ 11510] 0 11510 48202 222 401408 127 -250 systemd-journal
Jun 23 01:34:09 sd-alpha kernel: [1067583.068167] oom_reaper: reaped process 23806 (gunicorn), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jun 23 01:34:56 sd-alpha kernel: [1067629.879105] CIFS VFS: Close unmatched open
Jun 23 01:34:57 sd-alpha kernel: [1067630.662049] CPU: 14 PID: 32080 Comm: apache2 Tainted: P O 5.4.119-1-pve #1
Jun 23 01:34:57 sd-alpha kernel: [1067630.662963] __add_to_page_cache_locked+0x265/0x340
Jun 23 01:34:57 sd-alpha kernel: [1067630.663878] __do_fault+0x3c/0x130
Jun 23 01:34:57 sd-alpha kernel: [1067630.665833] R10: 0000000000000000 R11: 0000000000000000 R12: 00007f965bfc2e6c
Jun 23 01:34:57 sd-alpha kernel: [1067630.669566] [ 14162] 33 14162 273894 251817 2285568 1377 0 gunicorn
Jun 23 01:34:57 sd-alpha kernel: [1067630.670871] [ 4935] 33 4935 53750 31736 536576 1357 0 gunicorn
Jun 23 01:34:57 sd-alpha kernel: [1067630.672182] [ 16956] 33 16956 233781 211414 1957888 1669 0 gunicorn
Jun 23 01:34:57 sd-alpha kernel: [1067630.678776] [ 12041] 33 12041 34710 0 294912 13924 0 celery
Jun 23 01:34:57 sd-alpha kernel: [1067630.679878] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=ns,mems_allowed=0,oom_memcg=/lxc/101,task_memcg=/lxc/101/ns/system.slice/system-spacedock.slice/spacedock@8009.service,task=gunicorn,pid=17289,uid=33
Jun 23 01:35:31 sd-alpha kernel: [1067664.380327] [ 9113] 33 9113 175832 113258 1499136 41823 0 gunicorn
Jun 23 01:35:31 sd-alpha kernel: [1067664.385528] [ 6118] 33 6118 48619 21245 491520 6719 0 gunicorn
Jun 23 01:35:31 sd-alpha kernel: [1067664.386797] [ 12177] 33 12177 6703 191 90112 3213 0 gunicorn
Jun 23 01:35:31 sd-alpha kernel: [1067664.388184] [ 5706] 33 5706 502606 35 299008 2226 0 apache2
Jun 23 01:35:31 sd-alpha kernel: [1067664.389079] [ 32674] 33 32674 502858 1025 294912 191 0 apache2
Jun 23 01:35:31 sd-alpha kernel: [1067664.389869] [ 11635] 101 11635 56087 203 86016 51 0 rsyslogd
Jun 23 01:35:31 sd-alpha kernel: [1067664.391123] [ 11652] 0 11652 4119 82 73728 122 0 systemd-logind
Jun 23 01:37:51 sd-alpha kernel: [1067804.264258] [ 27038] 33 27038 43890 14676 450560 8557 0 gunicorn
Jun 23 01:37:51 sd-alpha kernel: [1067804.269497] [ 22026] 33 22026 535860 1497 364544 1112 0 apache2
Jun 23 01:37:51 sd-alpha kernel: [1067804.270306] [ 11621] 0 11621 6574 47 94208 1859 0 networkd-dispat
Jun 23 01:37:51 sd-alpha kernel: [1067804.271272] [ 12041] 33 12041 34710 0 294912 13924 0 celery
Jun 23 01:37:51 sd-alpha kernel: [1067804.272189] [ 11676] 0 11676 658 28 40960 0 0 agetty
Jun 23 01:38:52 sd-alpha kernel: [1067865.728116] add_to_page_cache_lru+0x4f/0xd0
Jun 23 01:38:52 sd-alpha kernel: [1067865.729242] page_fault+0x34/0x40
Jun 23 01:38:52 sd-alpha kernel: [1067865.735235] [ 29446] 33 29446 50648 14055 487424 15938 0 gunicorn
Jun 23 01:38:52 sd-alpha kernel: [1067865.736748] [ 8786] 33 8786 259890 236216 2158592 2969 0 gunicorn
Jun 23 01:38:52 sd-alpha kernel: [1067865.739326] [ 12716] 33 12716 54471 32389 520192 1388 0 gunicorn
Jun 23 01:38:52 sd-alpha kernel: [1067865.740787] [ 31553] 33 31553 502996 1534 299008 311 0 apache2
Jun 23 01:38:52 sd-alpha kernel: [1067865.818355] oom_reaper: reaped process 18953 (gunicorn), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jun 23 01:40:25 sd-alpha kernel: [1067958.672277] [ 6118] 33 6118 48299 25648 491520 1996 0 gunicorn
Jun 23 01:40:25 sd-alpha kernel: [1067958.675414] [ 31553] 33 31553 502888 1832 299008 298 0 apache2
Jun 23 01:40:25 sd-alpha kernel: [1067958.676665] [ 12022] 33 12022 34708 0 294912 13923 0 celery
```

Workaround/fix proposed in the comments:

> Hi,
>
> by default, apache uses mmap, so probably mmap is broken on cifs. An alternate workaround should be to set EnableMMAP off in the apache config.
>
> Cheers, Stefan

I know from previous research that mmap with CIFS can be problematic.

HebaruSan commented 3 years ago

Collecting together what I surmise the optimal config would be for prod for easy pasting:

```
ProxyPass /content/ !
Alias /content/ /storage/sdmods/
# CIFS and MMAP don't get along
EnableMMAP off
```

HebaruSan commented 3 years ago

@V1TA5 reports that some of this load may be due to an anti-virus scan. If so, rescheduling it to not overlap US activity primetime may help.


DasSkelett commented 3 years ago

The virus scan is pretty much confirmed as the cause. After disabling it yesterday, the big slowdown was gone, with only small, individual spikes in processing time that could be attributed to general user-generated load. (Tonight in yellow, the average Friday->Saturday night in light blue, all times UTC+2.)

I'd leave this issue open for a bit, to make sure it wasn't a coincidence and to check how it behaves over the next few days.

We also need to think about how to bring the scan back in a less disruptive way. SpaceDock's mods should definitely be scanned, as we're essentially an open file hosting service accessible to everyone.


And a small follow-up to the EnableMMAP story: we set EnableMMAP off and the proxy bypass for /content/ on production, and it works great; no more corrupt zips. It may or may not have reduced CPU and memory consumption, I don't have data on this (and the effect might've been negligible compared to the virus scan, but it should be huge without it).

DasSkelett commented 3 years ago

With the recent production upgrade (#398) even the remaining slowdowns have disappeared completely.

(Grafana screenshot showing processing time for the monitoring web requests)

All those performance improvements (including reducing storage/disk access, caching costly operations, reducing database queries etc.) really paid off!

I'm going to close this issue now as this problem is resolved. @HebaruSan opened some issues to track further improvement opportunities, and also for discussion about the virus scanning, as we do want to get it back in some form or another (with less performance impact).