Closed DasSkelett closed 3 years ago
I like the idea of making the visualization pages admin-only, since then we could put the new routes at `/admin/profiling` and `/admin/profiling/<int:id>`, which would help to keep them out of the way of the normal site routes.
It might even make sense to make profiling the default admin page, because the list of users isn't really that interesting.
On closer inspection, I am not sure that `flask-profiler` and `snakeviz` are compatible. `flask-profiler` seems to only collect HTTP headers, start time, and elapsed time, which could be useful to a project that wants to find out which of its routes are slow, but it does not provide the kind of stack trace sampling that would be needed for flame/icicle graphs. `snakeviz` says that it parses data files generated by `cProfile`, which is not what `flask-profiler` seems to be doing. Maybe there is another plugin that can do it...
Also looked at `Flask-Profile` / `flask.ext.profile`, which seems about the same: it just collects route run times rather than stack trace samples.
This one mentions both Flask and `cProfile`: https://werkzeug.palletsprojects.com/en/1.0.x/middleware/profiler/ Possibly we could dump data into a directory and then get a listing of those files for the visualization page. But so far I don't see how to turn it on and off for individual requests; it seems that it would profile everything once you enable it.
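For reference, a minimal stdlib sketch of the dump-to-directory idea using `cProfile`/`pstats` directly (the directory name and the profiled function are made up for illustration; the Werkzeug middleware would produce similar per-request `.prof` files that a visualization page could list):

```python
import cProfile
import pstats
import time
from pathlib import Path

PROFILE_DIR = Path("profiles")  # hypothetical dump directory

def slow_handler():
    # Stand-in for a request handler we want to profile.
    return sum(i * i for i in range(100_000))

def profile_to_dir(func):
    """Run func under cProfile and dump a timestamped .prof file."""
    PROFILE_DIR.mkdir(exist_ok=True)
    profiler = cProfile.Profile()
    result = profiler.runcall(func)
    out = PROFILE_DIR / f"{func.__name__}.{time.time():.0f}.prof"
    profiler.dump_stats(out)
    return result, out

result, prof_file = profile_to_dir(slow_handler)

# The admin page would list these files; snakeviz can open each of them.
for f in sorted(PROFILE_DIR.glob("*.prof")):
    stats = pstats.Stats(str(f))
    print(f.name, "->", stats.total_calls, "calls")
```

The `.prof` files written this way are exactly the format `snakeviz` consumes.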
`snakeviz` seems to depend on IPython / Jupyter, so it may not be able to generate visualizations in a Flask process (even though it says it is "a browser based graphical viewer").
More chances to get my hopes up:
This one turns a `.prof` file from cProfile into an `.svg`. Maybe we could run it in a scheduled task to pre-render all the data we've accumulated, then list the `.svg` files in the profiling admin route. It even has a mode where it outputs SVGs directly into the profiling directory!
For the record on GitHub: profiling works great, and we've already done some huge optimizations, with more in the pipeline (#345, #370). But the nightly slowdowns are not solved; instead they seem to be even worse.
They look very much like they're caused by the storage for the mod zips, backgrounds and thumbnails misbehaving (an SMB/CIFS mount on the host, then bind-mounted into the container, I think). This results in very high server load, slowing down other applications, but it also freezes disk access itself: (Note that this request took 14.5 seconds to complete)
Small summary and next steps:
We have the issue with the nightly slowdowns, which got worse and worse over the last few weeks. Earlier today I discovered that we aren't serving user content (zips, background images, thumbnails) from the Apache Web Server [AWS] (which is inside the SpaceDock container, between ATS and gunicorn) directly, but pass every request through to gunicorn, which gets overwhelmed trying to serve all these big files. It really isn't made for that: it blocks workers for a long time, and it seems to load the whole file into memory before responding to the request, instead of streaming it in chunks.
We have basically two ways to make AWS load the files from disk itself:
1) Prohibit proxy-passing the requests to gunicorn, and alias the `/content/` route to the storage directory:

```apache
ProxyPass /content/ !
Alias /content/ /storage/sd-alpha/
```
2) Use the `XSendFile` feature. This basically makes gunicorn send an HTTP header (`X-SendFile`) with a file path as the response, which tells AWS to read the file from disk and serve it itself. I think there's some kernel magic involved as well: in kernel-world, `sendfile` means copying a file from one file descriptor to another in kernel space, instead of moving it through user space, saving a few CPU cycles. See https://linux.die.net/man/2/sendfile. AWS makes use of this feature, I think.
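To make the kernel-space copy concrete, here is a small stdlib illustration of the `sendfile(2)` idea (assumes Linux; it copies between two regular files purely for demonstration, whereas a web server would typically use a socket as the destination):

```python
import os
import tempfile

data = b"hello from sendfile" * 1000

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "src.bin")
    dst_path = os.path.join(tmp, "dst.bin")
    with open(src_path, "wb") as f:
        f.write(data)

    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        size = os.fstat(src.fileno()).st_size
        sent = 0
        # os.sendfile copies in kernel space; no user-space buffer is involved.
        while sent < size:
            sent += os.sendfile(dst.fileno(), src.fileno(), sent, size - sent)

    with open(dst_path, "rb") as f:
        copied = f.read()

assert copied == data
```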
To use XSendFile, the `cdn-domain` config option needs to be unset due to how our code works right now. Thus it also no longer redirects to a `/content` URL, so caching from ATS would be disabled (maybe, I don't know how it's configured).
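To illustrate the `X-Sendfile` mechanism, a hedged, minimal WSGI sketch (the path prefix and route are made up; with Apache's `mod_xsendfile` enabled, the server intercepts the header and streams the file itself instead of the empty body):

```python
# Minimal WSGI app demonstrating the X-Sendfile pattern: the application
# returns only headers; the front-end web server (e.g. Apache with
# mod_xsendfile) serves the named file from disk.

def app(environ, start_response):
    # Hypothetical mapping from URL path to a file on disk.
    file_path = "/storage/sd-alpha" + environ.get("PATH_INFO", "")
    start_response("200 OK", [
        ("Content-Type", "application/zip"),
        # Apache replaces the (empty) body with the file's contents.
        ("X-Sendfile", file_path),
    ])
    return [b""]

# Simulate a request to check the header the app would emit.
captured = {}

def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)

body = app({"PATH_INFO": "/content/mod.zip"}, fake_start_response)
print(captured["headers"]["X-Sendfile"])  # /storage/sd-alpha/content/mod.zip
```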
So now we tried 1) in another attempt to combat the nightly slowdowns, but for whatever reason it serves some files correctly, while others return either a 502 or time out after exactly 30s, with cURL reporting:

```
curl: (92) HTTP/2 stream 0 was closed cleanly, but before getting all response header fields, treated as error
```

The cause is unknown; it might or might not be a bug in AWS, or a config error.
(Easier to get back to this if it's open...)
This sounds related:
apache reads wrong data over cifs filesystems served by samba https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900821
Workaround/fix proposed in the comments:

> Hi,
> by default, apache uses mmap, so probably mmap is broken on cifs. An alternate workaround should be to set EnableMMAP off in the apache config.
> Cheers, Stefan
I know from previous research that mmap with CIFS can be problematic.
Collecting together what I surmise the optimal config would be for prod, for easy pasting:

```apache
ProxyPass /content/ !
Alias /content/ /storage/sdmods/
# CIFS and MMAP don't get along
EnableMMAP off
```
@V1TA5 reports that some of this load may be due to an anti-virus scan. If so, rescheduling it so it doesn't overlap US prime time may help.
The virus scan is pretty much confirmed as the cause. After disabling it yesterday, the big slowdown was gone, with only small, individual spikes in processing time that could be attributed to general user-generated load: (Tonight in yellow, average Friday->Saturday night in light blue, all times UTC+2)
I'd leave this issue open for a bit, to make sure it wasn't a coincidence and to check how it behaves over the next few days.
We also need to think about how to bring the scan back, in a less disruptive way. SpaceDock's mods should definitely be scanned, as we're essentially an open file hosting service accessible for everyone.
And a small follow-up to the `EnableMMAP` story: we set `EnableMMAP off` and the proxy bypass for `/content/` on production, and it works great, no more corrupt zips. It may or may not have reduced CPU and memory consumption; I don't have data on this (and it might've been negligible compared to the virus scan, but it should be huge without it).
With the recent production upgrade (#398) even the remaining slowdowns have disappeared completely:
All those performance improvements (including reducing storage/disk access, caching costly operations, reducing database queries etc.) really paid off!
I'm going to close this issue now as this problem is resolved. @HebaruSan opened some issues to track further improvement opportunities, and also for discussion about the virus scanning, as we do want to get it back in some form or another (with less performance impact).
Description (What went wrong?):
Every day/night between 23:00 UTC and 05:00 UTC, SpaceDock's server-side processing times experience spikes from around 500ms to 2-4s, sometimes even up to 8 seconds. This is measured from a server with a very good connection to SpaceDock's, against the `/kerbal-space-program/browse/top` endpoint, using the Prometheus Blackbox Exporter. The measurement is split up into different request steps (see the first picture):
Here's a second graph of the processing time only, divided in buckets, and to a linear scale.
This does not only affect `/kerbal-space-program/browse/top`; if you visit any other page on SpaceDock during these times, you'll experience equally slow response times / long loading times.

Fixing approach
The latency spikes happen during the evening in the US, likely when the request load is highest on SpaceDock. There's one or more bottlenecks we need to find and fix to avoid such high processing times.
We can probably rule out the network bandwidth; otherwise we should see the spikes in "transfer", "tls" and "connect", not "processing". Also, alpha is totally fine during these times. Possible causes:
- the database being overloaded, lock contention or whatever
- the template rendering taking so long
- gunicorn having too few or too many workers (or too many gunicorn instances per se, we're still at 6 instances à 8 workers)
- memory pressure / exhaustion (maybe coupled with too many gunicorn workers)
- some expensive code path
- ...
Profiling
To get us closer to the cause, it would be nice if we could do some profiling of SpaceDock's performance.
There are basically three different ways to do this:
Profile individual requests on local development server
This can (and did) reveal some duplicated database calls, repeatedly called functions that are expensive and could/should be cached, and some other stuff. The data we can get out of this is very limited though, and very far from real production performance data.
Profile alpha/beta using load tests
We could enable some profiler in alpha/beta, hit it hard, and try to find the problem. This could give some hints about where it fails under load, and has a lower risk of affecting production. But just hammering a single endpoint won't give us accurate real-world data, and trying to simulate real-world traffic will be hard to impossible (at least without knowing what the real-world traffic looks like). Alpha+beta can be accessed by the two primary code maintainers of SpaceDock (me and @HebaruSan), so getting data in and out of there would be very easy and can be done without having to coordinate too much with @V1TA5.
Profile production by sampling x real-world requests per time period (or a set %)
This would give us the most accurate data, matching what the production server actually experiences. Tracing every request would cause performance to drop even more, so we'd need to restrict the profiling to only a few requests every now and then (how often can be discussed once we find a way to do it). We can also leave this running in the background indefinitely. Since only @V1TA5 (and Darklight) have access to production, changing profiling settings and getting profile data out will be difficult. This one would probably require us to make the profiling data available directly as part of SpaceDock, e.g. by running a web visualizer, or by making it downloadable somehow. It should be admin-locked.
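The sampling idea could look roughly like this as a WSGI middleware (a sketch with a made-up sample rate and dump directory, not SpaceDock's actual code):

```python
import cProfile
import random
import time
from pathlib import Path

PROFILE_DIR = Path("prod-profiles")  # hypothetical dump directory
SAMPLE_RATE = 0.01                   # profile ~1% of requests

class SamplingProfilerMiddleware:
    """Wrap a WSGI app and profile a random sample of requests."""

    def __init__(self, app, rate=SAMPLE_RATE):
        self.app = app
        self.rate = rate
        PROFILE_DIR.mkdir(exist_ok=True)

    def __call__(self, environ, start_response):
        # Most requests pass through with zero overhead.
        if random.random() >= self.rate:
            return self.app(environ, start_response)
        # Sampled requests run under cProfile; dump one .prof per request.
        profiler = cProfile.Profile()
        response = profiler.runcall(self.app, environ, start_response)
        profiler.dump_stats(PROFILE_DIR / f"req.{time.time_ns()}.prof")
        return response

# Tiny demo app to exercise the middleware.
def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]

# rate=1.0 so the demo definitely profiles the request.
wrapped = SamplingProfilerMiddleware(demo_app, rate=1.0)
body = wrapped({}, lambda status, headers: None)
print(body, "profiles on disk:", len(list(PROFILE_DIR.glob("*.prof"))))
```

An admin route could then list `PROFILE_DIR` and serve the files for download or visualization.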
In the end we probably get the most out of it if we have all three possibilities. (Local profiling is basically already possible, but we could make it simpler to set up.)
Profiling Tools
We already found the following:
- `flask-profiler`: https://github.com/muatik/flask-profiler#sampling
- `snakeviz`: https://jiffyclub.github.io/snakeviz/