dlareau / puzzlehunt_server

Server for Puzzle hunts run by Puzzlehunt CMU, but can be repurposed for other hunts.
MIT License

Speed up the server #11

Closed dlareau closed 5 years ago

dlareau commented 7 years ago

Profile the website further and find ways to reduce lag. It was better in Fall 2016, but apparently still got slow at the end.

EggyLv999 commented 7 years ago

One way is going to be to paginate the queue to decrease the amount of loading needed when refreshing the queue page.

dlareau commented 7 years ago

Paginated the queue page in 416e8a0.

Not calling this issue fixed just yet.

dlareau commented 7 years ago

Putting this here for later use: http://www.revsys.com/blog/2015/may/06/django-performance-simple-things/

dlareau commented 7 years ago

All requests should be met with relevant responses. For example, if the user submits an answer, the result should come back in that HTTP response, not via AJAX on the regular update cycle.
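As a rough sketch of what that could look like (the view, URL, and answer data below are illustrative stand-ins, not the actual server code), the submission view would report the result directly:

```python
# Hypothetical Django view: the result of a submission comes back in the
# HTTP response itself, so the common case needs no follow-up polling.
from django.http import JsonResponse
from django.views.decorators.http import require_POST

# Illustrative stand-in for however answers are actually stored.
ANSWERS = {1: "example answer"}


@require_POST
def submit_answer(request, puzzle_id):
    guess = request.POST.get("answer", "").strip().lower()
    correct = ANSWERS.get(int(puzzle_id)) == guess
    # Polling would then only be needed to catch later staff edits.
    return JsonResponse({"answer": guess, "correct": correct})
```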

dlareau commented 7 years ago

Helped with load in 8fd2e2eb

dlareau commented 6 years ago

An update on the state of lag during a hunt (outside of other pending issues such as announcements) would be nice. Knowing which pages are slow would allow better examination of code for profiling. As it stands everything loads very quickly when not during a hunt, but that is expected.

dlareau commented 6 years ago

Just noting, after the update made 20 minutes into the hunt on 3/31/18, I thought everything ran really well. I'd be interested in how the staff side seemed after that fix.

dlareau commented 5 years ago

And we're back... In the hunt yesterday (10/6/18) things got really bad. Requests were taking over 5 seconds to load, and some weren't loading at all. This is mildly weird given how well things worked at the end of the last hunt and that nothing particularly resource-intensive was added in the 3.2 update.

I'm going to propose the following path forward.

This is really the only way to go about it because the server always seems fine until the day of the hunt, when it gets like this, so we need some concrete way to impose a hunt-sized load on it.

Feel free to add more, but the current plan I have for speeding things up is:

dlareau commented 5 years ago

As always, anybody is welcome to help with this.

dlareau commented 5 years ago

Another note: if the fixes mentioned above don't get us results on the slowdowns, the next step would be to look into a multi-server setup on something like AWS so we can scale dynamically. But that is A) a pain, B) would cost money, and C) probably not needed. There are multiple reports of people serving over a million views a day on a single instance, so I think we just need to be smarter about some of the above items, like caching.

TomWildenhain commented 5 years ago

It seems like this is a good plan of attack. With my limited knowledge of how the server works, it seems like reducing AJAX requests on the answer submission page might have a huge impact. People often open multiple puzzles in different tabs, and that might be making things slow.

TomWildenhain commented 5 years ago

I would be very willing to consider moving to paid hosting if necessary. I imagine that the cost would be small compared to the costs of other parts of our hunts and the gains would be pretty large (automatic scaling). Ideally we would want hosting that doesn't charge much during the times when the server has very low demand.

dlareau commented 5 years ago

So the AJAX is an issue, but it shouldn't be that much of an issue since 8fd2e2e made it so that AJAX requests only happen when the page has focus. (So if a team has 10 puzzle pages open in 10 tabs of the same window, only the foreground one will be making AJAX requests.)

Once I get the load tester working, we can evaluate if moving to paid hosting would help and if the club is willing to pay for it, I'll happily help move the server over.

TomWildenhain commented 5 years ago

Sounds good. It will definitely be interesting to have the load-testing results.

dlareau commented 5 years ago

Just a quick update: During the hunt apparently AJAX requests made up 90% of our traffic, with us getting ~70 AJAX requests a second. Unfortunately that is exactly what we would expect from ~40 teams of ~6 people with each person having one puzzle open (puzzles check for answers every 3 seconds). I still want to get the load tester up and running to have concrete data on how much better we will get, but something tells me if I can optimize the AJAX request logic to be even a little faster, a lot of this load problem will go away.

TomWildenhain commented 5 years ago

It seems like the live updating of puzzle answers is mainly for cases when the Puzzlehunt staff manually changes the response to a submission. We used to do all responses manually, but now the server handles 99% of them automatically, so it might not be a big deal if the checking rate were dropped from once every 3 seconds to maybe once every 30 seconds. Do you think you could do an exponential backoff of the AJAX? We normally update a response fairly quickly if we are going to change it.

TomWildenhain commented 5 years ago

Liam said he did a total of 5-10 edits over the course of the hunt and most of them occurred within seconds of submission.

dlareau commented 5 years ago

That's good to know; it's true that the live updating was definitely a holdover from before auto-answering became the default in #32. I could see something like a 10-second starting period backing off to a 2-3 minute period doing well for the current setup, and it would lower the number of AJAX calls drastically.

TomWildenhain commented 5 years ago

Yeah, I think maybe 5 seconds, 10 seconds, 30 seconds, and then 1 minute would be good. The backoff would get reset after each answer submission. Since most teams don't submit answers continuously, this would limit most users to a polling rate of 1 request per minute. Do we have any other continuously polling pages other than the chat and admin pages?
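A sketch of that schedule (the real logic would live in the puzzle page's JavaScript; Python is used here purely for illustration, and the intervals are just the ones proposed above):

```python
# Illustrative polling backoff: 5s, 10s, 30s, then hold at 60s,
# resetting to the fast rate whenever the team submits an answer.
BACKOFF_SCHEDULE = [5, 10, 30, 60]  # seconds


class AnswerPoller:
    def __init__(self):
        self.step = 0

    def next_interval(self):
        """Seconds to wait before the next answer-status check."""
        interval = BACKOFF_SCHEDULE[min(self.step, len(BACKOFF_SCHEDULE) - 1)]
        self.step += 1
        return interval

    def on_submission(self):
        """An answer was just submitted, so poll quickly again."""
        self.step = 0


poller = AnswerPoller()
print([poller.next_interval() for _ in range(6)])  # [5, 10, 30, 60, 60, 60]
```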

TomWildenhain commented 5 years ago

Also, I'm noticing that the progress page takes a while to load even when the server is not in high demand (like right now). The progress page auto-refreshes, so this actually might contribute significantly. It is probably slow because of all the Django queries necessary to populate the chart. Is there any way to cache/optimize this?
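One possible optimization (just a sketch, assuming the standard Django cache framework is configured; the view name is a placeholder) would be to cache the expensive initial render for a short window so back-to-back loads don't each redo all of those queries:

```python
# Sketch: cache the heavy progress-page render for a short window.
from django.http import HttpResponse
from django.views.decorators.cache import cache_page


@cache_page(30)  # re-run the expensive chart queries at most every 30 seconds
def progress(request):
    # Imagine the team/puzzle/submission/unlock/solve queries happening here.
    return HttpResponse("rendered progress chart")
```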

dlareau commented 5 years ago

Do we have any other continuously polling pages other than the chat and admin pages?

No, the only polling pages are:

As for the progress page, it is indeed a hugely resource-intensive page, as it has to comb through the team table, the puzzle table, the submission table, the unlock table, and the solve table. The good news is that once it has been loaded once, the AJAX requests are relatively cheap. Each AJAX request just checks the latest entry in 3 tables to see if there is anything newer than the comparison value it sends along.

Some data to back this up: right now, loading the progress page took me 1.89 seconds; however, every AJAX request after that took only ~115ms, which is shorter than the request time of most static files.

[screenshot: request timings, 2018-10-08]

In my opinion, the only thing that could really be done for the progress page would be to have a client-side checkbox for fast updates. The main display computer could check this box to keep the current rate of AJAX requests and then we could make the default slower. This sort of solution only really works for staff pages where we can trust staff to check the box responsibly.

TomWildenhain commented 5 years ago

Since the AJAX requests aren't nearly as intensive as the initial load, I don't think there is any problem with the progress page. I'm not too concerned about the queue page or the staff chat page. I imagine they cause minimal server load. The general chat page could be an issue if teams leave it open. It of course has to be a bit more responsive than the puzzle pages, and people might leave it out of focus but still want it to update.

timparenti commented 5 years ago

Not sure the current state of the chat page updates, but perhaps that one specifically should poll at a (much) lower rate if out-of-focus?

dlareau commented 5 years ago

The chat page right now already does nothing when it doesn't have tab focus. The idea is that when users tab back to it, it will update, and they won't notice it wasn't updating while it wasn't visible.

Also, just to clear up terminology here: I was using just the word "focus", which isn't the clearest. A more accurate phrase might be "is the visible tab", the difference being that a puzzlehunt page will still update if it is the visible tab in an unfocused window. As far as I can tell, for security/process-isolation reasons, it is not easily possible for a webpage to know whether the window it is in is the one that has focus, only whether it is the currently visible tab in a window.

dlareau commented 5 years ago

For those curious, here is the breakdown of page views for the day of October 6th: requests.txt. Uninteresting pages and pages with 3 or fewer views have been hidden.

These ratios are what will be used in the load tester.
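For context, a load-test script driven by those ratios could look something like the sketch below (this assumes the Locust tool and made-up URLs; it is not necessarily what the load_test branch does):

```python
# Sketch of a ratio-weighted load test; tool choice and URLs are illustrative.
from locust import HttpUser, between, task


class HuntTeamMember(HttpUser):
    wait_time = between(2, 4)  # roughly the 3-second answer-check cadence

    @task(10)  # weights mirror the observed page-view ratios
    def poll_puzzle_page(self):
        self.client.get("/puzzle/1/")

    @task(1)
    def view_hunt_page(self):
        self.client.get("/hunt/current/")
```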

TomWildenhain commented 5 years ago

When you say "EACH PUZZLE AJAX ~120,000", that's a total of like 2,000,000, right?

dlareau commented 5 years ago

Yeah, but as mentioned above: 2,000,000 requests / (8 × 60 × 60) seconds ≈ 70 requests per second, which is the expected value for ~40 teams of ~6 people, each with a puzzle open that does an AJAX request every 3 seconds.

Numbers get big quickly serving stuff at this scale. Even with the proposed backoff, we'd still likely only cut it down to 500,000 requests over the course of the hunt.

dlareau commented 5 years ago

Progress on this can be followed over on the load_test branch.

dlareau commented 5 years ago

Good news update: the AJAX backoff works... sort of. The website is usable with 240 users and AJAX backoff, and unusable with 240 users and no AJAX backoff, so it definitely had an effect. However, it's also unusable with 480 users and AJAX backoff, so if we continue to scale we'll just hit the same roadblocks. Next step is improving caching so stuff like these AJAX requests don't even make it to Django.
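One way to do that (a sketch, not what's on the branch; the view and data are placeholders) is to put short Cache-Control headers on the AJAX responses so the browser can absorb repeat checks before they ever reach Django:

```python
# Sketch: mark the AJAX response as briefly cacheable so repeat checks can
# be answered from the browser's cache instead of by Django.
from django.http import JsonResponse
from django.views.decorators.cache import cache_control


@cache_control(max_age=3, private=True)  # safe to reuse for a few seconds
def puzzle_status(request, puzzle_id):
    # Placeholder for the real "anything new since X?" check.
    return JsonResponse({"puzzle": puzzle_id, "new_responses": []})
```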

TomWildenhain commented 5 years ago

Nice work! This should definitely help for the next hunt. I wonder if we can work with the computer club to get a more powerful server, since it does seem like we might need that at some point. Just to confirm: the load is mostly limited by the server's processing power, not bandwidth, right?

Incidentally, it is quite possible that our next hunt will involve a large .js file loaded on the main hunt page. My immediate plan would be to put it in the static folder, but you mentioned that that could cause a lot of load. Do you have any thoughts about how to deal with large static files?

dlareau commented 5 years ago

I don't believe we are currently processing-power bound, well, at least not in the traditional sense. I don't think having a better CPU or more cores will magically make things better. It is true that we are waiting on computations, but I think that's because we are sitting on a lot of software architecture/configuration currently set up for a mid-load application. The hardware should be fine for the load we need; it's just that we need to tune the software to actually be able to utilize all of it (for example, I don't think we hit 100% or even 90% utilization on any physical resource during the hunt).

One of the problems with this hunt was the continued and constant load of static file requests from the first-round meta. Normally browsers can intelligently cache static files, but directly requesting static files from JavaScript, like that meta did, overrides that and continually bogs down the static file server. A single large .js file on the hunt page should be fine, because each user will only load it once and then their browser will cache it for some time.

That being said, one of my plans mentioned in a comment above is to move static file serving to separate software, so that a bunch of static file requests only slow down other static file requests and not every request.
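For what it's worth, the Django side of splitting static files out is small (a settings sketch; the hostname and path are made up):

```python
# settings.py sketch: once static files live on their own server or CDN,
# Django only needs to emit URLs pointing at it.
STATIC_URL = "https://static.puzzlehunt.example.com/"

# Files are still gathered locally with `manage.py collectstatic` and then
# served by the separate static file server.
STATIC_ROOT = "/srv/puzzlehunt/static/"
```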

dlareau commented 5 years ago

Also, now that I can properly load test, I can see which parts are failing. For example, last night I realized that our database connection gets a bit unhappy under load. Here are two database requests for the same data that were both part of the same web request:

[screenshot: timings of the two duplicate database queries, 2018-10-19]

First of all, I should change the code so it ideally doesn't make the exact same database request twice in the same web request. But secondly, ~60ms is the time of a whole web request under light load, HTML rendering and everything. Under heavy load we're taking that long on a single database request.
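On the duplicate query, the fix is just to evaluate the queryset once and reuse the result instead of issuing it again. On the unhappy connections, persistent database connections might help; a sketch of that settings change (the engine and database name are placeholders):

```python
# settings.py sketch -- engine and database name are placeholders.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "puzzlehunt",
        # Keep connections open for ~60s instead of reconnecting on every
        # request, which is one place per-query time can go under load.
        "CONN_MAX_AGE": 60,
    }
}
```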

dlareau commented 5 years ago

Closing this and opening more specific issues for better tracking. Otherwise this will stay open forever because there are always things that could be better. Having a number of more specific issues will allow better planning for future versions.