facebookarchive / fbctf

Platform to host Capture the Flag competitions

Performance Issue(s) #456

Open punkdis opened 7 years ago

punkdis commented 7 years ago

Commit: 6d04149d9939687faf1bc6b7be1764213ad06d3b

AWS EC2, Ubuntu 14.04.5 LTS on an m4.xlarge

When we have over 150 participants, the CPU climbs to about 80% over 20 minutes. Once at 80%, performance goes to the crapper: the UI is slow and participants see the impact in game play.

[screenshot]

juliannagler commented 7 years ago

Thanks for filing the issue @rfrillman - we'll investigate asap.

justinwray commented 7 years ago

Do you happen to know if you were running Memcached? Was the database server on the same system as the webserver (a single system)?

punkdis commented 7 years ago

Yes Memcached was running.

memcache 1359 0.0 0.2 327440 2580 ? Sl 04:11 0:00 /usr/bin/memcached -m 64 -p 11211 -u memcache -l 127.0.0.1
ubuntu 1886 0.0 0.0 11744 932 pts/0 S+ 04:19 0:00 grep --color=auto memcached

Yes the database server was on the same system as the webserver.

justinwray commented 7 years ago

Interesting, nearly all of the data is cached now (even sessions). So you shouldn't see a high load in relation to the database. In the past we've seen this issue when Memcached wasn't running/working.

Did you have the binlog enabled by chance? Or the general query log? (Note: You probably shouldn't for performance reasons - but it would be useful for some investigative information in this case).

punkdis commented 7 years ago

The binlog is/was disabled. I can check whether there are any query logs.

punkdis commented 7 years ago

After reviewing the logs, nothing stands out. Maybe the resources of an m4.xlarge are not enough.

[screenshot]

I have another event in a month or so, and I am going to move to a c4.4xlarge.

[screenshot]

shargon commented 7 years ago

We would like to share our experience using the Facebook CTF platform. First of all, the platform was deployed on a Virtual Private Server (an OVH SSD3 VPS), whose specifications are:

After deploying the platform for our CTF, we tested it with 15 players and did not experience any flooding or lag on the platform.

Once the competition was opened, 340 participants registered, and the platform remained open from 02/16/2017 to 02/19/2017. At this point the platform began to bog down, making it almost impossible to interact with the interface, even when there were only 30 players in-game.

Memory (RAM) was not a problem; however, the CPU load increased considerably. To correct these problems, we made the following modifications:

The main problem is HHVM: although the modifications alleviated the lag at times, we never managed to correct the problems completely, and users kept complaining about the system bogging down until we deleted the user sessions and restarted the HHVM service again.

punkdis commented 7 years ago

Any other thoughts?

wisco24 commented 7 years ago

I had this same issue a while back, and still do at times for large events. HHVM ties up the CPU waiting on database queries. I just ran an event last week with 30 teams (50-60 sessions) on a c4.4xl (16 cores), but moved the DB to AWS Aurora (db.r3.2xl); I'm working on the code so I can open a pull request for it. Late in the afternoon, when the game log grew due to all the guesses/answers, the CPU was sitting around 60%. The biggest help in my testing was having the DB separate, so the DB and HHVM were not both beating up on the same CPU.

wisco24 commented 7 years ago

From the queries I looked at, the issue is how the leaderboard scores are computed. They are recalculated every time a team scores by querying and summing the full game logs table. A much more efficient approach would be to store each team's score in a separate DB table so the platform doesn't have to walk through all the guesses each time, as sketched below.
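
A toy illustration of the difference (illustrative TypeScript, not fbctf's actual Hack code or schema): recomputing from the full guess log on every scoreboard render versus keeping a running per-team total that is updated once per capture.

```typescript
// Hypothetical illustration of the two scoring strategies described above.

type GuessLog = { team: string; points: number; correct: boolean };

// Current approach (as described): walk the entire guess log every time the
// leaderboard is rendered. Cost grows with the number of guesses.
function recomputeScore(log: GuessLog[], team: string): number {
  return log
    .filter((g) => g.team === team && g.correct)
    .reduce((sum, g) => sum + g.points, 0);
}

// Proposed approach: maintain a per-team total, updated exactly once when a
// capture happens. The leaderboard then reads this table directly.
const scores = new Map<string, number>();

function recordCapture(team: string, points: number): void {
  scores.set(team, (scores.get(team) ?? 0) + points);
}

// After recordCapture(), a leaderboard read is O(teams), independent of how
// many wrong guesses have accumulated in the log.
recordCapture("team-a", 100);
console.log(scores.get("team-a")); // 100
```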

wisco24 commented 7 years ago

I just looked at pull request #459, and that might be a mitigation method as well, since it will create separate DBs, etc.; you could then possibly put a load balancer in front of it with sticky sessions so a team stays stuck to one system.

punkdis commented 7 years ago

Has anyone used the Live Sync API?

punkdis commented 7 years ago

[screenshot]

I ran another event and mysqld was off the chart, causing the game to become unresponsive. Any suggestions? Or any immediate commands I could issue to free up the CPU?

git rev-parse HEAD
51e06a7094df9d9e66249b40cae387d2150a52fd

wisco24 commented 7 years ago

Exactly what I was seeing... The best approach I found was to offload the DB to a different server. In my case, I moved the DB to AWS Aurora RDS and ran the rest on a rather large EC2 instance.

I tried restarting mysqld and hhvm when I had this issue before, but it caused the scoring system to get messed up and award double points for the same question.

punkdis commented 7 years ago

@wisco24 Do you have any documentation on how you set up the DB on a different server? I have another event scheduled for September.

stevcoll commented 7 years ago

@punkdis The multi-server changes were just merged into the dev branch. Depending on how many servers you want to separate, here are the general instructions for production. Note that the additional options you see with IP addresses specify the locations of the HHVM, MySQL, and cache servers. So generally you would provision in this order:

Update: See Justin's comment below for correct commands

punkdis commented 7 years ago

@stevcoll Do you know when the changes will be merged into production?

stevcoll commented 7 years ago

The last I heard master and dev were in conflict so it could be months, but I'll let @justinwray answer that one.

justinwray commented 7 years ago

@punkdis / @stevcoll dev and master are currently out of sync.

However, given the number of improvements, including performance updates, in dev it might be a good time to merge that work into master.

Merging dev into master would also bring the new server/container separation to master as well.

In the meantime, you can certainly pull down the dev branch; to do so, you need to do the following:

From that point forward all other documentation applies and you will now be using the dev branch.

Additionally, documentation for the new installation options can be found in the commit log here: https://github.com/facebook/fbctf/commit/b487fc119f3cef02b289e13d29b80272c4d94c2a

You can use the individual provision.sh executions as @stevcoll provided, or the quick_install option for multiple direct installations:

fredemmott commented 7 years ago

Restart HHVM service every 5 minutes.

Test this very carefully - as HHVM is a JIT with a high warmup cost, this will usually make things much worse.

In general, when looking at HHVM's CPU usage, the best tool to use is perf from the Linux perf tools; as HHVM generates a perf PID map, this will show both the native C++ and JITed PHP functions in the same stack traces. If you get lots of memory addresses without names in the stack traces, the PID map option is probably off or there's a permission issue reading it; if you get no PHP functions or raw memory addresses, it's probably not warmed up yet and there's no JITed code.

You can also use perf to profile the entire system, not just HHVM - so if, for example, there's a problem with MySQL, this will show you the PHP call to MySQL, the C++ HHVM builtin talking to MySQL, the C functions in MySQL, and ultimately the kernel calls that MySQL is spending its time in.

5hubh4m commented 7 years ago

HEAD: 1f236bba7e5e7a3e7169ff61037502a6928a94ba on an EC2 m4.16xlarge with a separate AWS Aurora SQL database.

The platform (coupled with our naiveté) caused us a lot of misery, and possibly cost us the reputation of our yearly CTF event, CodeFest '17 CTF.

[screenshot: utilisation graph]

As soon as the contest began, the website slowed to a crawl. Error 504s everywhere. I shifted the MySQL database to AWS Aurora and upgraded from m4.xlarge to m4.16xlarge. Still virtually no difference.

ubuntu@ip-172-31-20-69:/var/log/hhvm$ wc -l error.log
6853801 error.log

Midway into the contest, the game board stopped opening at all, as GETs to /data/teams.php would always fail. It's still not working, even after stopping the game and dropping all entries from sessions.

ubuntu@ip-172-31-20-69:/var/log/hhvm$ cat error.log | grep -v mysql | grep -v translation | tail
[Sun Sep 24 14:52:45 2017] [hphp] [15327:7f9ac5bff700:735:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/country-data.php(7): SessionUtils::enforceLogin()\n#1 {main}
[Sun Sep 24 14:52:45 2017] [hphp] [15327:7f9ac13ff700:1561:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/teams.php(7): SessionUtils::enforceLogin()\n#1 {main}
[Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac5bff700:736:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/map-data.php(7): SessionUtils::enforceLogin()\n#1 {main}
[Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac13ff700:1562:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/country-data.php(7): SessionUtils::enforceLogin()\n#1 {main}
[Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac5bff700:737:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/teams.php(7): SessionUtils::enforceLogin()\n#1 {main}
[Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac13ff700:1563:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/map-data.php(7): SessionUtils::enforceLogin()\n#1 {main}
[Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac13ff700:1564:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/teams.php(7): SessionUtils::enforceLogin()\n#1 {main}
[Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac13ff700:1565:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/command-line.php(7): SessionUtils::enforceLogin()\n#1 {main}
[Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac5bff700:738:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/configuration.php(7): SessionUtils::enforceLogin()\n#1 {main}
[Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac13ff700:1566:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/country-data.php(7): SessionUtils::enforceLogin()\n#1 {main}

There were around 500 team registrations, and about 276 teams managed to make a submission. There were as many as 160 sessions open simultaneously.

justinwray commented 7 years ago

@5hubh4m:

Thank you for submitting your report (and using the existing issue thread)! Could you please provide the quick_setup or provision details/options you used?

It sounds like your game has already concluded, but if you do still have the system running and are experiencing performance issues, please take a look at https://github.com/facebook/hhvm/wiki/Profiling#linux-perf-tools. The reports from perf, sorted by both Self and Children during performance degradation, will be beneficial for further troubleshooting.

Even if you are unable to obtain the perf reports, did you happen to capture the output from top or ps during the performance degradation? The output from such tools may help track down which process was using an excessive amount of CPU.

The entries in your error log show requests that require a valid session but were sent unauthenticated. In other words, these "errors" are simply requests where someone was not logged in. Given that sessions do time out, this frequently happens when people leave their browser windows open on the scoreboard but otherwise inactive.

@5hubh4m / All:

We have been dedicating much focus on the overall performance of the platform. One area of significant improvement has been in the caching of database requests. Nearly all requests are now cached, and the results are stored for subsequent usage unless the cached data is invalidated on specific actions. As a result, the total number of database queries is very low.

Another major area of performance improvement has been the ability, through the quick_setup or provision process, to separate the various components of the platform onto different servers/instances/containers. In total, you can install the platform components across four servers; one for Nginx, one for HHVM, one for MySQL, and one for Memcached. Furthermore, on platforms like Amazon Web Services (or equivalent cloud solutions) you can use services such as RDS for MySQL and ElastiCache for Memcached, to provide more purpose-tailored performance.

While we continue to do load and stress testing of our own, we, unfortunately, have been unable to reproduce the issue(s) detailed in this thread. Our testing, with hundreds of users, on an even smaller resource footprint (2 GHz of CPU and 2 GBs of RAM) shows negligible latency for the end-user. One thing that has become apparent through our testing is that MySQL, given the caching, is not experiencing a significant amount of load. Instead, it becomes more and more likely that HHVM is the cause of excessive CPU utilization, system load, and subsequent latency.

We are finalizing testing that will migrate the platform from Ubuntu 14.04 and HHVM 3.18 to Ubuntu 16.04 and HHVM 3.21. While this will not ensure a performance boost, it does include performance improvements to HHVM itself.

One commonality we see in the performance reports is the usage of Amazon Web Services. While the platform is certainly capable of running on AWS (and some of our testing takes place on AWS as well), there is one important aspect of AWS T2 instances to consider: CPU Credits. When using a t2 instance on AWS, you are given a set number of CPU credits and a baseline CPU level. For example, on a t2.medium you are given 24 CPU credits with a CPU baseline of 40%. A single CPU credit is worth 1 minute of 100% CPU usage. As credits are earned and stored (for 24 hours), you could burn through your entire balance of credits sequentially. For a t2.medium you can store a total of 576 credits (earned over 24 hours). In other words, with a t2.medium you are given 40% of the CPU resources across the board, with the ability to boost (on average) to 100% for 24 minutes per hour. Learn More about CPU Credits here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/t2-instances.html#t2-instances-cpu-credits
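
For reference, the arithmetic behind those t2.medium numbers works out as follows (a quick sanity check in TypeScript; the earn rate and balance cap are the values quoted above from the linked AWS docs, as of the time of writing):

```typescript
// Back-of-the-envelope check of the t2.medium CPU-credit figures cited above.

const creditsPerHour = 24;   // t2.medium earn rate (per the AWS docs linked above)
const minutesPerCredit = 1;  // 1 credit = 1 minute of a vCPU at 100%

// Sustained full-speed minutes available per hour without draining the balance:
const burstMinutesPerHour = creditsPerHour * minutesPerCredit; // 24

// Expressed as a baseline fraction of one vCPU:
const baseline = burstMinutesPerHour / 60; // 0.4, i.e. the 40% baseline

// Credits accrue for up to 24 hours, so the maximum stored balance is:
const maxBalance = creditsPerHour * 24; // 576

console.log({ burstMinutesPerHour, baseline, maxBalance });
```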

We will continue to dedicate effort to improving the performance of the platform and continue to test and troubleshoot in an attempt to recreate the aforementioned issue(s).

When experiencing a performance issue, please include the following in your reports:

5hubh4m commented 7 years ago

@justinwray

That is easy: hhvm was constantly at the top with 5000-6000% CPU consumption. The snippet of error.log I posted didn't contain the Resource Not Available errors because I wasn't able to locate them in the 1 GB text file. The ones I did mention are CERTAINLY not due to unauthenticated access, as they were generated after the game was over and the only active session was mine. When I used perf during the game to look at syscall errors, the output was riddled with EAGAINs.

I installed the game using provision.sh with the Let's Encrypt options and a single-server setup. The DB was then dumped using mysqldump and moved to Aurora, and I edited settings.ini to reflect the change.

Interestingly, neither the disk nor RDS was the bottleneck. RDS saw 400 connections at peak, but its CPU usage remained at 20-25%. Disk usage (on a provisioned-IOPS volume on AWS) also never maxed out its provisioned IOPS.

ubuntu@ip-172-31-20-69:~$ uname -a
Linux ip-172-31-20-69 3.13.0-129-generic #178-Ubuntu SMP Fri Aug 11 12:48:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@ip-172-31-20-69:~$ memcached -help
memcached 1.4.14

ubuntu@ip-172-31-20-69:~$ hhvm --version
HipHop VM 3.18.5 (rel)
Compiler: tags/HHVM-3.18.5-0-g61f6a1f9a199c929980408aff866f36a7b4a1515
Repo schema: 514949365dd9d370d84ea5a6db4a3dd3b619e484

ubuntu@ip-172-31-20-69:~$ nginx -v
nginx version: nginx/1.4.6 (Ubuntu)

I don't have the top or ps outputs unfortunately.

Also, a comment was made on CTFTime about fbctf's performance.

illustris – Sept. 22, 2017, 5:45 p.m. You should have tested the platform well before using it. We considered using it for BITSCTF last year, but a simple load test with 50 people made it crawl on a fairly powerful VPS

stevcoll commented 7 years ago

@5hubh4m Regarding those errors, they are generated when somebody does not have a session, but for instance leaves the scoreboard open. Until a hard refresh is performed, they are not redirected to the login page, and the requests are coming in without a session. These look to be the typical requests performed during an AJAX refresh, which happens every ~5 seconds on each platform window.

For instance, even if you removed all users, disabled login, or invalidated all sessions in the platform, any user who was previously authenticated but still has the gameboard page open will generate these errors. They also come up upon a hard refresh that does redirect to the login page (the invalid requests come in just before the redirect).

As @justinwray pointed out, we are heavily focused on reproducing these issues and finding a resolution. Unfortunately the biggest problem has been duplicating these live environments with hundreds or thousands of users, and ensuring our test provision and configuration matches exactly what the game admins are utilizing.

This morning I set up a stress test where I provisioned on a VM with only 2 GB of memory and 2 cores. After setting up various flags and quizzes with different configurations, I then hammered the platform with constant logins and subsequent endpoint requests of all types for over 6 hours, with hundreds of sessions and an estimated typical user load of a few hundred users.

I saw HHVM's memory go up about 8% during the event, and HHVM CPU hovered around 100% (ranging between 50% and 150%, i.e., up to about 1.5 cores' worth of utilization). NGINX and Memcached spiked to around 10-30% CPU, and MySQL handled the requests easily without even showing up in top. The bottleneck seemed to be HHVM, and yet even after 6 hours I was able to capture levels, bring up countries, and perform typical platform actions in multiple browsers without any noticeable latency.

The perf command showed that requests to Memcached were using up the most CPU in the HHVM process, which probably makes sense considering that a majority of MySQL queries were cached, and very few queries are actually sent to MySQL. Once a user captures a level or other actions take place, Memcached is invalidated, MySQL gets the queries once, then they are again cached into Memcached.

The next step on our end is now to try to reproduce the problem on AWS itself on instance types such as the ones you used, and see if that environment is involved in the issue.

fredemmott commented 7 years ago

@mpomarole : do you remember why this switched from the mcrouter extension to memcached in https://github.com/facebook/fbctf/commit/d5846c54eb418fae8ebf1eba7a926c53bba7faa3 ?

fredemmott commented 7 years ago

not that I think that we should need an async mysql extension, just that the mcrouter extension is extremely thoroughly tested for perf problems, whereas the memcached extension basically exists for PHP compatibility.

wisco24 commented 7 years ago

Steve, the problem grows when a team gets one answer right among all the other wrong answers: this causes the score to be recalculated for each team. I wrote a custom load testing script that worked well and was able to generate a fair amount of load. It's far from perfect, but it worked for what I needed to generate load on the system and reproduce the issues. What I found with the stress testing is that the scoreboard refresh causes the most load, as HHVM holds onto CPU while waiting for the MySQL query result, which takes a lot of time when you have a high number of failed guesses.

https://github.com/wisco24/ctfloadtesting (I never fully automated it, but got it working well enough for what I needed, with some manual steps.)

-Eric


wisco24 commented 7 years ago

The other issue is that after user interaction stops, HHVM never drops its resource usage; it stays at 100% for some reason. Also, the caching isn't as great as that example suggests. Think of a one-day event with 30 teams and 30 flags: that's 900 times the memcache is invalidated.

That's over 100 times an hour, or about once a minute... the caching doesn't help as much then.


stevcoll commented 7 years ago

@5hubh4m @fredemmott @justinwray @wisco24

Last night I attempted to duplicate the @5hubh4m scenario with some load testing scripts. The general setup was starting with an AWS EC2 m4.large (2 Cores, 8GB memory). Standard production provision with all services running on the one server. Added 10 quizzes and 10 flags, description text heavy, some with attachments, hints, different point values, etc. Turned on registration and started the event.

Each Perl script "client" was run from one of two separate m4.large instances. These scripts registered on the platform, logged in, and then went through an endless cycle of loading the gameboard page and performing AJAX requests every 5 seconds (3 times), for a total cycle of about 20 seconds. After each cycle, the script had a 2% randomized chance of submitting a flag; when that happened, the script then had a 20% chance of getting the flag answer correct. (A rough sketch of such a client loop is shown below.)
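
For anyone who wants to reproduce a similar load pattern, here is a hypothetical TypeScript sketch of such a client loop. It is not the actual Perl script; the AJAX endpoint paths are taken from the logs earlier in this thread, while the base URL and the flag-submission call are placeholders.

```typescript
// Simulated fbctf "client" session, loosely following the cycle described above.
// Assumes Node 18+ (global fetch). BASE and submitFlag() are placeholders.

const BASE = "https://fbctf.example.test"; // placeholder target URL
const AJAX_ENDPOINTS = ["/data/teams.php", "/data/country-data.php", "/data/map-data.php"];

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function submitFlag(correct: boolean): Promise<void> {
  // Placeholder: the real scripts would POST an answer to the platform here.
  console.log(correct ? "submitting a correct flag" : "submitting a wrong flag");
}

async function clientCycle(): Promise<void> {
  await fetch(BASE + "/").catch(() => undefined); // load the main page
  for (let i = 0; i < 3; i++) {                   // 3 AJAX refreshes, 5 s apart
    await sleep(5000);
    await Promise.all(AJAX_ENDPOINTS.map((p) => fetch(BASE + p).catch(() => undefined)));
  }
  if (Math.random() < 0.02) {                     // 2% chance to submit a flag
    await submitFlag(Math.random() < 0.2);        // 20% of submissions are "correct"
  }
}

async function main(): Promise<void> {
  // One simulated session looping forever; run many of these in parallel
  // (e.g. hundreds of processes) to approximate the load described above.
  for (;;) await clientCycle();
}

main();
```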

Started off with 50 script sessions running. This brought HHVM to about 125% utilization out of a total 200%. The platform was usable, but I did notice some throttling of the HHVM access log every few seconds, where it would "jam up" before continuing. Essentially the CPU on HHVM ranged from 50% to 150% every few seconds. Memory on HHVM and the entire platform was a non-issue.

I increased to 150 script sessions, and that brought the CPU average up to around 170%. Platform still usable, but now seeing cases where Modals were taking 10 seconds to pop up.

At 300 script sessions, the CPUs were maxed out on HHVM at 200%. The platform at this point became completely unusable as the HHVM access log crawled and froze up constantly. The script sessions started dying from timeouts. I ran perf for the first time with these results:

[screenshots: perf01, perf02]

I then stopped the instance and upgraded to an m4.16xlarge (64 cores, 256 GB memory). I resumed the event starting with 350 sessions. HHVM immediately ate up a significant amount of CPU on all 64 cores. The platform was usable, but again modals were taking seconds to pop up.

I started seeing this error all over the place (which I had not seen even once on the m4.large):

Error executing AsyncMysql operation: Failed (mysql error: 1040: Too many connections)

This caused many of the scripts to time out and/or die. I increased the connection limit to a million in the MySQL configuration and restarted it. This seemed to solve the issue for a few minutes, before I instead got this error:

Error executing AsyncMysql operation: Failed (mysql error: 2004: Can't create TCP/IP socket

Finally, at around 400 users, I started getting this error message from the scripts, which was very strange:

Error GETing /data/teams.php: hphp_invoke
Error GETing /data/country-data.php: hphp_invoke

The admin sessions page would now not even load, with the following error message:

[Tue Sep 26 07:36:35 2017] [hphp] [1817:7f8f25bff700:1194:000001] [] \nFatal error: Uncaught exception 'HH\\InvariantException' with message 'all_teams should of type Team and not null' in /var/www/fbctf/src/models/MultiTeam.php:72\nStack trace:\n#0 /var/www/fbctf/src/models/MultiTeam.php(72): HH\\invariant_violation()\n#1 /var/www/fbctf/src/controllers/AdminController.php(3749): MultiTeam::genTeam()\n#2 /var/www/fbctf/src/controllers/AdminController.php(4091): AdminController->genRenderSessionsContent()\n#3 /var/www/fbctf/src/controllers/AdminController.php(4104): AdminController->genRenderPage()\n#4 /var/www/fbctf/src/controllers/Controller.php(37): AdminController->genRenderBody()\n#5 /var/www/fbctf/src/Router.php(73): Controller->genRender()\n#6 /var/www/fbctf/src/Router.php(20): Router::genRouteNormal()\n#7 /var/www/fbctf/src/index.php(7): Router::genRoute()\n#8 (): genInit()\n#9

Somewhere between 400 and 500 users, the platform was again unusable. The HHVM access log was again crawling, unable to even remotely keep up with requests. I could not load the platform page. I ran perf again and pulled the resource graphs from AWS:

[screenshots: AWS resource graphs]

The timeline on the graph is that the event started at 6:30 GMT, and the instance was upgraded to the m4.16xlarge at 7:15 GMT. The reason the CPU isn't pegged some of the time is really my inability to keep the session scripts running while MySQL had too many connections or the scripts were dying from timeouts.

Overall, Memcached does seem to be a primary bottleneck inside of HHVM, as far as I can read perf. I don't understand how HHVM could utilize so many cores on the m4.16xlarge. Neither MySQL nor NGINX had any issues whatsoever in terms of resource usage. Memcached ended up utilizing 1.5 cores towards the end of my test with ~450 sessions.

fredemmott commented 7 years ago

HHVM's request thread pool defaults to 2x the number of cores, and you probably actually want higher: each concurrent request needs a thread.

A few things:

stevcoll commented 7 years ago

@fredemmott Yeah, I did run it as root (that was required for it to even function). I ran exactly this command:

sudo ./perf record -g --call-graph dwarf -p 5076

The report was generated with this:

sudo ./perf report -g

stevcoll commented 7 years ago

@fredemmott Also, I installed perf from apt-get and then used the binary from here, as the documentation specified. Perhaps I need the entire install from a newer source?

http://dl.hhvm.com/resources/perf.gz

fredemmott commented 7 years ago

You should be able to get it from your distribution's package manager; it sounds like the instructions are outdated.

We should also consider optimizing for the common single-webserver case and using APC by default instead: if value types such as dict are used, there is no serialization or unserialization. That will need the caching to be made pluggable, but that would be good anyway.

justinwray commented 7 years ago

@wisco24 That is an excellent example to illustrate what our usage of Memcached is and is not doing.

If you have 30 teams and 30 flags, and every team captures all of the flags, some of the Memcached data (related to scoring) would be invalidated a total of 900 times. However, only a subset of the data in the cache is invalidated by these actions. Furthermore, while this will cause some queries to execute 900 times, the previous implementation, before the data was cached, would have sent these requests on every scoreboard request, or thousands upon thousands of times. Not to mention the other data that isn't invalidated by the scoring action.

To be clear, previously every user/session would generate queries for the values within the database: teams, levels, scores, hints, etc. Those queries happened on every scoreboard request (including the AJAX refreshes) and on any new page or modal load. With the new implementation, the queries happen only when the cached data is invalidated or missing, and only once, with all users sharing the results. So instead of all 30 teams making tens of thousands of requests throughout the event, you are down to a few thousand across all of the teams.
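
Rough arithmetic behind that comparison, using wisco24's 30-team/30-flag example (the 8-hour game length and 5-second refresh interval are assumptions for illustration, based on the AJAX behaviour described earlier in the thread):

```typescript
// Cached invalidations vs. uncached per-session queries for one example event.

const teams = 30;
const flags = 30;
const gameHours = 8;       // assumed event length
const refreshSeconds = 5;  // assumed scoreboard AJAX interval

// With caching: score-related cache entries are invalidated once per capture
// (worst case: every team solves every flag).
const invalidations = teams * flags; // 900

// Without caching: every open session re-queries on every refresh.
const refreshesPerSession = (gameHours * 3600) / refreshSeconds; // 5760
const uncachedQueries = teams * refreshesPerSession;             // 172,800

console.log({ invalidations, uncachedQueries });
```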

Memcached is certainly not intended to replace the database, or entirely shield the database from the workload. Specifically, Memcached is implemented to store and reuse database results for as long as they are valid.

However, you are right; the more "busy" a game is, the more often queries will take place.

@stevcoll

Great job with the testing and reproducing the issue!

You cut off part of this error line:

Error executing AsyncMysql operation: Failed (mysql error: 2004: Can't create TCP/IP socket

Directly after the word socket should be a number; this number would indicate why the sockets could not be created. Most likely you ran out of file handles or TCP/IP sockets at the kernel level.

Take a look at ulimit on that instance to see some of the userspace limitations, and if you are hitting a kernel limit, take a look at net.core.somaxconn or the backlog values (net.core.netdev_max_backlog, net.ipv4.tcp_max_syn_backlog).

As we can see from your error:

all_teams should of type Team and not null

We are not getting valid results from MySQL, which is to be expected if the server is unable to fulfill the request.

We may want to look into further MySQL, Memcached, and system-level tuning for the production installation mode. While there is still optimization to be achieved within the platform code, the default configurations for these services are unlikely to be solid enough for larger events.

@fredemmott

given how unserialize is showing up, we should look at exactly what’s been stored in memcached and if we can store something cheaper to serialize/unserialize instead

Currently, the result Maps are being stored in Memcached.

fredemmott commented 7 years ago

Maps should be relatively efficient; I'm wondering if:

@stevcoll : can you make sure hhvm.perf_pid_map is on in your hhvm settings?

justinwray commented 7 years ago

@fredemmott

Maps should be relatively efficient

I would think so as well.

we're storing a larger map than we should

Less likely, as it's usually just a simple set of keys/values from the database row (so they are fairly small, in some cases just a single int).

we're simply fetching stuff from the cache more often than we should

This is more likely an area for improvement. The MC calls are taking place in the various class methods, and while there is not a major amount of overlap, there is certainly some.

Which leads to the option of potentially building a Cache class that handles the fetching of MC results and stores them in an object - so if a subsequent cache request is made, the data is retrieved from HHVM's memory instead of reaching out to MC.
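
A minimal sketch of that idea (in TypeScript for illustration rather than Hack, with a stand-in Memcached interface rather than a real client library): a per-request wrapper that serves repeat lookups from process memory and only falls back to Memcached on a miss.

```typescript
// Stand-in interface for a Memcached client; not a real library API.
interface MemcachedLike {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}

// Request-local cache: one instance per request/execution thread, so repeated
// lookups of the same key hit in-process memory instead of the network.
class RequestLocalCache {
  private local = new Map<string, string | null>();

  constructor(private backend: MemcachedLike) {}

  async get(key: string): Promise<string | null> {
    if (this.local.has(key)) return this.local.get(key)!; // served from memory
    const value = await this.backend.get(key);            // one round trip to Memcached
    this.local.set(key, value);
    return value;
  }

  async set(key: string, value: string): Promise<void> {
    this.local.set(key, value);
    await this.backend.set(key, value);
  }

  invalidate(key: string): void {
    this.local.delete(key); // backend invalidation is still needed on writes
  }
}
```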

fredemmott commented 7 years ago

Which leads to the option of potentially building a Cache class that handles the fetching of MC results and stores them in an object - so if a subsequent cache request is made, the data is retrieved from HHVM's memory instead of reaching out to MC.

I'll be putting some more thoughts on this over in #572

justinwray commented 7 years ago

Performance Testing and Improvements Update:

We continued to perform various load, stress, and scalability testing and wanted to provide an update to everyone here.

Improvements

First, let's cover the improvements. We have submitted three (3) pull requests that provide performance improvements:

The first PR, Facebook and Google Login Integration & LiveSync Update, is primarily a feature improvement, providing integration with Facebook and Google (for registration and login); however, this PR also includes a few minor performance enhancements, primarily around the use of caching. The performance difference from this PR alone should be relatively minimal, but it is one of the changes involved in the code we are testing.

The second PR, Local Caching of Database and Cache Results, is the primary performance update of the set. This new local caching is a major improvement on performance and scalability. From the PR message:

  • Results from the database (MySQL) are stored and queried from the Cache service (Memcached). Results from Memcached are now stored locally within HHVM memory as well.

  • Through the new Cache class a global cache object is created and used throughout a single HHVM execution thread (user request).

  • This change results in a 99.9995% reduction in Memcached requests (at scale), providing a major improvement in scalability and performance.

As you can see, this code change dramatically lowers the number of queries to Memcached, which, based on our recent testing, has shown to be a major area of performance degradation. While this is by no means a resolution to issue #572, it is a stop-gap until further rewrites of the models can be completed. Regardless, the performance and scalability implications of this new local cache are massively beneficial.

The last PR, Blocking AJAX Requests, is a further improvement on PR #565. From the PR:

  • AJAX requests for the gameboard are now individually blocking. A new request will not be dispatched until the previous request has completed.

  • AJAX requests will individually stop subsequent requests on a hard-error.

While this PR will not prevent performance issues, it does keep them from being exacerbated by non-stop user requests. The particular implementation ensures that a user does not request the same (AJAX) data more than once at a time; that is, it will wait until the previous request completes before requesting the refresh again. Again, this will not solve an existing performance degradation situation, but it will prevent the problem from getting worse.
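
For illustration, a blocking poll loop along those lines might look like this (a TypeScript sketch of the described behaviour, not the actual gameboard code):

```typescript
// Poll a gameboard endpoint without letting requests pile up: a new request
// is only dispatched after the previous one completes, and polling stops on
// a hard error instead of issuing further requests.
async function pollEndpoint(url: string, intervalMs = 5000): Promise<void> {
  for (;;) {
    try {
      const res = await fetch(url); // wait for the in-flight request to finish
      if (!res.ok) break;           // hard error: stop subsequent requests
    } catch {
      break;                        // network failure: stop polling
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// e.g. pollEndpoint("/data/teams.php");
```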

Testing

Next, let's cover the new testing results, based on the new code mentioned above. We utilized the existing dev branch code, with the new PRs, referred to above, merged. We also included a modified version of PR #564 (based on the discussion in the PR thread - we took just the schema indexing improvements). You can grab a copy of the fully merged code here: https://github.com/WraySec/fbctf/tree/merge. If you wish to test this code yourself, you can do the following:

For our testing, we utilized the recommended production installation: Production Mode with Multi Server Setup.

As we have covered in recent testing and posts to this thread, HHVM has been the major bottleneck, especially given the previous reliance on Memcached and the absurdly high number of hits to it from HHVM. As such, our testing has been focused on the HHVM server (since we are using separate servers for each service, HHVM is on its own server). In fact, during our testing, we never hit the limitations of a properly configured and tuned Nginx, Memcached, or MySQL server. The limitations always came from the HHVM server exclusively.

Results

Now, onto the results. With the new code improvements, we were able to support 75-150 simultaneous sessions per CPU core. That gives us 75 sessions per core optimally, with the platform still usable at up to 150 sessions per core. Past 150 sessions, while the platform would work, the latency was increasingly unbearable. The HHVM system would finally fail completely at approximately 1000 user sessions per core. Furthermore, the server, regardless of CPU cores or general resources, has a maximum upper limit of 2000 sessions. That is, no matter the hardware level, you will not successfully exceed 2000 sessions; this limitation was hit due to constraints of the networking stack (sockets, client ports, backlogs, network speed/capacity, etc.).

One of the issues we were previously seeing, before the new improvements, was a lack of scalability with additional resources: increasing the amount of processing power (or servers) provided little to no improvement in performance. With these new changes that is no longer the case; instead, the performance of the platform scales linearly with system resources. Providing additional processing power will increase the performance or supportable load of the platform. This change is a significant improvement and should allow people to dynamically scale the platform based on their level of users, sessions, and load.

To further test the scalability, we implemented load balancing of the HHVM server instances. This allowed us to deploy multiple HHVM servers to distribute the load across and increase the number of users supported. The numbers scale from the original figures provided, 75-150 sessions per core, with a slight increase when utilizing a cluster of HHVM servers, due to the networking relief each individual server experiences.

In fact, we deployed two HHVM instances with eight cores each and hit an upper bound of approximately 3500 users.

Resources

To summarize the resource requirements of the system, based on these new improvements, we have the following:

Minimum

1 CPU Core per 150 Users

Recommended

1 CPU Core per 75 Users

Over 2000 Users

HHVM, regardless of CPU cores, must be separated across multiple servers

Each HHVM server can only handle a maximum of 2000 users

Note: Memory usage was never a concern, as we never hit any upper limits of memory utilization. For transparency, we had approximately 2 GB of memory per core in our testing. However, again, we stayed well within single-digit percentages of memory utilization throughout our testing.
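
Putting those numbers together, a rough sizing calculation looks like this (just the arithmetic from this post, with a hypothetical helper name, not an official sizing tool):

```typescript
// Estimate HHVM tier sizing from the figures above: 1 core per 75 users
// recommended, 1 core per 150 users minimum, and at most 2000 users per
// HHVM server regardless of hardware.
function sizeHhvmTier(expectedUsers: number) {
  const servers = Math.max(1, Math.ceil(expectedUsers / 2000)); // 2000-user cap per server
  const usersPerServer = Math.ceil(expectedUsers / servers);
  return {
    servers,
    recommendedCoresPerServer: Math.ceil(usersPerServer / 75),
    minimumCoresPerServer: Math.ceil(usersPerServer / 150),
  };
}

console.log(sizeHhvmTier(500));  // { servers: 1, recommendedCoresPerServer: 7, minimumCoresPerServer: 4 }
console.log(sizeHhvmTier(3500)); // { servers: 2, recommendedCoresPerServer: 24, minimumCoresPerServer: 12 }
```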

Conclusion

The new improvements have provided a massive boost to the performance and scalability of the project. We will continue to perform load testing and make improvements where possible. Our next step will be a review of each AJAX endpoint used on the game board for further possible performance improvements.