drupalwxt / helm-drupal

Helm Chart for deploying an enterprise-grade Drupal environment.
https://drupalwxt.github.io/helm-drupal/index.yaml
MIT License
31 stars 22 forks source link

Using an Azure MySQL DB is twice as slow as an AKS hosted MySQL DB #109

Closed bernardmaltais closed 2 years ago

bernardmaltais commented 2 years ago

Good afternoon Will and Zach,

This is not an issue with the helm chart per say... but rather a question regarding the recommendation to use a CSP PaaS DB vs a K8s POD based DB for production. I just tried migrating our site to an Azure MySQL Flexible instance and I am finding that the site is less responsive... about half as responsive. The Azure MySQL PaaS instance is using a Private Endpoint on the same vnet as the AKS cluster running the Drupal solution. The MySQL PaaS instance is also super beefy... like as big as the node pool running the whole Drupal solution ;-) and yet it still perform poorly (not that the POD based DB was fast either).

Have you experienced the same at Stats Canada? Do you have any idea why this would be the case?

Regards,

Bernard

bernardmaltais commented 2 years ago

What is even stranger is that using sysbench I can confirm that the Azure MySQL is way faster than the AKS POD MySQL instance... but yet serving pages is twice as slow when using Azure MySQL...

POD MySQL:

root@troubleshooting-shell:~# sysbench /usr/share/sysbench/oltp_read_write.lua --mysql-host=drupal-wxt-prod-mysql --mysql-port=3306 --mysql-user=wxt --mysql-password='dgdsfgfdsdfgsd9G' --mysql-db=wxt --db-driver=mysql --tables=3 --table-size=10000 --report-interval=10 --threads=128 --time=60 run 2>&1
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 128
Report intermediate results every 10 second(s)
Initializing random number generator from current time

Initializing worker threads...

Threads started!

[ 10s ] thds: 128 tps: 36.17 qps: 870.10 (r/w/o: 630.92/154.36/84.82) lat (ms,95%): 5312.73 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 128 tps: 43.40 qps: 879.95 (r/w/o: 619.73/173.61/86.60) lat (ms,95%): 4943.53 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 128 tps: 44.80 qps: 872.40 (r/w/o: 605.60/176.70/90.10) lat (ms,95%): 4943.53 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 128 tps: 44.70 qps: 896.60 (r/w/o: 630.40/177.00/89.20) lat (ms,95%): 5217.92 err/s: 0.10 reconn/s: 0.00
[ 50s ] thds: 128 tps: 42.60 qps: 860.10 (r/w/o: 602.20/172.80/85.10) lat (ms,95%): 5312.73 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 128 tps: 43.40 qps: 863.50 (r/w/o: 601.70/175.00/86.80) lat (ms,95%): 5409.26 err/s: 0.00 reconn/s: 0.00
SQL statistics:
    queries performed:
        read:                            37492
        write:                           10708
        other:                           5354
        total:                           53554
    transactions:                        2676   (43.09 per sec.)
    queries:                             53554  (862.41 per sec.)
    ignored errors:                      2      (0.03 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          62.0964s
    total number of events:              2676

Latency (ms):
         min:                                  202.49
         avg:                                 2926.82
         max:                                 9196.74
         95th percentile:                     5217.92
         sum:                              7832167.98

Threads fairness:
    events (avg/stddev):           20.9062/1.71
    execution time (avg/stddev):   61.1888/0.63

Azure MySQL:

root@troubleshooting-shell:~# sysbench /usr/share/sysbench/oltp_read_write.lua --mysql-host=myssc.mysql.database.azure.com --mysql-port=3306 --mysql-user=azureadmin --mysql-password='dfgdfgdsffdgdg' --mysql-db=wxt --db-driver=mysql --tables=3 --table-size=10000 --report-interval=10 --threads=128 --time=60 run 2>&1
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 128
Report intermediate results every 10 second(s)
Initializing random number generator from current time

Initializing worker threads...

Threads started!

[ 10s ] thds: 128 tps: 240.64 qps: 5108.67 (r/w/o: 3623.26/985.54/499.87) lat (ms,95%): 1708.63 err/s: 5.90 reconn/s: 0.00
[ 20s ] thds: 128 tps: 265.91 qps: 5424.82 (r/w/o: 3812.09/1074.22/538.51) lat (ms,95%): 1561.52 err/s: 6.60 reconn/s: 0.00
[ 30s ] thds: 128 tps: 235.31 qps: 4815.79 (r/w/o: 3391.73/947.14/476.92) lat (ms,95%): 1708.63 err/s: 6.30 reconn/s: 0.00
[ 40s ] thds: 128 tps: 201.70 qps: 4170.93 (r/w/o: 2940.32/818.91/411.70) lat (ms,95%): 2120.76 err/s: 8.30 reconn/s: 0.00
[ 50s ] thds: 128 tps: 273.79 qps: 5605.46 (r/w/o: 3943.40/1106.47/555.59) lat (ms,95%): 1479.41 err/s: 8.00 reconn/s: 0.00
[ 60s ] thds: 128 tps: 246.30 qps: 5032.82 (r/w/o: 3536.41/997.60/498.80) lat (ms,95%): 1648.20 err/s: 6.20 reconn/s: 0.00
SQL statistics:
    queries performed:
        read:                            212520
        write:                           59672
        other:                           29945
        total:                           302137
    transactions:                        14765  (241.76 per sec.)
    queries:                             302137 (4947.10 per sec.)
    ignored errors:                      415    (6.80 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          61.0717s
    total number of events:              14765

Latency (ms):
         min:                                   41.12
         avg:                                  525.00
         max:                                 5863.51
         95th percentile:                     1678.14
         sum:                              7751658.26

Threads fairness:
    events (avg/stddev):           115.3516/11.06
    execution time (avg/stddev):   60.5598/0.33
sylus commented 2 years ago

@bernardmaltais we actually solved this, and went from pretty horrible performance to it being roughly 3-5x faster.

You will need to use:

https://docs.microsoft.com/en-us/azure/mysql/howto-redirection

See how we did it:

Docker

https://github.com/drupalwxt/docker-scaffold/commit/103f2b6f6f9e1cc0db54bba6d2c2df889e27bde0

Helm Chart

  # PHP configuration
  php:
    ini:
      mysqlnd_azure.enableRedirect: on

  extraSettings: |-
    $databases['default']['default']['pdo'][PDO::MYSQL_ATTR_SSL_CA] = '/etc/ssl/mysql/BaltimoreCyberTrustRoot.crt.pem';

The problem is the Azure Gateway adds massive latency to requests and makes Azure MySQL pretty unusable, but with the redirect module in place you talk directly to the backend bypassing this and you will notice via Azure Metrics your actually getting high IOPS again instead of it being below 10% most of the time.

The recalls site uses this and also Azure CDN and is pretty darn fast:

https://recalls-rappels.canada.ca/

sylus commented 2 years ago

@bernardmaltais although i'm only using the regular Azure MySQL PaaS service with 512GB storage size for high IOPS. It works pretty well. I haven't tested the redirection module with the flexible service though.

https://github.com/canada-ca-terraform-modules/terraform-azurerm-mysql

bernardmaltais commented 2 years ago

@sylus Thank you for the pointer... I guess it is time to update our scaffold stuff ;-)

The flexible service is more "flexible" and actually less expansive than the basic offering... and can be made HA.

sylus commented 2 years ago

@bernardmaltais our problem at least when i last checked is the flexible service is missing some key security features such as key vault integration for CMK, threat_detection_policy, and managed identities.

We have started working on the terraform spec though and starting with the PostgreSQL one:

https://github.com/canada-ca-terraform-modules/terraform-azurerm-flex-postgresql

sylus commented 2 years ago

@bernardmaltais but yeah I'd be interested if the redirect module works with the flexible service so please let me know!

I was thinking since is using private zones there wouldn't have been a performance hit but last I tried the flexible service performance was pretty bad and very similar to the regular PaaS MySQL without the redirection, so I am pretty certain somehow it suffers from the same problem!

bernardmaltais commented 2 years ago

@sylus I know what I will work on this week-end.

sylus commented 2 years ago

I would check if the flexible service supports it in the regular Managed PaaS there is a configuration parameter. I wonder if it is there for the flex service?

On the server side, redirection also needs to be enabled. You can enable redirection by updating the redirect_enabled server parameter using the Azure portal or Azure CLI. You can also use "show global variables like 'redirect_enabled'" to check the current redirection enable status on server side.

bernardmaltais commented 2 years ago

I am not attached to the flexible version as I just started testing today ;-) I will rebuild with the basic version and make it working 1st. Then I can try switching to flexible and see if this also work... baby steps. Man, this is turning in much more work than I expected... I will report back.

bernardmaltais commented 2 years ago

@sylus Just saw that redirection does not support Private Link... arg! Not so pleased about that. Azure is a bit of a bag of hurt when it comes to AKS and PaaS DB access!

sylus commented 2 years ago

Yeah though not AKS related since its really their database implementations that are the core problem. Azure Web Apps suffers from the same problems though now they have an option for the redirect issue as well.

https://stackoverflow.com/questions/63970567/php-on-azure-app-service-slow-performance-when-connected-to-azure-database-for-m

Interesting while PostgreSQL and MySQL suffer from this problem strangely MSSQL doesn't as they do something underneath the scenes for some reason.

Just a reminder don't forget to set your storage size to 512GB to get decent IOPS which is what we do in conjunction with the redirect setting. Then just do a reserved instance for a year or 3 if u can and will save 60%

bernardmaltais commented 2 years ago

@sylus I have been able to deploy and enable mysqlnd_azure... but I am not seeing any performance difference with my previous Flexible Private Endpoint DB... Maybe the responsiveness issue is not related to an Azure Gateway latency afterall... But it was worth trying:

bash-5.0$ php test.php
mysqlnd_azure.enableRedirect: 1
b1e662624122.tr2764.canadacentral1-a.worker.database.windows.net via TCP/IP
mysqli_result Object
(
    [current_field] => 0
    [field_count] => 1
    [lengths] =>
    [num_rows] => 294
    [type] => 0
)

The strange thing is that running sysbench show faster throughput and smaller latency when using the Azure MySQL DB vs POD... but real world site performance it at the opposite I have no idea what is going on.

sylus commented 2 years ago

So you did the following?

a) Enabled the mysqlnd redirection on php side b) Enabled the redirection server side in the mysql portal in the database configuration parameters c) You are using tls 1.2 and the baltimore cert d) sqlproxy is disabled

The performance difference was night and day for me, a fresh drupal install went from about 20-25 mins down to about 4-6 mins. If you do a fresh install how long does it take? Also while it is doing the install and you go to the database and select metrics and then IOPS do you see it peaking above say 60% and not always below 10%?

sylus commented 2 years ago

I also did some minor tweaks MySQL wise but I don't think any of these settings were particularly important:

https://github.com/canada-ca-terraform-modules/terraform-azurerm-mysql/blob/master/main.tf#L85...L143 https://github.com/canada-ca-terraform-modules/terraform-azurerm-mysql/blob/master/variables.tf#L233...L271

And finally the storage size of your db is set to 512?

bernardmaltais commented 2 years ago

a) Enabled the mysqlnd redirection on php side : yes b) Enabled the redirection server side in the mysql portal in the database configuration parameters : yes c) You are using tls 1.2 and the baltimore cert : yes d) sqlproxy is disabled : (I never enabled sqlproxy... so unless it is on by default I should not have to do that)

Doing a DB restore (fairly small DB that take 10 seconds on my dev machine to restore) take about 5 minutes on POD. I did notice that the Azure MySQL was giving my equivalent page loads as the POD based DB... so maybe the direct connect made a slight difference in leveling things out... but DB performance is still bad compared to a local dev laptop.

I am not expecting performance to be equivalent... but I was hoping to at least reduce the small restore to take only 1 minutes instead of 5 ;-(

I could share a copy of the DB if you are willing to try to restore on your AKS cluster vs on mine.

I started with a storage size of 256 and that gave me equivalent page load time as the POD. I increased the size to 512 and for some reasons page load time went up by one seconds after...

image

sylus commented 2 years ago

Yeah i'd expect a hit of around 20% in my experience then local and the MySQL db will def not be faster then an in cluster db as well due to networking and other considerations so that will never match unfortunately. I'd take a look at the mysql configuration I did above and set similar values which should help you a bit.

If you want near performance as your local sadly it will be flex server since technically gives you your own VM and SSD but the gateway problem will be your limiter.

I am using Gen5_v4 at 512GB storage and MySQL v8.0

bernardmaltais commented 2 years ago

OK, I will try making the suggested changes and then try upgrading the DB to match yours... maybe this will make a difference.

sylus commented 2 years ago

Yeah so:

Gen5_v4 512 GB for 1532 IOPS mysql_version 8.0 innodb_buffer_pool_size 16106127360 max_allowed_packet 536870912 table_definition_cache 5000 table_open_cache 5000

I obviously use redis and varnish as well to save both logged in and anonymous users performance respectively.

bernardmaltais commented 2 years ago

I do use varnish... but not redis... maybe the fact that I am not using redis is putting too much pressure on the DB for caching... and is not helping... Still... does not explain why it would take 5 minutes to restore a small db...

sylus commented 2 years ago

Yeah but redis is mainly for logged in users and if you have lots of content / views and make user of lots of entity references etc.

Agreed 5 mins for a small db seems a bit long, I don't fully remember for the recalls site but it is about 15 gb uncompressed and I think takes about 15 mins to import, but I am doing a restore to dev tomorrow so can get a more accurate time as pretty sure it is faster then that and report back.

Basically I just log into the container and do a:

drush sql-cli < db.sql

I can confirm that a containerized db or even a VM with mysql installed was faster, but that makes sense for Azure SQL since a 512GB instance will only be 1536 IOPS and a containerized and/or VM instance on SSD would be factors higher. This however wouldn't be the case with flexibile server which should be much faster but that darn gateway :P

In the end due to security concerns, and maintenance of db being offloaded went with Azure SQL and coupled with redis + Azure CDN etc found the performance to be pretty good once caches got built up but it wont ever be as fast as local sadly.

Though can see the pages are pretty snappy in the recalls site:

recalls-rappels.canada.ca

sylus commented 2 years ago

@bernardmaltais just one last tip just in case, copy your db.sql.gz to the /tmp folder in container and ensure not running the restore when its in the azure file mount or the restore will take forever.

cd /tmp
drush sql-drop
zcat db.sql.gz | drush sql-cli

Also can check this to make some adjustments to improve the import then can set back:

https://docs.microsoft.com/en-us/azure/mysql/concepts-migrate-dump-restore#preparing-the-target-azure-database-for-mysql-server-for-fast-data-loads

bernardmaltais commented 2 years ago

@sylus Interesting point. I do restore from the backup file located in the Azure File volume... I will try to use a custom restore script to copy the file 1st to temp and see if this make a difference.

So far my testing has shown that running MySQL in a POD is still the fastest... Here are the time to do a restore on 3 different MySQL:

POD: 4m49s Azure MySQL direct connect: 5m23s Azure MySQL flexible with private endpoint: 7m1s

Bernard

sylus commented 2 years ago

Hey @bernardmaltais so I was wrong about flex it doesn’t need redirect so given that and it’s properly colocated it should absolutely not be slower then the single server.

As I to eventually want to move to flex server for the advantages and it being pure Linux I’d love to know what is going on as well.

Planning to take a look myself but please to let me know if found anything. I can even engage with azure support.

sylus commented 2 years ago

Oh did u check out thread pools for flex?

https://techcommunity.microsoft.com/t5/azure-database-for-mysql-blog/achieve-up-to-a-50-performance-boost-in-azure-database-for-mysql/ba-p/2909691

bernardmaltais commented 2 years ago

@sylus I will have a look at the blog post on flexible server. Not sure why it is the slowest for me... possibly a configuration issue... but you can most certainly feel the difference when accessing the site as pages load twice as slow. I reached out to MS about that and I have not gotten an answer. Not a support ticket but reaching to their AKS guy...

bernardmaltais commented 2 years ago

@sylus I have deployed redis from the chart. How can I tell if it is active? I just used the defaults from the chart values.yaml file... one of the line was commented out so I left it not knowing what it would do. Any recommendations? Your recall site is super fast and I am trying to get close to that.

Also, have you ever written a tuning guide for your site? Asking for a friend (literally) ;-)

sylus commented 2 years ago

Hi @bernardmaltais for redis yeah all you should need:

Here is what mine looks like:

## Redis
## Ref: https://github.com/bitnami/charts/tree/master/bitnami/redis
redis:
  enabled: true
  password: ${var.redis_password_rsams}
  image:
    registry: docker.io
    pullSecrets:
      - acr-registry
  clientInterface: PhpRedis

You should then see a redis master and 2 replicas come up. The helm chart should enable redis for you but in case doesn't just run a drush en redis. Then you should see redis enabled and connection in the status report. With redis enabled the following gets enabled in settings.php:

Probably will need to clear your cache with a drush cr. Just a note that our database was huge so we needed to bump our redis size to 16GB.

bernardmaltais commented 2 years ago

Thanks. I was able to deploy and according to JMeter is reduced logged in user page load by about 20%. I saw that APCu is also recommended to be deployed on the drupal instance... but from what I could read it look like it is better fir for single node systems and that Redis essentially does the same thing. Have you done any testing between APCu and redis?

sylus commented 2 years ago

As i understand it there is no need for APC (or any of the similar sort of bytecode caching extensions like XCache) as of PHP 5.5 and later. The PHP developers directly integrated what they call OPCache which we are using. For the userdata portion of ACPu yeah redis will supercede that.

sylus commented 2 years ago

Any word on Flex Server @bernardmaltais also did you enable proxysql im the helm chart? As they say should do that to.

bernardmaltais commented 2 years ago

@sylus I have not. I have been testing with the redirection. Once the log4j wave has passed I should have more time to look into that again. But for now the speed of the redirection is acceptable ;-)

sylus commented 1 year ago

Just wanted to chime back that I made the switch to Flex Server using the Business Critical tier and performance seems to be pretty good!