Open asw101 opened 3 years ago
Thank you @naioja for your tweaks for NSG with Standard Load Balancer. I have merged the current changes from master and resolved the merge conflict. I have also included your suggested snippet to ensure the alternative_component_cache directory exists!
This PR mitigates performance and transient reliability issues that we identified during load testing with JMeter, using the Latency-Sensitive Stress Testing (time-gated-exam.jmx) exam with tweaks and updates for the latest version. The changes are as follows:
Sets the Moodle `localcachedir` to `/tmp/localcachedir`
During testing of the Large size deployment, which defaults to Azure Premium Files as the external file share, we identified files in the `/moodle/moodledata` directory that caused increased latency. The first is the `localcachedir` directory, for which Moodle recommends a fast local file system when Moodle is clustered.
Sets `alternative_component_cache` to `/var/www/html/moodle/core_component.php`
This change works in conjunction with `localcachedir` and provides significant performance improvements when `moodledata` is located on an external file share such as Azure Premium Files (see related issue https://github.com/Azure/Moodle/issues/126 regarding GlusterFS). We chose this directory because it must already exist and the web server must have permissions to write to it.
Increases default osDisk size from 30GB (120 IOPS/3,500 Burst IOPS/25MB/sec) to 256GB (1,100 IOPS/3,500 Burst IOPS/125MB/sec)
During load testing we believe we may have hit IOPS and/or throughput limits at the disk and/or VM level, which can cause a VM to become unavailable. Updates to Disk and VM metrics will make this clearer. To mitigate this we chose a Premium SSD size with significantly more IOPS and throughput.
We initially chose 1,024GB (5,000 IOPS/200MB/sec) because this is the first size that does not rely on the 3,500 "Burst" IOPS. Latency also decreased as the disk size was increased. However, a smaller size such as 256GB (1,100 IOPS/3,500 Burst IOPS/125MB/sec) may be suitable, and this PR changes the default from 30GB to 256GB.
We applied this change to both the Virtual Machine Scale Set (VMSS) that handles the web traffic, as well as the Controller VM we use for JMeter testing (after resizing to match the VMSS), in order to maintain parity in terms of IOPS and throughput.
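To put the disk figures quoted above side by side, here is a small Python sketch of the arithmetic. The numbers are copied from this description rather than queried from Azure, and `DiskTier` and `headroom` are illustrative names that do not appear in the templates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskTier:
    size_gb: int
    iops: int             # provisioned (non-burst) IOPS
    throughput_mbps: int  # MB/sec


# Figures as quoted in this PR description (assumed accurate, not fetched from Azure).
TIERS = {
    30: DiskTier(30, 120, 25),
    256: DiskTier(256, 1100, 125),
    1024: DiskTier(1024, 5000, 200),
}


def headroom(old: DiskTier, new: DiskTier) -> tuple:
    """Return (IOPS multiplier, throughput multiplier) when moving from old to new."""
    return new.iops / old.iops, new.throughput_mbps / old.throughput_mbps


if __name__ == "__main__":
    for size in (256, 1024):
        iops_x, tput_x = headroom(TIERS[30], TIERS[size])
        print(f"30GB -> {size}GB: {iops_x:.1f}x IOPS, {tput_x:.1f}x throughput")
```

With these figures, moving from 30GB to 256GB gives roughly 9.2x the provisioned IOPS and 5x the throughput, while 1,024GB gives roughly 41.7x and 8x.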
Defaults Load Balancer and Public IP to the Standard SKU.
We upgraded our Load Balancer and Public IP to the Standard SKU to enable the multi-dimensional metrics and alerts, particularly "SNAT connections". These help us avoid issues such as SNAT port exhaustion, and confirm that we are not experiencing them.
These changes have been tested to deploy successfully against the current master, though load testing was performed against an earlier commit.
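As a rough post-deployment check, the two cache paths described above could be verified on a web node with something like the following Python sketch. `check_cache_paths` is a hypothetical helper and not part of this PR. It tests access for whichever user runs it, so it would need to be run as the web server user (which account that is depends on the image).

```python
import os

# Paths as given in this PR description.
LOCALCACHEDIR = "/tmp/localcachedir"
COMPONENT_CACHE = "/var/www/html/moodle/core_component.php"


def check_cache_paths(localcachedir=LOCALCACHEDIR, component_cache=COMPONENT_CACHE):
    """Return a list of problems found; an empty list means both paths look usable."""
    problems = []

    # localcachedir must be an existing directory the current user can write to.
    if not os.path.isdir(localcachedir):
        problems.append(f"{localcachedir} is not a directory")
    elif not os.access(localcachedir, os.W_OK | os.X_OK):
        problems.append(f"{localcachedir} is not writable")

    # alternative_component_cache is a file, so check its parent directory instead.
    parent = os.path.dirname(component_cache)
    if not os.path.isdir(parent):
        problems.append(f"{parent} is not a directory")
    elif not os.access(parent, os.W_OK | os.X_OK):
        problems.append(f"{parent} is not writable")

    return problems
```

Calling `check_cache_paths()` with no arguments checks the defaults above; any returned strings describe what would need fixing before Moodle can use those locations.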
(Special thanks to @iennae for feedback and insights throughout!)