Azure / Moodle

Tooling and guidance on deploying Scalable Moodle Clusters on Azure.
MIT License

Mitigate performance issues through cache configuration and other improvements. #215

Open asw101 opened 3 years ago

asw101 commented 3 years ago

This PR mitigates performance and transient reliability issues that we identified during load testing with JMeter, using the Latency-Sensitive Stress Testing (time-gated-exam.jmx) exam, with tweaks and updates for the latest version. The changes are as follows:

  1. Sets the Moodle localcachedir to /tmp/localcachedir

    During testing of the Large size deployment, which defaults to Azure Premium Files as the external file share, we identified files in the /moodle/moodledata directory that caused increased latency. The first is the localcachedir directory, for which Moodle recommends a fast local file system when Moodle is clustered.

  2. Sets alternative_component_cache to /var/www/html/moodle/core_component.php

    This change works in conjunction with localcachedir and provides significant performance improvements when moodledata is located on an external file share such as Azure Premium Files (see related issue https://github.com/Azure/Moodle/issues/126 regarding GlusterFS). We chose this location because the directory must already exist and the web server must have permission to write to it.

  3. Increases default osDisk size from 30 GB (120 IOPS/3,500 Burst IOPS/25 MB/sec) to 256 GB (1,100 IOPS/3,500 Burst IOPS/125 MB/sec)

    During load testing, we believe we may have hit IOPS and/or throughput limits at the disk level, the VM level, or both, which can cause a VM to become unavailable. Updates to Disk and VM metrics will make this clearer. To mitigate this, we chose a Premium SSD size with significantly more IOPS and throughput.

    We initially chose 1,024 GB (5,000 IOPS/200 MB/sec) because it is the first size that does not rely on the 3,500 "Burst" IOPS. Latency also decreased as the disk size increased. However, a smaller size such as 256 GB (1,100 IOPS/3,500 Burst IOPS/125 MB/sec) may be suitable, so this PR changes the default from 30 GB to 256 GB.

    We applied this change to both the Virtual Machine Scale Set (VMSS) that handles the web traffic, as well as the Controller VM we use for JMeter testing (after resizing to match the VMSS), in order to maintain parity in terms of IOPS and throughput.

  4. Defaults Load Balancer and Public IP to the Standard SKU

    We upgraded our Load Balancer and Public IP to the Standard SKU to enable the Multi-dimensional metrics and alerts, particularly "SNAT connections". These help us avoid issues such as SNAT Port Exhaustion and confirm that we are not experiencing them.
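
To put the disk figures from change 3 side by side, here is a minimal shell sketch that computes how each larger size compares with the original default. The IOPS and throughput numbers are the ones quoted in the list above; the `ratio` helper is a hypothetical name used only for illustration and is not part of this PR.

```shell
#!/bin/sh
# Sketch: compare the disk sizes quoted in this PR against the original default
# (120 IOPS, 25 MB/sec). ratio() is a hypothetical helper, not part of the PR.
set -eu

# ratio NUM DEN: print NUM/DEN rounded to one decimal place.
ratio() {
    LC_ALL=C awk -v num="$1" -v den="$2" 'BEGIN { printf "%.1f", num / den }'
}

# Columns: size in GB, base IOPS, throughput in MB/sec (values from the PR text).
while read -r size iops mbps; do
    printf '%s GB: %sx IOPS, %sx throughput vs. default\n' \
        "$size" "$(ratio "$iops" 120)" "$(ratio "$mbps" 25)"
done <<'EOF'
256 1100 125
1024 5000 200
EOF
```

The comparison uses base IOPS only. Burst IOPS are left out because the 30 GB and 256 GB sizes share the same 3,500 burst figure.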

These changes have been tested to deploy successfully against the current master, though load testing was performed against an earlier commit.

(Special thanks to @iennae for feedback and insights throughout!)

asw101 commented 3 years ago

Thank you @naioja for your tweaks for NSG with Standard Load Balancer. I have merged the current changes from master and resolved the merge conflict. I have also included your suggested snippet to ensure the alternative_component_cache directory exists!
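
As a rough illustration of that directory check (this is not the actual snippet from the PR), the following shell sketch creates localcachedir and makes sure the directory that will hold the alternative_component_cache file exists and is writable. The function name `prepare_cache_paths` is hypothetical; the paths in the usage comment are the ones listed in the PR description.

```shell
#!/bin/sh
# Sketch only, not the snippet merged in this PR.
# prepare_cache_paths is a hypothetical helper name.
set -eu

# prepare_cache_paths LOCALCACHEDIR COMPONENT_CACHE_FILE
# Creates LOCALCACHEDIR and the parent directory of COMPONENT_CACHE_FILE,
# then fails if the current user cannot write to either directory.
# The body runs in a subshell, so its variables do not leak to the caller.
prepare_cache_paths() (
    localcachedir="$1"
    component_dir="$(dirname "$2")"

    mkdir -p "$localcachedir" "$component_dir"

    for dir in "$localcachedir" "$component_dir"; do
        if [ ! -w "$dir" ]; then
            echo "not writable: $dir" >&2
            exit 1
        fi
    done
)

# Example with the paths from this PR (run as the web server user):
# prepare_cache_paths /tmp/localcachedir /var/www/html/moodle/core_component.php
```

The writability check applies to whoever runs the script, so it is only meaningful when run as the account the web server uses.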