Azure / Unreal-Pixel-Streaming

MIT License

Pixel Streaming in Azure

Important

UPDATE: An Azure Marketplace solution for Unreal Engine Pixel Streaming previously simplified the deployment process and added features such as lifecycle management and a custom metrics dashboard: https://azuremarketplace.microsoft.com/marketplace/apps/epicgames.unreal-pixel-streaming?tab=Overview. That solution has since been taken down from the Marketplace, and Epic Games has open sourced it here: https://github.com/ue4plugins/UnrealPixelStreamingOnAzure

Important: Before cloning this repo you must install the Git LFS extension from https://git-lfs.github.com/, then open a git/console command window and type git lfs install to initialize git-lfs. Then, in your cloned folder, run "git lfs install". The repo contains large binaries, so Git Large File Storage had to be enabled. Also, due to licensing we are unable to include the \Engine\Binaries\ThirdParty dlls exported from Unreal for your app in this repo, so you'll need to copy your own Binaries\ folder into the repo and check it in before the PixelStreamingDemo.exe app will run locally and remotely. See the Unreal 3D App section for details of this and other important steps.

Important: The main branch of this repo supports 3D Applications targeting Unreal Engine 4.27. If your application uses the previous 4.26 version of Unreal Engine please change the branch to ue-4.26 or use the v4.26 release tag.

Overview

This document gives an overview of how to deploy Unreal Engine's Pixel Streaming technology in Azure at scale. Pixel Streaming is a technology that Epic Games provides in Unreal Engine to stream remotely deployed interactive 3D applications through a browser (on a computer or mobile device) without the need for the connecting client to have GPU hardware. Additionally, this document describes the customizations Azure Engineering has built on top of the existing Pixel Streaming solution to provide additional resiliency, logging/metrics and autoscaling specifically for production workloads in Azure. The additions built for Azure are released here on GitHub as an end-to-end solution deployed via Terraform, which spins up a multi-region deployment with only a few Terraform commands. The deployment has many configurations to tailor to your requirements, such as which Azure region(s) to deploy to, the SKUs for each VM/GPU, the size of the deployment, HTTP/HTTPS and autoscaling policies (node count and percentage based).

For a detailed overview of Unreal Pixel Streaming and architectures in Azure, see our documentation here. For a more simplified quick-start for the process on manually deploying to a single VM with Matchmaker and Signaling Server in Azure, see the Microsoft documentation here. To jump directly to the documented steps for deploying this solution in Azure, click here.

Additions Added by Microsoft

Microsoft has worked with Epic to customize Pixel Streaming for the cloud using Microsoft Azure, which has resulted in many key additions for deploying and monitoring a Pixel Streaming solution at scale (some can be found here: GitHub pull request #7698). Below are the notable additions that have been incorporated into a fork of Unreal Engine on GitHub:

Architecture

User Flow

Let's walk through the general flow of what is shown in the architecture diagram above when a user connects to the service:

  1. Clients connect to their closest region (1 .. N regions) via Traffic Manager, which does a DNS redirect to the Matchmaking service VM.
  2. The Matchmaking Service redirects the client to an available node on the paired VMSS, which hosts the Signaling Service and Unreal 3D app. This redirect does not go through the Load Balancer: each VMSS node has its own public IP rather than sharing a single private LB IP, because otherwise the Matchmaking Service could not redirect to the specific VMSS node that is available (a LB would pick a random one).
  3. The Signaling Service streams back the 3D app rendered frames and audio content to the client via WebRTC, brokering any user input back to the 3D app for interactivity.
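The redirect in step 2 can be illustrated with a minimal Python sketch. This is not the Matchmaker's actual code (the Matchmaker is a Node.js service); the SignalingServer class, its fields, and the pick_available and redirect_url helpers are hypothetical names chosen for illustration, and the IP addresses are documentation placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SignalingServer:
    """One Signaling Service / 3D app pair known to the matchmaker (hypothetical model)."""
    public_ip: str
    port: int
    connected_clients: int = 0  # 0 means the stream is free


def pick_available(servers: list[SignalingServer]) -> Optional[SignalingServer]:
    """Return the first server with no connected client, or None if all are busy."""
    for server in servers:
        if server.connected_clients == 0:
            return server
    return None


def redirect_url(server: SignalingServer, use_https: bool = False) -> str:
    """Build the redirect target from the node's own public IP (no load balancer involved)."""
    scheme = "https" if use_https else "http"
    return f"{scheme}://{server.public_ip}:{server.port}/"


servers = [
    SignalingServer("203.0.113.10", 80, connected_clients=1),  # busy
    SignalingServer("203.0.113.11", 80),                       # free
]
target = pick_available(servers)
print(redirect_url(target) if target else "no stream available")
```

Because the redirect names one specific node, each node needs an address the client can reach directly, which is why the flow above relies on per-node public IPs.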

Azure SKU Recommendations

Below are the recommended compute SKUs for general usage of Pixel Streaming in Azure:

Important: It is recommended to first deploy your Pixel Streaming executable and run it on your desired GPU SKU to see the performance characteristics around CPU/Memory/GPU usage to ensure no resources are being pegged and frame rates are acceptable. Consider changing resolution and frames per second of the UE4 app to achieve acceptable quality per your requirements. Additionally, consider the IOPS / latency requirements for the 3D app when choosing a disk, as SSDs and/or striping disks will be key to gaining the best disk speed (some GPU SKUs might not support Premium SSDs so also consider disk striping for adding IOPS).

Optimizing Pixel Streaming in Azure

Be sure to check out the Pixel Streaming in Azure Overview documentation to learn more about optimizing for Azure VM SKUs, performance and pricing optimizations.

Recommendations for Further Optimizations

The current customized solution in GitHub has many additions that make deploying Pixel Streaming in Azure at scale easier. Below are further improvements to those customizations that would make it even better:

Configurations

Below are notable configurations to consider when deploying the Pixel Streaming solution in Azure.

Terraform Configuration

There was a tremendous amount of work that went into building out the Terraform deployment for Pixel Streaming; however, unless you plan on making major modifications you can focus just on the following 3 files:

iac\terraform.tfvars: This stores the global variable deployment_regions, which specifies which Azure region(s) will be used (default is "eastus") and their Virtual Network ranges.

iac\region\variables.tf: This is the most important file to be familiar with, as it has the configs for the gitpath (change to your Git fork) and pixel_stream_application_name (change to your UE4 app name), along with other notable parameters such as desired FPS (default 60), resolution (default 1080p), starting instance count (default 1), instances per node (default 1), and Azure VM SKUs for the MM (default Standard_F4s_v2) and SS (default Standard_NV6, a GPU SKU).

iac\variables.tf: This global variables file can mostly be ignored, unless needing to change the global resource group's name (base_resource_group_name), location (global_region, default: eastus), Traffic Manager port or storage account settings (tier/type).

See the Terraform section to learn more about the deployment files.

Deployment Script Configurations

The Git location referenced in the deployment is stored in the iac\region\variables.tf file. Important: You must have read access with a Personal Access Token (PAT) to the specified repository for the deployment to work, since when the VMs are created there is a git clone used to deploy the code to the VMs. Also, you'll want to validate if your organization needs to have Enterprise SSO enabled for your PAT.

Matchmaker Configuration

Below are the configurations available to the Matchmaker. A config.json file was added to the existing Matchmaker code to reduce hard-coding in the Matchmaker.js file:

{
  // The port clients connect to the Matchmaking service over HTTP
  "httpPort": 80,
  // The Matchmaking port the Signaling Service connects to the matchmaker over sockets
  "matchmakerPort": 9999,
  // Instances deployed per node, to be used in the autoscale policy (i.e., 1 unreal app running per GPU VM) – not yet supported
  "instancesPerNode": 1,
  // Amount of available Signaling Service / App instances to be available before we must scale up (0 will ignore)
  "instanceCountBuffer": 5,
  // Percentage amount of available Signaling Service / App instances to be available before we must scale up (0 will ignore)
  "percentBuffer": 25,
  // The number of minutes without scale-up activity before we consider whether to scale down (e.g., after hours, to reduce costs)
  "idleMinutes": 60,
  // % of active connections to total instances that we want to trigger a scale down if idleMinutes passes with no scaleup
  "connectionIdleRatio": 25,
  // Min number of available app instances we want to scale down to during an idle period (idleMinutes passed with no scaleup)
  "minIdleInstanceCount": 0,
  // The total amount of VMSS nodes that we will approve scaling up to
  "maxInstanceScaleCount": 500,
  // The Azure subscription used for autoscaling policy (set by Terraform)
  "subscriptionId": "",
  // The Azure Resource Group where the Azure VMSS is located, used for autoscaling (set by Terraform)
  "resourceGroup": "",
  // The Azure VMSS name used for scaling the Signaling Service / Unreal App compute (set by Terraform)
  "virtualMachineScaleSet": "",
  // Azure App Insights ID for logging and metrics (set by Terraform)
  "appInsightsId": ""
}
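The listing above is annotated with // comments, which standard JSON parsers reject. If you keep annotations like these in a working copy, the following Python sketch shows one way to strip them before parsing. It is an illustration only and not part of the repository; the strip_line_comments helper and the "note" key are hypothetical.

```python
import json


def strip_line_comments(text: str) -> str:
    """Remove // comments that appear outside of JSON string literals."""
    out = []
    in_string = False
    escaped = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Skip to the end of the line, keeping the newline itself.
            while i < len(text) and text[i] != "\n":
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


annotated = '''
{
  // The port clients connect to the Matchmaking service over HTTP
  "httpPort": 80,
  "matchmakerPort": 9999, // socket port for the Signaling Service
  "note": "http://example.invalid stays intact"
}
'''
config = json.loads(strip_line_comments(annotated))
print(config["httpPort"], config["matchmakerPort"])
```

The scanner tracks whether it is inside a string so that a // inside a value, such as a URL, is not mistaken for a comment.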

Signaling Server Configuration

Below are the configs available to the Signaling Server in its config file, some of which were added by Microsoft for Azure:

{
  "UseFrontend": false,
  "UseMatchmaker": true, // Set to true if using Matchmaker.
  "UseHTTPS": false,
  "UseAuthentication": false,
  "LogToFile": true,
  "HomepageFile": "player.htm",
  "AdditionalRoutes": {},
  "EnableWebserver": true,
  "matchmakerAddress": "",
  "matchmakerPort": "9999", // The web socket port used to talk to the MM.
  "publicIp": "localhost", // The Public IP of the VM -- set by Terraform.
  "subscriptionId": "", // The Azure subscription -- set by Terraform.
  "resourceGroup": "", // Azure RG -- set by Terraform.
  "virtualMachineScaleSet": "", // Azure VMSS -- set by Terraform.
  "appInsightsId": "" // Azure App Insights ID for logging/metrics -- set by Terraform.
}

TURN / STUN Servers

In some cases, you might need a STUN / TURN server in between the UE4 app and the browser to help identify public IPs (STUN) or get around certain NAT'ing/mobile carrier settings (TURN) that might not support WebRTC. Please refer to Unreal Engine's documentation for details about these options; however, for most users a STUN server should be sufficient.

Inside the SignallingWebServer\ folder there are PowerShell scripts used to spin up the Cirrus.js service, which communicates between the user and the UE4 app over WebRTC. Start_Azure_SignallingServer.ps1 and Start_Azure_WithTURN_SignallingServer.ps1 are used to launch with STUN / TURN options. Currently the Start_Azure_SignallingServer.ps1 file points to a public Google STUN server (stun.l.google.com:19302), but it's highly recommended to deploy your own for production. You can find many other public options online as well. Unreal Engine exports stunserver.exe and turnserver.exe when packaging up the Pixel Streaming 3D app, so you can set them up on your own servers (not included in repo): \Engine\Source\ThirdParty\WebRTC\rev.23789\programs\Win64\VS2017\release\

Start_Azure_SignallingServer.ps1 is called by runAzure.bat when deploying the Terraform solution, so if a TURN server is needed this can be changed in runAzure.bat to call Start_Azure_WithTURN_SignallingServer.ps1 with the right TURN server credentials updated in the PS file.
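For orientation, WebRTC clients describe STUN and TURN servers as a list of "iceServers" entries. The Python sketch below builds such a structure for the public STUN server mentioned above, with an optional TURN entry. It only illustrates the standard WebRTC configuration shape; it does not reproduce what the PowerShell scripts pass to Cirrus, and the peer_connection_options function, the example.invalid hosts and the credentials are hypothetical.

```python
import json
from typing import Optional


def peer_connection_options(stun_host: str,
                            turn_host: Optional[str] = None,
                            turn_user: str = "",
                            turn_password: str = "") -> str:
    """Return a WebRTC-style JSON string listing a STUN server and an optional TURN server."""
    urls = [f"stun:{stun_host}"]
    ice_server = {"urls": urls}
    if turn_host:
        # TURN relays traffic, so it needs credentials in addition to the URL.
        urls.append(f"turn:{turn_host}")
        ice_server["username"] = turn_user
        ice_server["credential"] = turn_password
    return json.dumps({"iceServers": [ice_server]})


# STUN only, using the public server referenced above.
print(peer_connection_options("stun.l.google.com:19302"))

# STUN plus a self-hosted TURN server (placeholder host and credentials).
print(peer_connection_options("stun.example.invalid:19302",
                              "turn.example.invalid:19303",
                              "demo-user", "demo-password"))
```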

Unreal 3D App

The Unreal 3D app and dependencies reside in GitHub (Git-LFS enabled) under the Unreal\ folder. The Unreal\ folder structure aligns with what is exported out of Unreal Engine, and below are the specific files\folders you will want to copy over the existing files provided in the example GitHub repository:

  1. Your exported <ProjectName>.exe should replace Unreal\PixelStreamingDemo.exe
  2. <ProjectName>\ folder associated with the <ProjectName>.exe should replace the Unreal\PixelStreaming\ folder.
  3. Important: Replace the Binaries folder of the repo with your Binaries folder that was generated when building your UE4 app (i.e., \Engine\Binaries\), as the third-party dlls and versions contained in the \Engine\Binaries\ThirdParty folder are specific to what was used in your 3D application. Due to licensing we are not able to include the .dlls in this repo, so it's important that you add them yourself. Then make sure you can click on your .exe and run it locally successfully from your cloned repo folder, to ensure all dependencies were copied over. This is the only thing that needs to be copied over from your own Engine\ folder to the repo.
  4. Nothing more needs to be copied over unless you've changed player.htm or made specific customizations to the MM or SS web servers. Such changes must be merged with the Microsoft customizations rather than copied over our WebServer\ files, so that neither set of changes is lost.
  5. Important: Be sure to check in any code/app changes back into your forked repo as the Terraform deployment pulls from GitHub on your deployment and not your local resources.

The Unreal application has some key parameters that are passed in upon startup, which the Terraform deployment and PowerShell script (startVMSS.ps1) handle for you:

<PixelStreamingApp>.exe -AudioMixer -PixelStreamingIP=localhost -PixelStreamingPort=8888 -WinX=0 -WinY=0 -ResX=1920 -ResY=1080 -Windowed -RenderOffScreen -ForceRes
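As a rough illustration of how the resolution and Signaling Server settings map onto that command line, here is a small Python sketch that assembles the same arguments. It is not taken from startVMSS.ps1; the launch_args function and its parameter names are hypothetical, and only the flags shown in the command above are used.

```python
def launch_args(exe: str, res_x: int = 1920, res_y: int = 1080,
                signaling_host: str = "localhost",
                signaling_port: int = 8888) -> list[str]:
    """Assemble the startup arguments shown above for a given resolution and signaling endpoint."""
    return [
        exe,
        "-AudioMixer",
        f"-PixelStreamingIP={signaling_host}",
        f"-PixelStreamingPort={signaling_port}",
        "-WinX=0",
        "-WinY=0",
        f"-ResX={res_x}",
        f"-ResY={res_y}",
        "-Windowed",
        "-RenderOffScreen",
        "-ForceRes",
    ]


# Example: the demo executable at 1280x720 instead of the 1920x1080 default.
print(" ".join(launch_args("PixelStreamingDemo.exe", res_x=1280, res_y=720)))
```

Keeping the arguments as a list rather than one string lets them be handed to a process launcher such as Python's subprocess.Popen without extra quoting.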

Notable app arguments to elaborate on for your understanding (see Unreal docs for others):

Autoscaling Configuration

Microsoft has added the ability to autoscale the 3D stream instances up and down. New logic in the Matchmaker evaluates the desired scaling policy and then scales the Virtual Machine Scale Set compute accordingly. This requires the Matchmaker VM to have a System Assigned Managed Service Identity (MSI) with permissions to scale the assigned VMSS resource, which is already set up for you in the Terraform deployment. This eliminates the need to pass special credentials, such as a Service Principal, to the Matchmaker. The MSI is given Contributor access to the region's Resource Group that was created in the deployment; please adjust as needed per your security requirements.

Here are the key parameters in the Matchmaker config.json that configure autoscaling for the Signaling Server and 3D app (VMSS nodes). Important: Be sure to check in any config changes back into your forked repo as the Terraform deployment pulls from GitHub on your deployment and not your local resources.

instanceCountBuffer : Minimum number of available streams before triggering a scale up (0 will ignore this). For instance, if set to 5, a scale up is triggered only when 4 or fewer streams are available.

percentBuffer : % of available streams before triggering a scale up (0 will ignore this). For instance, if you have 25 it will trigger a scale up if less than 25% of total connected Signaling Servers are available to stream.

idleMinutes : How many minutes of no new scale operations before considering a scale down (e.g., scale down after hours)

connectionIdleRatio : % of active streams to total instances that we want to trigger a scale down after idleMinutes passes.

minIdleInstanceCount : The number of VMSS nodes we want during an idle period (e.g., never go below 10 nodes)

maxInstanceScaleCount : The max number of VMSS nodes to scale out to (e.g., never scale above 250 VMs)
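To make the interplay of these parameters concrete, here is a minimal Python sketch of a scaling decision based on one reading of the descriptions above. It is not the Matchmaker's actual logic, which lives in its Node.js code; the ScalePolicy class and the should_scale_up and should_scale_down functions are hypothetical, and the exact comparisons (for example, strict "less than") are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ScalePolicy:
    """Autoscale settings mirroring the config.json keys named in the comments."""
    instance_count_buffer: int = 5       # instanceCountBuffer
    percent_buffer: int = 25             # percentBuffer
    idle_minutes: int = 60               # idleMinutes
    connection_idle_ratio: int = 25      # connectionIdleRatio
    min_idle_instance_count: int = 0     # minIdleInstanceCount
    max_instance_scale_count: int = 500  # maxInstanceScaleCount


def should_scale_up(policy: ScalePolicy, total: int, active: int) -> bool:
    """Scale up when too few streams are free, by count or by percentage."""
    if total >= policy.max_instance_scale_count:
        return False  # never grow past the approved ceiling
    available = total - active
    if policy.instance_count_buffer > 0 and available < policy.instance_count_buffer:
        return True
    if policy.percent_buffer > 0 and total > 0:
        return available * 100 < policy.percent_buffer * total
    return False


def should_scale_down(policy: ScalePolicy, total: int, active: int,
                      minutes_since_scale_up: int) -> bool:
    """Scale down only after an idle period, and only if few streams are in use."""
    if minutes_since_scale_up < policy.idle_minutes:
        return False
    if total == 0 or total <= policy.min_idle_instance_count:
        return False  # already at or below the idle floor
    return active * 100 < policy.connection_idle_ratio * total


policy = ScalePolicy()
print(should_scale_up(policy, total=10, active=6))    # 4 free streams, below the buffer of 5
print(should_scale_up(policy, total=100, active=50))  # 50 free streams, 50% available
print(should_scale_down(policy, total=100, active=10, minutes_since_scale_up=90))
```

With the default values, the first call reports that a scale up is needed, the second does not, and the third reports that a scale down is due because only 10% of instances are in use after more than 60 idle minutes.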

Player HTML & Custom Event Configuration

When Unreal Pixel Streaming is packaged from Unreal Engine the solution contains a \Engine\Source\Programs\PixelStreaming\WebServers\SignallingWebServer\player.htm file to customize the experience, along with the ability to customize JavaScript functions to send custom events between the browser and the 3D Unreal application. Please see Epic's robust documentation on how to make these extra customizations.

Deployment

This section walks through all the steps necessary to deploy this solution in Azure. Currently the deployment expects a Windows OS as it references powershell.exe directly, though a simple symlink of pwsh to powershell.exe on Linux apparently works (will be added in a future release). Important: Be sure to first follow the guidance in the Configurations section to set up the git repo location.

To deploy the solution, use the steps here:

  1. <random_prefix>-global-unreal-rg : This stores all global resources such as the Traffic Manager, Key Vault and Application Insights.
  2. <random_prefix>-<region>-unreal-rg : This stores the Virtual Machine Scale Set (VMSS) for the GPU nodes that have the 3D app and Signaling Server, the Matchmaker VM and Virtual Network resources.

Testing the Deployment: Open up a web browser and paste in the DNS name from the Traffic Manager in the global Resource Group (e.g., http://<random_prefix>.trafficmanager.net) to be redirected to an available stream. The DNS name can be found under "DNS name: <link>" in the Overview page of the Traffic Manager resource in the Azure Portal. If you've deployed to multiple regions, you will be redirected to the closest Azure region.

After the deployment, Terraform runs the following processes on these solution components in each region upon startup of each VM:

Redeploying Updates

The easiest way to redeploy updates to the solution is to do the following for each piece:

If we need to shut down the solution and start it up later, see below for the process. This is just shutting down the compute for the Matchmaker and the Signaling Servers, which are the costlier resources (especially the SS GPU VMs) vs. deleting all the resources and requiring a time-consuming redeployment.

Shutting down the core compute

Starting back up the core compute

Monitoring

Currently automated Azure dashboards aren't built when deploying the solution; however, outside of regular host metrics like CPU/Memory, some key metrics will be important to monitor in Azure Monitor/Application Insights such as:

View a tutorial on creating a dashboard in Azure Monitor here.

Supporting the Solution

In supporting the deployed solution, it is recommended to do a few key things:

Terraform

Below are the key files in the Terraform setup to understand when altering the code and tweaking the parameters.

Folder Structure

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

© 2021, Microsoft Corporation. All rights reserved