
Artifact Evaluation (AE) for Erms

Overview

Erms is an efficient resource management framework designed mainly for shared microservice environments with SLA guarantees. It includes three modules: offline profiling, online scaling, and online scheduling, as shown below. We evaluate Erms using DeathStarBench, including three applications: Social Network, Media Service, and Hotel Reservation. Each application has one or more services; each service consists of multiple microservices.

(Figure: Erms cluster architecture)

Environment

We deploy Erms on top of a Kubernetes (k8s) cluster and use Jaeger and Prometheus to collect application-level and OS-level metrics, respectively.

Tips for users

Prepare for AE

We assume that you already have a Kubernetes cluster that meets the following requirements.

To test Erms, you first need to modify the configuration in configs/*-global.yaml according to your environment. The most important configurations are usually:

You HAVE TO modify these two fields before the tests can start. You can also change other fields in the configuration files; check the YAML files whose names start with _example to see each field's usage.

After that, you can use main.py to initialize the application. To tell the program which application to initialize, set the environment variable ERMS_APP; its possible values are social, hotel, and media.
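For example, to initialize the Social Network application (a minimal sketch, assuming a python3 interpreter; adjust the invocation to your environment):

```bash
# Set ERMS_APP to choose the application: social, hotel, or media
ERMS_APP=social python3 main.py
```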

You also need to update the AE configuration stored in the CONFIG variable in AE/scripts/utils.py. Its fields are similar to those in the YAML files mentioned above.

Functional Evaluation for Erms

In this part, users can evaluate the functionality of each Erms module separately. For details of each module, refer to Section 3 of the paper.

General Script Arguments

Most of the scripts support the following arguments; for more details on each script, use -h to print its manual.
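For example, to print the manual of the dynamic provisioning script used later in this section:

```bash
./AE/scripts/dynamic-provisioning.sh -h
```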

Offline Profiling

Erms uses piecewise linear functions to profile microservice performance. The profiling process takes more than two days, so we provide collected traces. If users want to profile applications themselves, please check here for more details.

Note: a column called step must be added manually to the generated spanRelationships.csv file. It is used to distinguish between sequential and parallel invocations. For all invocations that belong to the same parent, those sharing the same step value are treated as parallel; otherwise, they are considered sequential. A simple example is provided below:
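For illustration, a hypothetical excerpt of spanRelationships.csv with the added step column might look like this (the parent/child column names and service names are placeholders; keep whatever columns the generated file already has):

```
parent,child,step
compose-post,text-service,1
compose-post,media-service,1
compose-post,post-storage,2
```

Here, text-service and media-service share step 1 and are therefore treated as parallel invocations, while post-storage (step 2) is considered sequential, running after them.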

(Figure: Jaeger screenshot illustrating parallel and sequential invocations)

You can refer to #11 for more information.

Online Scaling

Online Scaling determines how many containers to allocate to each microservice. It consists of three parts: dependency merge, latency target computation, and priority-based scheduling. Users can evaluate the end-to-end Online Scaling module or evaluate the three parts separately; the output will be printed to the terminal.
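The exact entry points are not listed here; the command below is a hypothetical placeholder, so check the AE/scripts/ directory for the actual script names:

```bash
# Hypothetical script name for illustration only; see AE/scripts/ for the real one
bash ./AE/scripts/online-scaling.sh
```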

Dynamic Provisioning (Online Scheduling)

With the results of online scaling (i.e., the number of containers allocated to each microservice), the dynamic provisioning module generates scheduling policies that assign the allocated containers to different nodes to balance interference across the cluster.

Run

```bash
./AE/scripts/dynamic-provisioning.sh
```

and check the printed result on the terminal.

(Figure: example output of the dynamic provisioning script)

Reproducible Evaluation for Erms

In this part, users can reproduce the experimental results from the paper. We repeated each evaluation five times and adopted the median latency to mitigate variance.

Please note that some scripts may print error messages similar to the following:

```
Error from server (NotFound): error when deleting "tmp/scheduledAPP/cast-info-service_Deployment.yaml": deployments.apps "cast-info-service" not found
```

To initialize the running environment, the scripts first delete any existing application pods before deploying the applications. If no such pods exist, Kubernetes throws a NotFound error. This error can safely be ignored, and the script will keep going.
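If you adapt the scripts yourself and want to suppress these messages, kubectl's delete command accepts an --ignore-not-found flag; a minimal sketch (the provided scripts work fine without this change):

```bash
# Deletes the deployment if it exists and stays silent otherwise
kubectl delete -f tmp/scheduledAPP/cast-info-service_Deployment.yaml --ignore-not-found
```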


Microservice Profiling Accuracy

Resource Efficiency and Performance

Evaluate Erms under static workload:

```bash
# Generate theoretical results
bash ./AE/scripts/theoretical-resource.sh
# Generate experimental results
bash ./AE/scripts/static-workload.sh
```

Note: this process may take about 6 hours.

Evaluate Erms under dynamic workload:

```bash
bash ./AE/scripts/dynamic-workload.sh
```

Note: this process may take about 20 hours.


Evaluation of Different Modules

Benefit of Priority Scheduling:

```bash
bash ./AE/scripts/benefit-priority-scheduling.sh
```

Benefit of Interference-based Scheduling:

```bash
bash ./AE/scripts/interference-scheduling.sh
```

Note: this process may take about 1 day.


How to reuse Erms beyond the paper

In this part, we introduce some tips on how to reuse Erms.

  1. The project is split into separate modules. Users can modify an individual module to build their own system; for example, users can modify the latency target computation to design a new resource allocation algorithm.
  2. We use YAML, which is easy for people to read, to configure Erms' arguments. Users can revise the YAML files instead of the code to run Erms easily under different configurations (see the sketch after this list).
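As an illustration, such a configuration file might look like the hypothetical sketch below; the field names here are placeholders, and the real schema is documented in the _example files under configs/:

```yaml
# Hypothetical sketch of an Erms configuration; the actual fields are
# documented in the _example YAML files under configs/.
app: social            # placeholder: which application to manage
slaTarget: 200         # placeholder: end-to-end latency SLA in ms
nodes:                 # placeholder: nodes available for scheduling
  - node-1
  - node-2
```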

Application profiling

In the Functional Evaluation part, we provide a script to demonstrate the profiling functionality of Erms. However, users who want to profile applications comprehensively need to run a much longer profiling process with more complicated configurations.

Here we provide a guide to help you complete the profiling process.