Ekya is a system that enables continuous learning on resource-constrained devices. Given a set of video streams and pre-trained models, Ekya continuously fine-tunes the models to maximize accuracy by intelligently allocating resources between live inference and retraining in the background.
At the core of Ekya is the Thief Scheduler, which operates by stealing small resource chunks from a selected job and reallocating them to a more promising job. The Thief Scheduler learns about the "promise" of a job through the micro-profiling mechanism, which runs each retraining job for a short duration to estimate its future performance.
This architecture diagram highlights the flow of data in Ekya. More details can be found in our USENIX NSDI 2022 paper available here.
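As a rough illustration of the stealing loop described above, the sketch below moves one resource unit at a time from a "victim" job to a "thief" job and keeps the move only when an estimated total utility improves. The function name `thief_reallocate`, the integer resource units, and the `estimate_utility` callable (a stand-in for the estimates that micro-profiling would provide) are assumptions made for illustration; this is not Ekya's actual implementation.

```python
import math
from typing import Callable, Dict


def thief_reallocate(
    alloc: Dict[str, int],
    estimate_utility: Callable[[Dict[str, int]], float],
    step: int = 1,
    max_rounds: int = 100,
) -> Dict[str, int]:
    """Greedily move `step` units between jobs while estimated utility improves."""
    alloc = dict(alloc)  # work on a copy so the caller's dict is untouched
    best = estimate_utility(alloc)
    jobs = list(alloc)
    for _ in range(max_rounds):
        improved = False
        for thief in jobs:
            for victim in jobs:
                # A job cannot steal from itself or from a job with nothing left.
                if thief == victim or alloc[victim] < step:
                    continue
                trial = dict(alloc)
                trial[victim] -= step
                trial[thief] += step
                score = estimate_utility(trial)
                if score > best:
                    alloc, best, improved = trial, score, True
        if not improved:
            break  # no single steal helps any more
    return alloc


# Hypothetical utility with diminishing returns; job "b" benefits 3x more.
weights = {"a": 1.0, "b": 3.0}


def toy_utility(allocation: Dict[str, int]) -> float:
    return sum(weights[job] * math.sqrt(units) for job, units in allocation.items())


result = thief_reallocate({"a": 5, "b": 5}, toy_utility)
print(result)
```

The total number of resource units is conserved by construction, since every accepted move subtracts from one job exactly what it adds to another.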
With this release of Ekya, you can:
As a part of this repository, we present two new video datasets - Urban Traffic and Urban Building. In addition, Ekya can also run on the Cityscapes and Waymo datasets (see instructions below).
We have labelled both Urban Traffic and Urban Building datasets using our golden model (ResNeXT-101 trained on MS COCO). These labels are stored in files called samplelists.
The samplelists for each video clip, containing the objects detected and their labels, can be found in the `samplelists` directory in the dataset folder.
Each samplelist is a CSV with 6 columns: `["idx", "class", "x0", "y0", "x1", "y1"]`. Each column is described below:
- `idx`: Row index
- `class`: Object class; the mapping can be found in `/ekya/datasets/coco_classes.txt`
- `x0`: X coordinate of the top left of the bounding box
- `y0`: Y coordinate of the top left of the bounding box
- `x1`: X coordinate of the bottom right of the bounding box
- `y1`: Y coordinate of the bottom right of the bounding box

The origin of the image is at the top left of the video frame.
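A minimal sketch of reading a samplelist with Python's standard `csv` module is shown below. It assumes the CSV has a header row with the six column names listed above and that coordinates are numeric; the `Detection` class, the `read_samplelist` helper, and the two sample rows are made up for illustration.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, TextIO


@dataclass
class Detection:
    idx: int
    cls: str  # kept as a string; map it using coco_classes.txt
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        """Bounding box area in square pixels."""
        return (self.x1 - self.x0) * (self.y1 - self.y0)


def read_samplelist(f: TextIO) -> List[Detection]:
    """Parse a samplelist CSV (assumed to start with a header row)."""
    reader = csv.DictReader(f)
    return [
        Detection(
            idx=int(row["idx"]),
            cls=row["class"],
            x0=float(row["x0"]),
            y0=float(row["y0"]),
            x1=float(row["x1"]),
            y1=float(row["y1"]),
        )
        for row in reader
    ]


# Made-up rows standing in for a real samplelist file.
sample = io.StringIO(
    "idx,class,x0,y0,x1,y1\n"
    "0,2,10,20,110,70\n"
    "1,0,300,40,340,140\n"
)
detections = read_samplelist(sample)
for det in detections:
    print(det.idx, det.cls, det.area)
```

To read a real file, pass an open file handle instead of the `io.StringIO` object, for example `open(path, newline="")`.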
The Bellevue Traffic Video Dataset contains 62GB of traffic videos recorded from five pole-mounted fish-eye cameras in the city of Bellevue, WA. Each video stream is recorded at 1280x720@30fps, for a total of 101 hours of video across all cameras.
The dataset can be downloaded here. We also prepared combined labels and cropped objects for Ekya to run on the dataset.
The Urban Building dataset contains 24 hours of video recorded from a public PTZ camera with a non-stationary view in Las Vegas. The video is recorded at 1920x1080@0.2fps. Along with the video stream, we provide the labels in the samplelist format described above.
The dataset, object labels and cropped images of objects can be downloaded here.
Install Ray from source at commit `cf53b351471716e7bfa71d36368ebea9b0e219c5` (`Ray 0.9.0.dev0`) from the Ray repository. `pip install ray` is not sufficient.
git clone https://github.com/ray-project/ray/
cd ray
git checkout cf53b35
cd ..  # The paths below are relative to the directory containing the ray checkout.
sudo apt-get update
sudo apt-get install -y build-essential curl unzip psmisc ffmpeg
pip install cython==0.29.0 pytest
ray/ci/travis/install-bazel.sh
# (Windows only: set BAZEL_SH=C:\Program Files\Git\bin\bash.exe)
pushd ray/dashboard/client
npm install
npm run build
popd
cd ray/python
pip install -e . --verbose  # Add --user if you see a permission denied error.
3. After installing ray, clone the Ekya repository and install Ekya.
git clone https://github.com/edge-video-services/ekya/
cd ekya
pip install -e . --verbose
4. Install [Nvidia Multiprocess Service (MPS)](https://docs.nvidia.com/deploy/mps/index.html).
sudo apt-get update
sudo apt-get install nvidia-cuda-mps
5. Set your GPU to run in exclusive process mode and run Nvidia MPS daemon. This will require killing Xserver if it is running.
export CUDA_VISIBLE_DEVICES="0"
nvidia-smi -i 2 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d
**NOTE**: Starting with version `410.74`, Nvidia MPS does not necessarily honor GPU resource allocation for tasks. Please use version `392` or lower.
## Preparing Models
### Golden Model
The golden model is used to generate the image classification ground truth in Ekya.
Please download the resnext101 elastic model from
[here](https://github.com/allenai/elastic) into ```ekya/golden_model/```.
### Object Detection Model
The object detection model is used to identify objects from video frames.
Please download ```faster_rcnn_resnet101_coco``` from
[here](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf1_detection_zoo.md)
into ```ekya/object_detection_model/```.
## Running Ekya with Cityscapes Dataset
### Preprocessing the Cityscapes Dataset
1. Download the Cityscapes dataset using the instructions on the [website](https://www.cityscapes-dataset.com/) and extract the `leftImg8bit` and `gtFine` subdirectories from `leftImg8bit_trainvaltest.zip` and `gtFine_trainvaltest.zip` to a dataset directory.
2. Generate the samplelists by running
```bash
cd ekya/datasets/scripts
python cityscapes_generate_sample_lists.py --root <path to your cityscapes root>
```
Download the pretrained models for Cityscapes from here and extract them to a directory.
Run the multicity training script provided with Ekya.
cd ekya/experiment_drivers/
./driver_multicity.sh
You may need to modify `DATASET_PATH` and `MODEL_PATH` to point to your dataset and pretrained models directory, respectively. You must also set `NUM_GPUS` to reflect the number of GPUs to use.
This script will run all schedulers, including `thief`, `fair` and `noretrain`.
The results will be written to a timestamped directory at `results/ekya_expts/cityscapes/`.
Download `waymo_pretrain_model.tar` from here into `pretrained_models`.
cd ekya/pretrained_models
tar -xvf waymo_pretrain_model.tar
mv waymo_pretrain_model waymo
rm waymo_pretrain_model.tar
Download `waymo_classification_images.tar` from here into `dataset/waymo`.
cd dataset/waymo
tar -xvf waymo_classification_images.tar
rm waymo_classification_images.tar
cd ekya/datasets/scripts
python waymo_generate_sample_lists.py --root ../../../dataset/waymo/tfrecord --save-dir ../../../dataset/waymo
cd ekya/experiment_drivers
bash driver_profiling_waymo_golden.sh
Download `bellevue_pretrained_models.tar.gz` from here into `pretrained_models`.
cd pretrained_models
tar -xvf bellevue_pretrained_models.tar.gz
mv bellevue_pretrained_models bellevue
rm bellevue_pretrained_models.tar.gz
cd ekya/experiment_drivers
python driver_prepare_mp4.py \
--dataset bellevue \
--dataset-root ../../dataset \
--device 0 \
--model-path ../../object_detection_model/faster_rcnn_resnet101_coco_2018_01_28
cd ekya/experiment_drivers
bash driver_profiling_mp4_golden_bellevue.sh
Download `vegas_pretrained_models.tar.gz` from here into `pretrained_models`.
cd pretrained_models
tar -xvf vegas_pretrained_models.tar.gz
mv vegas_pretrained_models vegas
rm vegas_pretrained_models.tar.gz
Download `las_vegas_24h_[0-3].tar.gz` and `vegas_sample_lists.tar.gz` from here into `datasets/vegas`.
Decompress:
cd datasets/vegas
for i in {0..3}; do tar -xf las_vegas_24h_$i.tar.gz; done
tar -xf vegas_sample_lists.tar.gz
rm *.tar.gz
`las_vegas_24h_[0-3].mp4` here into `datasets/vegas`.
cd ekya/experiment_drivers
python driver_prepare_mp4.py \
--dataset vegas \
--dataset-root ../../dataset \
--device 0 \
--model-path ../../object_detection_model/faster_rcnn_resnet101_coco_2018_01_28
cd ekya/experiment_drivers
bash driver_profiling_mp4_golden_vegas.sh
To plot the results from the above runs, collect all the result directories. You can then use the `/viz/driver_viz_multicity_varyingcities.ipynb` notebook to plot your results. You will need to set `BASE_DIR` to the root of your Ekya log directory.
For example, if you run the cityscapes driver script with the default values (defaults are set for a shorter run), you should be able to produce the following figure:
To create figures with varying GPU counts, you will need to run the driver script for different `NUM_GPUS` counts and collate the results into one directory before using `/viz/driver_viz_multicity_varyingcities.ipynb`.
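One way to collate several result directories into a single directory is sketched below using only the Python standard library (Python 3.8 or newer, for `dirs_exist_ok`). The helper name `collate_results` and the `run_1gpu` / `run_2gpu` directory names are hypothetical; substitute the timestamped directories produced by your own runs.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable


def collate_results(run_dirs: Iterable[Path], dest: Path) -> Path:
    """Copy each run directory into `dest`, keeping its original name."""
    dest.mkdir(parents=True, exist_ok=True)
    for run_dir in run_dirs:
        shutil.copytree(run_dir, dest / run_dir.name, dirs_exist_ok=True)
    return dest


# Demonstration on throwaway directories standing in for real runs.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for gpus in (1, 2):
        run = root / f"run_{gpus}gpu"
        run.mkdir()
        (run / "results.json").write_text("{}")
    merged = collate_results([root / "run_1gpu", root / "run_2gpu"], root / "collated")
    collated_names = sorted(p.name for p in merged.iterdir())
    print(collated_names)
```

The collated directory can then be used as the `BASE_DIR` for the notebook.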
One of the baselines explored in our NSDI paper is a continuous model selection strategy. This baseline uses pre-cached models generated under different scenarios (e.g. weather, time of day, class distributions) and loads models according to the current scenario. As we demonstrate in the paper, Ekya outperforms this strategy:
To run these baselines, follow these steps:
# assume waymo dataset is ready
cd ekya/model_cache
# to train models used in the model cache experiments
bash driver_model_cache.sh
# to do the inference
bash driver.sh
# to plot figures
python plot.py
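To make the model selection baseline described above more concrete, the sketch below picks the cached model whose recorded class distribution is closest to the current one. Using class distributions as the scenario descriptor, the L1 distance, and all names and numbers here are assumptions for illustration; the scripts in `ekya/model_cache` may select models differently.

```python
from typing import Dict

Distribution = Dict[str, float]


def l1_distance(p: Distribution, q: Distribution) -> float:
    """Sum of absolute differences over the union of classes."""
    classes = set(p) | set(q)
    return sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in classes)


def select_cached_model(cache: Dict[str, Distribution], current: Distribution) -> str:
    """Return the name of the cached model with the closest class distribution."""
    return min(cache, key=lambda name: l1_distance(cache[name], current))


# Hypothetical cache: model name -> class distribution it was trained under.
cache = {
    "rush_hour": {"car": 0.7, "bus": 0.2, "person": 0.1},
    "night": {"car": 0.3, "bus": 0.1, "person": 0.6},
}
current = {"car": 0.6, "bus": 0.25, "person": 0.15}

chosen = select_cached_model(cache, current)
print(chosen)
```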
Ekya can be easily extended in two dimensions - adding custom schedulers and adding new continuous learning techniques.
Ekya schedulers are implemented in `ekya/schedulers/`. Any new scheduler must extend the `BaseScheduler` base class in `scheduler.py`.
The `BaseScheduler` class defines two key methods, `reallocation_callback` and `get_inference_schedule`. Their signatures and usage are described below.
from typing import List, Tuple


class BaseScheduler(object):
    def reallocation_callback(self,
                              completed_camera_name: str,
                              inference_resource_weights: dict,
                              training_resources_weights: dict) -> Tuple[dict, dict]:
        '''
        This callback is called when a training job completes. This provides the scheduler an opportunity to reconfigure
        resource allocations for jobs. Currently, only changes to the inference resources are reflected
        (because updating training jobs would require process restarts, an expensive operation).
        :param completed_camera_name: str, name of the job completed
        :param inference_resource_weights: the current inference resource allocation
        :param training_resources_weights: the current training resource allocation
        :return: new_inference_resource_weights, new_training_resources_weights, two dictionaries mapping resource weights for inference and training jobs.
        '''
        pass

    def get_inference_schedule(self,
                               cameras: List["Camera"],  # Ekya's Camera class, quoted so the snippet parses on its own
                               resources: float):
        '''
        Returns the schedule when inference-only jobs must be run. This must be super fast since this is the schedule
        used before the actual schedule from get_schedule is obtained.
        :param cameras: list of cameras
        :param resources: total resources in the system to be split across tasks
        :return: inference resource weights, hyperparameters to use for inference.
        '''
        pass
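As an example of how a custom scheduler might implement these two methods, the sketch below splits inference resources evenly across cameras. The `Camera` and `BaseScheduler` classes here are simplified stand-ins so the snippet runs on its own, and the `id` attribute, the dictionary keys, and the empty hyperparameter dictionaries are assumptions; in a real extension you would subclass the `BaseScheduler` from `scheduler.py` and follow its actual return formats.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Camera:
    """Stand-in for Ekya's Camera class; only an `id` attribute is assumed."""
    id: str


class BaseScheduler(object):
    """Stand-in exposing the two methods described above."""

    def reallocation_callback(self, completed_camera_name, inference_resource_weights,
                              training_resources_weights):
        raise NotImplementedError

    def get_inference_schedule(self, cameras, resources):
        raise NotImplementedError


class EvenSplitScheduler(BaseScheduler):
    """Hypothetical scheduler: equal inference share for every camera."""

    def reallocation_callback(self,
                              completed_camera_name: str,
                              inference_resource_weights: dict,
                              training_resources_weights: dict) -> Tuple[dict, dict]:
        # Hand the finished training job's share to the same camera's inference job.
        new_inference = dict(inference_resource_weights)
        new_training = dict(training_resources_weights)
        freed = new_training.get(completed_camera_name, 0.0)
        new_training[completed_camera_name] = 0.0
        new_inference[completed_camera_name] = new_inference.get(completed_camera_name, 0.0) + freed
        return new_inference, new_training

    def get_inference_schedule(self, cameras: List[Camera], resources: float):
        # Cheap to compute: one division, then a dictionary per output.
        share = resources / len(cameras) if cameras else 0.0
        weights: Dict[str, float] = {cam.id: share for cam in cameras}
        hyperparameters: Dict[str, dict] = {cam.id: {} for cam in cameras}
        return weights, hyperparameters


scheduler = EvenSplitScheduler()
weights, hyps = scheduler.get_inference_schedule([Camera("c1"), Camera("c2")], 1.0)
new_inf, new_train = scheduler.reallocation_callback(
    "c1", {"c1": 0.25, "c2": 0.25}, {"c1": 0.25, "c2": 0.25})
print(weights, new_inf, new_train)
```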
Currently, Ekya uses simple gradient updates to update the vision models for each camera running in the system. This repository also includes another incremental learning technique, iCaRL (CVPR 2017), using this implementation.
To add your own learning technique:
- Extend the `ekya.classes.MLModel` base class, located in `/ekya/classes`.
- In `/ekya/classes/model.py`, replace `MLModel` with your model in `RayMLModel = ray.remote(num_gpus=0.01)(<model>)`.
When installing Ray with `pip install -e . --verbose`, you may encounter the error `"[ray] [bazel] build failure, error --experimental_ui_deduplicate unrecognized"`. If so, please check out this issue. If other versions of `bazel` are installed, please install `bazel-3.2.0` following the instructions from here and compile Ray using `bazel-3.2.0`.
When installing Ekya with `pip install -e . --verbose`, the following error (a TensorFlow version issue) may show up. It can be resolved by running `pip install -e . --verbose --use-feature=2020-resolver`.
ERROR: After October 2020 you may experience errors when installing or
updating packages. This is because pip will change the way that it resolves
dependency conflicts. We recommend you use --use-feature=2020-resolver to
test your packages with the new resolver before it becomes the default.
tensorflow-gpu 2.2.0 requires gast==0.3.3, but you'll have gast 0.2.2 which
is incompatible. tensorflow-gpu 2.2.0 requires tensorboard<2.3.0,>=2.2.0,
but you'll have tensorboard 2.1.1 which is incompatible. tensorflow-gpu
2.2.0 requires tensorflow-estimator<2.3.0,>=2.2.0, but you'll have
tensorflow-estimator 2.1.0 which is incompatible. ekya 0.0.1 requires
tensorflow==2.2.0, but you'll have tensorflow 2.1.0 which is incompatible.
usage: Ekya [-h] [-ld LOG_DIR] [-retp RETRAINING_PERIOD]
[-infc INFERENCE_CHUNKS] [-numgpus NUM_GPUS]
[-memgpu GPU_MEMORY] [-r ROOT]
[--dataset-name {cityscapes,waymo}] [-c CITIES]
[-lpt LISTS_PRETRAINED] [-lp LISTS_ROOT] [-dc]
[-ir RESIZE_RES] [-w NUM_WORKERS] [-ts TRAIN_SPLIT]
[-dtfs] [-hw HISTORY_WEIGHT] [-rp RESTORE_PATH]
[-cp CHECKPOINT_PATH] [-mn MODEL_NAME] [-nc NUM_CLASSES]
[-b BATCH_SIZE] [-lr LEARNING_RATE] [-mom MOMENTUM]
[-e EPOCHS] [-nh NUM_HIDDEN] [-dllo] [-sched SCHEDULER]
[-usp UTILITYSIM_SCHEDULE_PATH]
[-usc UTILITYSIM_SCHEDULE_KEY] [-mpd MICROPROFILE_DEVICE]
[-mprpt MICROPROFILE_RESOURCES_PER_TRIAL]
[-mpe MICROPROFILE_EPOCHS]
[-mpsr MICROPROFILE_SUBSAMPLE_RATE]
[-mpep MICROPROFILE_PROFILING_EPOCHS]
[-fswt FAIR_INFERENCE_WEIGHT] [-nt NUM_TASKS]
[-stid START_TASK] [-ttid TERMINATION_TASK]
[-nsp NUM_SUBPROFILES] [-op RESULTS_PATH] [-uhp HYPS_PATH]
[-hpid HYPERPARAMETER_ID] [-pm] [-pp PROFILE_WRITE_PATH]
[-ipp INFERENCE_PROFILE_PATH]
[-mir MAX_INFERENCE_RESOURCES]
Ekya driver script for cityscapes dataset. Uses pretrained models to improve accuracy over time.
optional arguments:
-h, --help show this help message and exit
-ld LOG_DIR, --log-dir LOG_DIR
Directory to log results to
-retp RETRAINING_PERIOD, --retraining-period RETRAINING_PERIOD
Retraining period in seconds
-infc INFERENCE_CHUNKS, --inference-chunks INFERENCE_CHUNKS
Number of inference chunks per retraining window.
-numgpus NUM_GPUS, --num-gpus NUM_GPUS
Number of GPUs to partition.
-memgpu GPU_MEMORY, --gpu-memory GPU_MEMORY
Per GPU Memory in GB.
-r ROOT, --root ROOT Path to cityscapes dataset root.
--dataset-name {cityscapes,waymo}
Name of the dataset supported.
-c CITIES, --cities CITIES
comma separated str of list of cities to create
cameras. Num cameras = num of cities
-lpt LISTS_PRETRAINED, --lists-pretrained LISTS_PRETRAINED
comma separated str of lists used for training the
pretrained model. Used as history for continuing the
retraining. Usually frankfurt,munster.
-lp LISTS_ROOT, --lists-root LISTS_ROOT
root of sample lists. This must be downloaded from ekya repo.
-dc, --use-data-cache
Use data caching for cityscapes. WARNING: Might
consume lot of disk space.
-ir RESIZE_RES, --resize-res RESIZE_RES
Image size to use for cityscapes.
-w NUM_WORKERS, --num-workers NUM_WORKERS
Number of workers preprocessing the data.
-ts TRAIN_SPLIT, --train-split TRAIN_SPLIT
Train validation split. This float is the fraction of
data used for training, rest goes to validation.
-dtfs, --do-not-train-from-scratch
Do not train from scratch for every profiling task -
carry forward the previous model
-hw HISTORY_WEIGHT, --history-weight HISTORY_WEIGHT
Weight to assign to historical samples when
retraining. Between 0-1. Cannot be zero. -1 if no
reweighting.
-rp RESTORE_PATH, --restore-path RESTORE_PATH
Path to the pretrained models to use for init. Must be
downloaded from Ekya repo.
-cp CHECKPOINT_PATH, --checkpoint-path CHECKPOINT_PATH
Path where to save the model
-mn MODEL_NAME, --model-name MODEL_NAME
Model name. Can be resnetXX for now.
-nc NUM_CLASSES, --num-classes NUM_CLASSES
Number of classes per task.
-b BATCH_SIZE, --batch-size BATCH_SIZE
Batch size.
-lr LEARNING_RATE, --learning-rate LEARNING_RATE
Learning rate.
-mom MOMENTUM, --momentum MOMENTUM
Momentum.
-e EPOCHS, --epochs EPOCHS
Number of epochs per task.
-nh NUM_HIDDEN, --num-hidden NUM_HIDDEN
Number of neurons in hidden layer.
-dllo, --disable-last-layer-only
Adjust weights on all layers, instead of modifying
just last layer.
-sched SCHEDULER, --scheduler SCHEDULER
Scheduler to use. Either of fair, noretrain, thief,
utilitysim.
-usp UTILITYSIM_SCHEDULE_PATH, --utilitysim-schedule-path UTILITYSIM_SCHEDULE_PATH
Path to the schedule (period allocation) generated by
utilitysim.
-usc UTILITYSIM_SCHEDULE_KEY, --utilitysim-schedule-key UTILITYSIM_SCHEDULE_KEY
The top level key in the schedule json. Usually of the
format {}_{}_{}_{}.format(period,res_count,scheduler,u
se_oracle)
-mpd MICROPROFILE_DEVICE, --microprofile-device MICROPROFILE_DEVICE
Device to microprofile on - either of cuda, cpu or
auto
-mprpt MICROPROFILE_RESOURCES_PER_TRIAL, --microprofile-resources-per-trial MICROPROFILE_RESOURCES_PER_TRIAL
Resources required per trial in microprofiling. Reduce
this to run multiple jobs in together while
microprofiling. Warning: may cause OOM error if too
many run together.
-mpe MICROPROFILE_EPOCHS, --microprofile-epochs MICROPROFILE_EPOCHS
Epochs to run microprofiling for.
-mpsr MICROPROFILE_SUBSAMPLE_RATE, --microprofile-subsample-rate MICROPROFILE_SUBSAMPLE_RATE
Subsampling rate while microprofiling.
-mpep MICROPROFILE_PROFILING_EPOCHS, --microprofile-profiling-epochs MICROPROFILE_PROFILING_EPOCHS
Epochs to generate profiles for, per hyperparameter.
-fswt FAIR_INFERENCE_WEIGHT, --fair-inference-weight FAIR_INFERENCE_WEIGHT
Weight to allocate for inference in the fair
scheduler.
-nt NUM_TASKS, --num-tasks NUM_TASKS
Number of tasks to split each dataset into
-stid START_TASK, --start-task START_TASK
Task id to start at.
-ttid TERMINATION_TASK, --termination-task TERMINATION_TASK
Task id to end the Ekya loop at. -1 runs all tasks.
-nsp NUM_SUBPROFILES, --num-subprofiles NUM_SUBPROFILES
Number of tasks to split each dataset into
-op RESULTS_PATH, --results-path RESULTS_PATH
The josn file to write results to.
-uhp HYPS_PATH, --hyps-path HYPS_PATH
hyp_map.json path which lists the hyperparameter_id to
hyperparameter mapping.
-hpid HYPERPARAMETER_ID, --hyperparameter-id HYPERPARAMETER_ID
Hyperparameter id to use for retraining. From hyps-
path json.
-pm, --profiling-mode
Run in profiling mode?
-pp PROFILE_WRITE_PATH, --profile-write-path PROFILE_WRITE_PATH
Run in profiling mode?
-ipp INFERENCE_PROFILE_PATH, --inference-profile-path INFERENCE_PROFILE_PATH
Path to the inference profiles csv
-mir MAX_INFERENCE_RESOURCES, --max-inference-resources MAX_INFERENCE_RESOURCES
Maximum resources required for inference. Acts as a
ceiling for the inference scaling function.
If you use Ekya or the new datasets in your research, please cite the Ekya NSDI 2022 paper:
@inproceedings {276952,
author = {Romil Bhardwaj and Zhengxu Xia and Ganesh Ananthanarayanan and Junchen Jiang and Yuanchao Shu and Nikolaos Karianakis and Kevin Hsieh and Paramvir Bahl and Ion Stoica},
title = {{Ekya: Continuous Learning of Video Analytics Models on Edge Compute Servers}},
booktitle = {USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)},
year = {2022},
address = {Renton, WA},
url = {https://www.usenix.org/conference/nsdi22/presentation/bhardwaj},
publisher = {USENIX Association},
month = apr,
}