Grabbing this one. Part of the trick in making this work will be reloading and unloading the model configs for the TensorFlow Predictor on the fly and reliably. I reckon things will be simpler for the ONNX and Python Predictors. This one goes hand-in-hand with #890.
Description
Add support for serving many different models, where each model handles a subset of the possible inputs (e.g. city-based models). Because each model is designed for only a subset of the input queries, some models may be queried more often than others. Serve the top N most-queried models, loading and unloading models on an LRU basis.
Here are the different use cases that could be handled:
Implementation
- cron: periodically sync the model pool and reload models that have changed
- request: load a model on demand when a request targets one that isn't in memory
- python: expose a `load_model(self, disk_path)` hook on the predictor (see the sketch below)
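As a rough sketch of how the `load_model(self, disk_path)` hook could sit behind an LRU cap, here is a minimal Python example; the `PythonPredictor` shape, the `model_cache_size` config key, the `/mnt/models` layout, and the stub loader are all illustrative assumptions, not the final API:

```python
import collections


class ModelCache:
    """Keeps at most `capacity` loaded models in memory, evicting the
    least-recently-used one when a new model has to be loaded."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader  # callable: disk_path -> loaded model
        self.models = collections.OrderedDict()  # model key -> loaded model

    def get(self, key, disk_path):
        if key in self.models:
            self.models.move_to_end(key)  # mark as most recently used
            return self.models[key]
        if len(self.models) >= self.capacity:
            self.models.popitem(last=False)  # evict the LRU model
            # unloading hooks (freeing GPU memory, etc.) would go here
        self.models[key] = self.loader(disk_path)
        return self.models[key]


class PythonPredictor:
    # hypothetical predictor shape; a `config` dict carrying
    # `model_cache_size` is an assumption for this sketch
    def __init__(self, config):
        self.config = config
        self.cache = ModelCache(
            capacity=config.get("model_cache_size", 5),
            loader=self.load_model,
        )

    def load_model(self, disk_path):
        # a real predictor would deserialize the model from disk here
        # (e.g. pickle, SavedModel, ONNX session); a stub stands in
        return {"path": disk_path}

    def predict(self, payload):
        model_name = payload["model_name"]  # e.g. the city the query targets
        disk_path = f"/mnt/models/{model_name}"  # assumed local cache layout
        model = self.cache.get(model_name, disk_path)
        return model  # real code would run inference with `model`
```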
Open Questions
- Should models be loaded in `init()`?
- config questions (a YAML sketch follows this list):
  - `model_dir` would be a configurable field holding an S3 path / local path that points to a big pool of models. This is where all required models are pulled in from. The name of a model (or its unique identifier) is the name of a directory within the given `model_dir` path, within which multiple versions of that model can be found.
  - A `disk_model_cache_size` field should exist in the Cortex config; all cached models must fit on disk / in memory. There should also be a `model_cache_size` field that controls the number of models that can fit in memory at any point in time.
  - The API could point either to a pool of models (`model_dir`) or to a static list of models (`models`). The static list of models won't have the updating mechanism, and thus no version selection is possible when making predictions.
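To make these config questions concrete, here is a hedged YAML sketch of how the fields might look in an API spec; apart from `model_dir`, `models`, `disk_model_cache_size`, and `model_cache_size`, all field names and values are illustrative assumptions:

```yaml
# Option A: point the API at a dynamic pool of models.
# Each model lives in its own directory under model_dir, with one
# subdirectory per version (e.g. s3://my-bucket/models/paris/1/).
predictor:
  type: python
  model_dir: s3://my-bucket/models/  # pool that models are pulled in from
  model_cache_size: 5                # max models held in memory at once
  disk_model_cache_size: 20          # max models cached on disk at once

# Option B: point the API at a static list of models (no updating
# mechanism, no version selection at prediction time).
# predictor:
#   type: python
#   models:
#     - name: paris
#       path: s3://my-bucket/models/paris/
#     - name: tokyo
#       path: s3://my-bucket/models/tokyo/
```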
Notes
When making a prediction, either the `latest` version of a given model or a specific version of it (e.g. `v1`) can be requested. If no version is specified, then resort to using the `latest`.
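The version selection described above could be resolved as in this small Python sketch, assuming versions are stored as numeric subdirectories of the model's directory; the layout and the `resolve_version` helper are assumptions:

```python
import os


def resolve_version(model_root, requested=None):
    """Pick the version directory of a model to serve.

    Assumes `model_root` holds one subdirectory per version ("1", "2", ...).
    `requested` may be None, "latest", or an explicit version such as "1".
    """
    versions = [
        d for d in os.listdir(model_root)
        if os.path.isdir(os.path.join(model_root, d)) and d.isdigit()
    ]
    if not versions:
        raise ValueError(f"no versions found under {model_root}")
    if requested is None or requested == "latest":
        chosen = max(versions, key=int)  # highest numeric version wins
    elif requested in versions:
        chosen = requested
    else:
        raise ValueError(f"version {requested} not found under {model_root}")
    return os.path.join(model_root, chosen)
```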
Additional Context