Grabbing this one. Part of the trick in making this work will be reloading and unloading the model configs for the TensorFlow Predictor on the fly and reliably. I reckon things will be simpler for the ONNX and Python Predictors. This one goes hand-in-hand with #890.
Description
Add support for serving many different models, where each model handles a subset of the possible inputs (e.g. city-based models). Because each model is designed for only a subset of the input queries, some models may be queried more often than others. Serve the top N most-queried models, loading and unloading models on an LRU basis.
Here are the different use cases that could be handled:
Implementation
- cron: periodically sync the model pool and reload models that have changed
- request: load a model on demand when a request targets one that isn't in memory
- python: expose a `load_model(self, disk_path)` hook on the predictor (see the sketch below)
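As a rough sketch of how the `load_model(self, disk_path)` hook could sit behind an LRU cap, here is a minimal Python example; the `PythonPredictor` shape, the `model_cache_size` config key, the `/mnt/models` layout, and the stub loader are all illustrative assumptions, not the final API:

```python
import collections


class ModelCache:
    """Keeps at most `capacity` loaded models in memory, evicting the
    least-recently-used one when a new model has to be loaded."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader  # callable: disk_path -> loaded model
        self.models = collections.OrderedDict()  # model key -> loaded model

    def get(self, key, disk_path):
        if key in self.models:
            self.models.move_to_end(key)  # mark as most recently used
            return self.models[key]
        if len(self.models) >= self.capacity:
            self.models.popitem(last=False)  # evict the LRU model
            # unloading hooks (freeing GPU memory, etc.) would go here
        self.models[key] = self.loader(disk_path)
        return self.models[key]


class PythonPredictor:
    # hypothetical predictor shape; a `config` dict carrying
    # `model_cache_size` is an assumption for this sketch
    def __init__(self, config):
        self.config = config
        self.cache = ModelCache(
            capacity=config.get("model_cache_size", 5),
            loader=self.load_model,
        )

    def load_model(self, disk_path):
        # a real predictor would deserialize the model from disk here
        # (e.g. pickle, SavedModel, ONNX session); a stub stands in
        return {"path": disk_path}

    def predict(self, payload):
        model_name = payload["model_name"]  # e.g. the city the query targets
        disk_path = f"/mnt/models/{model_name}"  # assumed local cache layout
        model = self.cache.get(model_name, disk_path)
        return model  # real code would run inference with `model`
```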
Open Questions
- Should models be loaded in `init()`?
- config questions (a YAML sketch follows this list):
  - `model_dir` would be a configurable field holding an S3 path / local path that points to a big pool of models. This is where all required models are pulled in from. The name of a model (or its unique identifier) is the name of a directory within the given `model_dir` path, within which multiple versions of that model can be found.
  - A `disk_model_cache_size` field should exist in the Cortex config; all cached models must fit on disk / in memory. There should also be a `model_cache_size` field that controls the number of models that can fit in memory at any point in time.
  - The API could point either to a pool of models (`model_dir`) or to a static list of models (`models`). The static list of models won't have the updating mechanism, and thus no version selection is possible when making predictions.
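To make these config questions concrete, here is a hedged YAML sketch of how the fields might look in an API spec; apart from `model_dir`, `models`, `disk_model_cache_size`, and `model_cache_size`, all field names and values are illustrative assumptions:

```yaml
# Option A: point the API at a dynamic pool of models.
# Each model lives in its own directory under model_dir, with one
# subdirectory per version (e.g. s3://my-bucket/models/paris/1/).
predictor:
  type: python
  model_dir: s3://my-bucket/models/  # pool that models are pulled in from
  model_cache_size: 5                # max models held in memory at once
  disk_model_cache_size: 20          # max models cached on disk at once

# Option B: point the API at a static list of models (no updating
# mechanism, no version selection at prediction time).
# predictor:
#   type: python
#   models:
#     - name: paris
#       path: s3://my-bucket/models/paris/
#     - name: tokyo
#       path: s3://my-bucket/models/tokyo/
```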
Notes
When making a prediction, either the `latest` version of a given model or a specific version of it (e.g. `v1`) can be requested. If no version is specified, then resort to using the `latest`.
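The version selection described above could be resolved as in this small Python sketch, assuming versions are stored as numeric subdirectories of the model's directory; the layout and the `resolve_version` helper are assumptions:

```python
import os


def resolve_version(model_root, requested=None):
    """Pick the version directory of a model to serve.

    Assumes `model_root` holds one subdirectory per version ("1", "2", ...).
    `requested` may be None, "latest", or an explicit version such as "1".
    """
    versions = [
        d for d in os.listdir(model_root)
        if os.path.isdir(os.path.join(model_root, d)) and d.isdigit()
    ]
    if not versions:
        raise ValueError(f"no versions found under {model_root}")
    if requested is None or requested == "latest":
        chosen = max(versions, key=int)  # highest numeric version wins
    elif requested in versions:
        chosen = requested
    else:
        raise ValueError(f"version {requested} not found under {model_root}")
    return os.path.join(model_root, chosen)
```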
Additional Context