sandervh14 commented 2 years ago

Add functionality to write away model metadata

It would be nice to have a function that records, for each new modeling attempt, the date and time, which variables were selected, which preprocessing was done, and the resulting score.

Task Description

This could comprise:

store model metadata (scores, datetime, version, etc.) in a table
store the files involved (model and preprocessor pickle, and potentially the data) in file storage or as database blobs.

Provide the code for extracting this metadata, but allow a data scientist/engineer to write a plugin function to do the actual writing of the metadata to the database/filestore of choice.
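
A minimal sketch of what that split could look like. All names here (`ModelMetadata`, `write_metadata`, `make_jsonl_writer`) are hypothetical and not part of the current Cobra API: the library side collects the metadata, and the data scientist/engineer passes in any callable that persists it.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List

# Hypothetical plugin signature: any callable that accepts one metadata record.
MetadataWriter = Callable[[dict], None]


@dataclass
class ModelMetadata:
    """Hypothetical container for one modeling attempt (not a Cobra class)."""
    selected_variables: List[str]
    preprocessing: Dict[str, object]
    scores: Dict[str, float]
    version: str = "unknown"
    # Timestamp is taken when the record is created, in UTC.
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def write_metadata(metadata: ModelMetadata, writer: MetadataWriter) -> None:
    """Convert the metadata to a plain dict and hand it to the plugin."""
    writer(asdict(metadata))


def make_jsonl_writer(path: str) -> MetadataWriter:
    """Example plugin: append each record as one JSON line to a local file."""
    def _write(record: dict) -> None:
        with open(path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
    return _write
```

A user who wants a database or blob store instead would write their own function with the same one-argument signature and pass it to `write_metadata`.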

sandervh14 commented 2 years ago

Hi @sborms, @ZlaTanskY, @nicolasmorandi and @c-morey!

Before we can finish implementing this issue entirely and have better support for industrializing Cobra models, I think we need to discuss a few things.

Proposed use cases (and processes) throughout the industrialization and model usage phase that we could support better in Cobra:

1. persisting the trained model:
- the model has been trained with the existing Cobra API
- persisting the preprocessor pipeline configuration as a JSON file (1)
- persisting the model scores achieved (2)
- persisting the data (the basetable, containing all splits for reproducibility) (3)
2. model monitoring:
- scoring the model periodically on new data (basetables), reusing persisted Cobra models
- detecting data drift: let's not reinvent the wheel and integrate with an existing solution instead. Nicolas and I thought of trying out NannyML.
- persisting the findings (model scores, data metrics) of the above model monitoring steps (2)
- persisting the data (basetables) used for the above model monitoring steps (3)
- visualizing the model monitoring, preferably in a dashboard (4)
3. retraining the model (proactive/reactive maintenance of the model): same steps as use case 1.
4. facilitating easy deployment and running: provide example Python scripts for production-grade runnable Cobra models after the notebook phase, example Docker images/docker-compose files, ...

Integrations necessary for the above:

persisting files - see (1) above: we support writing to the local file system at the moment, but should consider supporting uploading to a server location instead, or as a blob in a database, etc.
persisting model scores - see (2) above:
persisting data - see (3) above: support databases both locally (MySQL, PostgreSQL) and in the cloud (BigQuery etc.)
visualizing model monitoring findings - see (4) above: integrate with PowerBI or other dashboarding software.
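
To make the score-persistence integration (2) a bit more concrete, here is a rough sketch. It assumes SQLite from the Python standard library as a stand-in for whichever database a client actually runs, and the function name `save_scores`, the table name `model_scores` and its columns are made up for illustration. One row per metric keeps the table usable for both the initial training scores and later monitoring runs.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Dict, Optional


def save_scores(
    conn: sqlite3.Connection,
    model_name: str,
    split: str,
    scores: Dict[str, float],
    scored_at: Optional[str] = None,
) -> int:
    """Insert one row per metric into a hypothetical `model_scores` table.

    Returns the number of rows written.
    """
    timestamp = scored_at or datetime.now(timezone.utc).isoformat()
    rows = [
        (model_name, split, metric, float(value), timestamp)
        for metric, value in scores.items()
    ]
    # `with conn` commits on success and rolls back if an exception is raised.
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS model_scores ("
            "model_name TEXT, split TEXT, metric TEXT, "
            "value REAL, scored_at TEXT)"
        )
        conn.executemany(
            "INSERT INTO model_scores VALUES (?, ?, ?, ?, ?)", rows
        )
    return len(rows)
```

For example, `save_scores(sqlite3.connect("scores.db"), "churn", "train", {"auc": 0.81})` would append a single row; the same call with a different `split` label could be reused for each monitoring run on a new basetable.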

Additional task: documenting the thoughts above (and any of your additions to them, of course), the use cases and integrations from the above that we support at the moment, and how to use them. While doing so, also fix the points listed in #133 (fix them here and close #133, or fix #133 separately and mention this issue as linked).

Am I missing interesting use cases or integrations above? Feel free to suggest.

Also: we cannot implement everything right now, or even in the coming years. We must pick the most interesting things at each point, adding use cases and integrations on the go as we industrialize Cobra for clients with different demands and infrastructure.

I've also gathered the files from the Brico pull request and started structuring them a bit, so we can integrate their efforts into Cobra; see the draft pull request mentioned below on this page (FYI only). But I'd like to first discuss the above thoughts before proceeding with the gathered code, so we agree on what we want to do.

sandervh14 commented 1 year ago

Nicolás is interested in investigating MLflow; that investigation could fit in this issue. See the details on MLflow's GitHub: all four MLflow components (Tracking for storing model parameters, Projects for reproducible runs, Models for easy deployment, and Model Registry to track model evolution throughout the model's lifecycle) are very interesting to build integrations for within Cobra. Up to you to decide, @nicolasmorandi @pietrodantuono.

PythonPredictions / cobra

Improved industrialization support: persisting of model (configuration), repeated model scoring, model monitoring #127
