eto-ai / rikai

Parquet-based ML data format optimized for working with unstructured data
https://rikai.readthedocs.io/en/latest/
Apache License 2.0

Support persistent model catalog #214

Open eddyxu opened 3 years ago

eddyxu commented 3 years ago

Similar to Spark / Hive's catalog, the SQL ML catalog can be backed by persistent storage, e.g., a SQL database (MySQL/PostgreSQL).
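
For illustration only, here is a minimal sketch of what such a persistent catalog abstraction might look like on the Python side; the ModelCatalog and ModelSpec names and methods below are hypothetical, not part of rikai:

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ModelSpec:
    """Metadata needed to recreate a model UDF in any Spark session."""
    name: str
    flavor: str                      # e.g. "pytorch", "tensorflow"
    uri: str                         # where the model artifact lives
    options: Dict[str, str] = field(default_factory=dict)


class ModelCatalog(ABC):
    """Hypothetical persistent catalog interface, backed by e.g. MySQL/Postgres or MLflow."""

    @abstractmethod
    def create_model(self, spec: ModelSpec) -> None:
        """Persist a model definition so any future session can resolve it."""

    @abstractmethod
    def get_model(self, name: str) -> Optional[ModelSpec]:
        """Look up a model by name; return None if it is not registered."""

    @abstractmethod
    def list_models(self) -> List[ModelSpec]:
        """Enumerate all registered models."""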

eddyxu commented 2 years ago

Currently, we have to load / create a model after a PySpark session is created:


from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("CREATE MODEL ...")

This loads the model definition, together with the generated pandas UDF, into that single Spark session, and every user has to repeat the step for every session they start.
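
For context, the full per-session flow looks roughly like the sketch below; the model name, URI, and CREATE MODEL clauses are illustrative rather than exact rikai syntax:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Every user has to (re-)register the model in every new session;
# nothing survives once the session ends. Names and URIs are illustrative.
spark.sql("""
    CREATE MODEL my_resnet
    OPTIONS (device="cpu")
    USING "s3://bucket/models/resnet50.pt"
""")

# Only after the registration above can ML_PREDICT resolve the model.
preds = spark.sql("SELECT ML_PREDICT(my_resnet, image) AS pred FROM images")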

A better UX would be for the Model Catalog to be persisted somewhere, much like the Hive metastore is persisted in a MySQL / PostgreSQL instance.

Users could then start PySpark with the following configuration:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("rikai.spark.model-catalog.class",
            "ai.eto.rikai.spark.model.mlflow.MlflowModelCatalog")
    .config("rikai.spark.model-catalog.uri", "mlflow://host/path")
    .getOrCreate()
)

The models would then be ready to use with ML_Predict. This might need some deep refactoring of how models are registered and run, though.
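
To make the end state concrete, a sketch of the user-facing query, assuming spark is the session configured as above and that ML_PREDICT resolves model names against the persistent catalog (model and table names are illustrative):

# No per-session CREATE MODEL step: ML_PREDICT looks up "my_resnet" in the
# shared catalog, loads its definition, and runs it as a pandas UDF.
preds = spark.sql("SELECT ML_PREDICT(my_resnet, image) AS pred FROM images")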