Currently, we have to load / create a model after a PySpark session is created:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("CREATE MODEL ...")
```
This loads the model definition, as well as the generated pandas UDF, into the Spark session separately for every user.
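For illustration, here is a sketch of the current per-session flow; the model name, URI scheme, and table are hypothetical, not Rikai's exact syntax:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical registration: the CREATE MODEL syntax and mlflow:// URI
# here are illustrative only.
spark.sql("CREATE MODEL my_classifier USING 'mlflow://host/models/my_classifier'")

# The model definition and the generated pandas UDF now live only in this
# session; every other user must repeat the CREATE MODEL step themselves.
df = spark.sql("SELECT ML_Predict(my_classifier, image) AS pred FROM images")
```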
A better UX would be to persist the Model Catalog somewhere, much like the Hive metastore is persisted in a MySQL/PostgreSQL instance. Users could then start PySpark with the following configuration:
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Session configs are set via .config(); SparkSession.Builder has no .option().
    .config("rikai.spark.model-catalog.class",
            "ai.eto.rikai.spark.model.mlflow.MlflowModelCatalog")
    .config("rikai.spark.model-catalog.uri", "mlflow://host/path")
    .getOrCreate()
)
```
The models would then be ready to use with ML_Predict. This might need some deep refactoring of how models are registered and run, though.
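With a persistent catalog, a query could reference a registered model directly, with no per-session CREATE MODEL step. A minimal sketch (model and table names are hypothetical):

```python
# No CREATE MODEL needed: the catalog configured at session startup
# resolves `my_classifier` on first use.
preds = spark.sql("SELECT ML_Predict(my_classifier, image) AS pred FROM images")
preds.show()
```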
Similar to Spark's / Hive's catalog, the SQL ML catalog can be backed by persistent storage, e.g., a SQL database (MySQL/PostgreSQL).
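As a rough illustration of what such a backing store might look like, here is a hedged sketch of a catalog interface and the kind of record a MySQL/PostgreSQL-backed implementation could persist; the class, method names, and fields are all hypothetical:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ModelSpec:
    """Hypothetical persisted model entry, analogous to a table record
    in the Hive metastore."""
    name: str            # e.g. "my_classifier"
    flavor: str          # e.g. "pytorch", "sklearn"
    uri: str             # e.g. "mlflow://host/models/my_classifier/1"
    options: Dict[str, str]  # serialized model / runner options


class ModelCatalog(ABC):
    """Hypothetical catalog interface; a JDBC-backed implementation would
    map these calls onto a `models` table in MySQL/PostgreSQL."""

    @abstractmethod
    def list_models(self) -> List[ModelSpec]:
        """Enumerate all registered models, e.g. for SHOW MODELS."""
        ...

    @abstractmethod
    def get_model(self, name: str) -> ModelSpec:
        """Resolve a model by name so ML_Predict can load it lazily."""
        ...
```

The point of the interface split is the same as with Spark's external catalog: the session only holds a lightweight client, while the model definitions live in shared, durable storage.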