databricks-industry-solutions / auto-data-linkage

Low-effort linking and easy de-duplication. Databricks ARC provides a simple, automated, lakehouse-integrated entity resolution solution for intra- and inter-dataset linking.
https://databricks-industry-solutions.github.io/auto-data-linkage/

Inference #61

Open spookrunner opened 1 year ago

spookrunner commented 1 year ago

Has there been any consideration about adding support for inference via a registered model on Databricks?

arabe91 commented 11 months ago

any updates here on that?

robertwhiffin commented 9 months ago

@spookrunner @arabe91 would you elaborate please? Would you like to see the model served as API to provide pairwise predictions, or something else?

spookrunner commented 9 months ago

@robertwhiffin Yes, exactly: enabling the code to be registered as a formal model and then used for inference (either via batch or API) rather than embedding the code directly in a workflow somewhere.

robertwhiffin commented 9 months ago

This is a good idea. It will be added to the backlog. Thanks!

jericksonclinicaloptions commented 8 months ago

Are there any updates on this issue?

I was able to get ARC training up and running in a day, but I am stuck on how to retrieve a trained model from MLflow and make predictions on new datasets with the same schema. The documentation and examples seem focused on training the model.

Is there an example to follow to retrieve a previously logged model and make predictions using the existing code?

robertwhiffin commented 8 months ago

This is a WIP - the model can currently be retrieved from MLflow and deployed, but it's not pretty. The next version will make this nicer. Something like this should work:

import mlflow
import arc  # required so the logged ARC model class can be deserialised
from splink.spark.linker import SparkLinker

logged_model = 'runs:/d540c25de5f342db80ff7e8ceb512bff/linker'

# Load the logged model as a PyFuncModel, then unwrap the underlying ARC linker.
loaded_model = mlflow.pyfunc.load_model(logged_model)
arc_linker = loaded_model.unwrap_python_model()

# Build a fresh Splink linker over the new data (same schema as at training time)
# and load the settings extracted from the logged model.
linker = SparkLinker(link_data, spark=spark)  # link_data: DataFrame to link/dedupe
linker.load_settings(arc_linker.settings)
predictions = linker.predict()

The current predict method doesn't work, so we need to extract the settings that define the underlying Splink model and build a Splink linker directly.

API support will be a longer-term effort.

jericksonclinicaloptions commented 8 months ago

@robertwhiffin Thanks! This is what I needed. I was getting stuck because the current SplinkMLFlowWrapper.predict throws an error, I think because it uses the deprecated linker.initialise_settings instead of linker.load_settings.

Looking forward to the next version!
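For anyone hitting the same error before the fix ships, the general workaround is to replace the wrapper's broken predict method with one that calls the current Splink API. The sketch below shows the monkey-patch pattern with stand-in classes (it does not use ARC's or Splink's real classes, since the exact wrapper internals aren't shown in this thread): in ARC, the real wrapper is SplinkMLFlowWrapper and the deprecated call is linker.initialise_settings, superseded by linker.load_settings.

```python
# Sketch of the monkey-patch workaround. Linker and Wrapper are stand-ins
# for splink's SparkLinker and ARC's SplinkMLFlowWrapper, respectively.

class Linker:
    """Stand-in for a Splink linker."""
    def __init__(self):
        self.settings = None

    def initialise_settings(self, settings):
        # Stand-in for the deprecated/removed Splink method.
        raise AttributeError("initialise_settings has been removed")

    def load_settings(self, settings):
        # Stand-in for the current Splink method.
        self.settings = settings


class Wrapper:
    """Stand-in for the MLflow model wrapper."""
    def __init__(self, settings):
        self.settings = settings

    def predict(self, linker):
        # Broken: calls the deprecated method and raises.
        linker.initialise_settings(self.settings)
        return linker.settings


def patched_predict(self, linker):
    # Fixed: route through the current API instead.
    linker.load_settings(self.settings)
    return linker.settings


# Monkey-patch the class in place; existing instances pick up the fix too.
Wrapper.predict = patched_predict

wrapper = Wrapper({"link_type": "dedupe_only"})
result = wrapper.predict(Linker())  # no longer raises
```

The same pattern applies to the real wrapper: assign a corrected function to SplinkMLFlowWrapper.predict after importing it, without waiting for a new release.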