databricks / mlops-stacks

This repo provides a customizable stack for starting new ML projects on Databricks that follow production best practices out of the box.
https://docs.databricks.com/en/dev-tools/bundles/mlops-stacks.html
Apache License 2.0

Can't register a new version of a model #135

Closed: Eiley2 closed this issue 7 months ago

Eiley2 commented 7 months ago

Hi there,

I'm currently trying to implement this stack at my workplace and I'm running into an issue; I'd like to understand whether I'm doing something wrong or whether it's a configuration error.

Since we only have 2 workspaces, one for prod and the other for QA and development, with a shared Unity Catalog, my idea was to name the catalog "ml-ops", use the model's name as the schema (in this case, "prometheus"), and then, within each schema's registry, name the models per environment as follows: prod-prometheus-model, staging-prometheus-model, and dev-prometheus-model.

To do this, I made the following modifications to these files:

ml-artifacts-asset.yml

resources:
  registered_models:
    model:
      name: ${bundle.target}-${var.model_name}
      catalog_name: ml-ops
      schema_name: prometheus-model
      <<: *grants
      depends_on:
        - resources.jobs.model_training_job.id
        - resources.jobs.batch_inference_job.id

databricks.yml

bundle:
  name: ${bundle.target}-${var.model_name}

variables:
  experiment_name:
    description: Experiment name for model training.
    default: /Users/${workspace.current_user.userName}/${bundle.target}-prometheus-experiment
  model_name:
    description: Model name for model training.
    default: prometheus-model

model-workflow-asset.yml

resources:
  jobs:
    model_training_job:
      name: ${bundle.target}-${var.model_name}-model-training-job
      job_clusters:
        - job_cluster_key: model_training_job_cluster
          <<: *new_cluster
      tasks:
        - task_key: Train
          job_cluster_key: model_training_job_cluster
          notebook_task:
            notebook_path: ../training/notebooks/Train.py
            base_parameters:
              env: ${bundle.target}
              # TODO: Update training_data_path
              training_data_path: /databricks-datasets/nyctaxi-with-zipcodes/subsampled
              experiment_name: ${var.experiment_name}
              model_name: ml-ops.${var.model_name}.${bundle.target}-${var.model_name}
              # git source information of the current ML asset deployment. It will be persisted as part of the workflow run
              git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit}

However, after the first deployment, when the CI/CD pipeline runs:

databricks bundle deploy -t staging

I get the following error:

Updating deployment state...
Error: terraform apply: exit status 1

Error: cannot create registered model: Function or Model 'ml-ops.prometheus-model.staging-prometheus-model' already exists

  with databricks_registered_model.model,
  on bundle.tf.json line 188, in resource.databricks_registered_model.model:
 188:       }

Besides that, everything runs perfectly, and I can serve the model without any trouble. Also, I'm using the demo model; I haven't implemented our own yet.

I'm not sure if I'm doing something wrong. Any guidance would be appreciated.

Eiley2 commented 7 months ago

In case it helps, this is the requirements.txt I'm using:

mlflow==2.7.1
numpy>=1.23.0
pandas>=1.4.3
scikit-learn>=1.1.1
matplotlib>=3.5.2
Jinja2==3.0.3
pyspark~=3.3.0
pytz~=2022.2.1

arpitjasa-db commented 7 months ago

@Eiley2 thanks for opening this issue! A few things:

  1. The error says that the model already exists. Would you mind confirming whether it does, and if so, deleting the model if it's safe to do so? This error prevents accidentally overwriting models that already exist when using bundle deploy.
  2. Above you said the schema name is "prometheus" but in the code I see the schema name is "prometheus-model". Is that intentional?
  3. I wouldn't rename the bundle: name: in databricks.yml to something variable-dependent, since that is the name of the entire bundle as a whole. By default, we use the project name for the bundle name, and we'd recommend doing the same to prevent unintended behavior down the line.
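
For point 3, here's a rough sketch of what that could look like with your names (just an illustration, not the exact stack files; adjust to your setup):

databricks.yml

bundle:
  # keep the bundle name static (the project name), not target- or variable-dependent
  name: prometheus

ml-artifacts-asset.yml

resources:
  registered_models:
    model:
      # the full Unity Catalog name resolves to e.g. ml-ops.prometheus.staging-prometheus-model
      name: ${bundle.target}-${var.model_name}
      catalog_name: ml-ops
      schema_name: prometheus
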
Eiley2 commented 7 months ago

Thanks for your answer @arpitjasa-db !

  1. Yeah, it does. When I delete it, deploy for the first time, and run databricks bundle deploy -t dev/staging/prod, it works perfectly, but when the GitHub Actions deploy it again to refresh the code, it throws the error saying it already exists. (screenshot attached)
  2. Yeah, sorry, as you can see in the picture, the schema is correct.
  3. Gotcha, will rename it to prometheus.
Eiley2 commented 7 months ago

I just deleted the prod version and ran databricks bundle deploy -t prod, and it went okay. If I run it again with some change, it says it's already deployed. Is this not the right way to deploy a change? Sorry if I'm not getting it right.

(screenshot attached)

arpitjasa-db commented 7 months ago

Oh, you're not using the CI/CD workflows for deployment? Either way, this should be working. What is supposed to happen is that when you run databricks bundle deploy, it deploys the resources and, via a deployment state, marks them as having come from this bundle; subsequent deploys check that state and only deploy the necessary resources, overwriting as needed.

What seems to be happening is that after deploying, the CLI doesn't recognize that this resource was created from this bundle and instead thinks it was created elsewhere, which is why it fails, for safety, with the error I mentioned above.

Are you running the command from the same directory each time? If so, would you mind opening the .bundle/ subdirectory that was created in that directory and listing out all the contents in its subdirectories?
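
Something like this should dump everything (just a sketch; the exact path depends on the CLI version, and newer versions keep the local deployment state under .databricks/bundle/<target>/ rather than .bundle/):

# list every file under the local bundle/deployment state directories, if they exist
find .bundle .databricks -type f 2>/dev/null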

Eiley2 commented 7 months ago

Your suggestion got me thinking, so I went ahead and deleted the .databricks folder with everything within it, and it did the trick!

Looks like when I was deploying from the CI/CD workflows and then tried debugging locally, I somehow mixed my own terminal credentials with the service account credentials when I tried to deploy. After deleting the folder, the bundle state was recreated from scratch and the CLI recognized that the resources came from this bundle, which solved the error. Thanks for your help!
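
In case anyone else hits this, what I ended up doing was roughly the following (run from the bundle root; the local state folder name may differ depending on your CLI version):

# remove the stale local deployment state, then redeploy from scratch
rm -rf .databricks
databricks bundle deploy -t prod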

arpitjasa-db commented 7 months ago

Awesome, glad to hear it!