databricks / mlops-stacks

This repo provides a customizable stack for starting new ML projects on Databricks that follow production best-practices out of the box.
https://docs.databricks.com/en/dev-tools/bundles/mlops-stacks.html
Apache License 2.0
416 stars 141 forks source link

Fix Monorepo Setup & Separate by Role #133

Closed arpitjasa-db closed 6 months ago

arpitjasa-db commented 7 months ago

Changes

This PR makes multiple changes in line with the Separate by Role Project Proposal to separate CICD generation from project generation to allow monorepos to work and making the division more intuitive:

Fixes https://github.com/databricks/mlops-stacks/issues/122

Also includes various quality-of-life improvements and minor bug fixes such as constant config versions, clean-up of unused variables, and proper file generation.

Tests

Project Generation Workflow:

Screenshot 2024-01-25 at 2 53 22 PM Screenshot 2024-01-26 at 2 55 57 AM Screenshot 2024-01-25 at 2 54 16 PM

CICD Zip Generation Workflow

Screenshot 2024-02-03 at 2 08 59 AM
niall-turbitt commented 7 months ago

testing this out and have taken the following workflow:

All works fine at this stage. I have a mono repo with 2 projects and CI/CD for the first project. I now want to add the CI/CD pipelines for the second project, and run through initialization again, selecting CICD_only, and setting the input_root_dir to the monorepo project root dir

What should be the intended workflow to add the CI/CD pipelines for a second project in a monorepo?

niall-turbitt commented 7 months ago

I'm getting this fs_param.json file generated for projects, when feature store is not selected: image

Can we remove this?

arpitjasa-db commented 7 months ago

I'm getting this fs_param.json file generated for projects, when feature store is not selected: image

Can we remove this?

Yeah good question, we could remove it and check for the presence/absence of the file instead in our workflows (but that does complicate the code a bit). Or we can rename is to project_params.json. What would be better?

niall-turbitt commented 7 months ago

in the generated top level README, it looks like we're generated an additional ( under the Setting up CI/CD section image

niall-turbitt commented 7 months ago

I'm getting this fs_param.json file generated for projects, when feature store is not selected: image Can we remove this?

Yeah good question, we could remove it and check for the presence/absence of the file instead in our workflows (but that does complicate the code a bit). Or we can rename is to project_params.json. What would be better?

project_params.json seems like a good option

mingyu89 commented 6 months ago

If I approve, please merge the PR shortly before the bugbash so that we can catch issues and fix them quickly.

arpitjasa-db commented 6 months ago

If I approve, please merge the PR shortly before the bugbash so that we can catch issues and fix them quickly.

Sure I can hold off on merging this till Monday morning, but I do also want to get the resources rename PR in before the bugbash

niall-turbitt commented 6 months ago

couple of minor things that need to be fixed in order to get the deploy cicd pipeline running for Azure DevOps. Aside from those, looks good. Thanks for this @arpitjasa-db , appreciate this was a tricky one to work through!

Set up a mono repo to test here https://dev.azure.com/niallturbitt/_git/monorepo-mlops-stacks - have added you as an admin Arpit

niall-turbitt commented 6 months ago

just catching this. For the second project I generated there were no staging or prod profiles defined in the databricks.yaml file. Understand the reasoning for this; a data scientist adding a second project may not know up front the staging/prod workspace URLs.

image

What this means though, is that when the CI/CD pipelines are triggered for the second project, there is not staging/prod profiles defined for the bundles CI/CD pipeline to use

image

We either need to somehow automate the addition of staging/test profiles to the databricks.yaml file of the new project whenever we are running the deploy cicd pipeline for the new project. Or direct the user to do so somewhere in documentation

arpitjasa-db commented 6 months ago

just catching this. For the second project I generated there were no staging or prod profiles defined in the databricks.yaml file. Understand the reasoning for this; a data scientist adding a second project may not know up front the staging/prod workspace URLs.

image

What this means though, is that when the CI/CD pipelines are triggered for the second project, there is not staging/prod profiles defined for the bundles CI/CD pipeline to use

image

We either need to somehow automate the addition of staging/test profiles to the databricks.yaml file of the new project whenever we are running the deploy cicd pipeline for the new project. Or direct the user to do so somewhere in documentation

@niall-turbitt I actually added the profile generation as part of the deploy CICD pipeline. Did it not work?

arpitjasa-db commented 6 months ago

@niall-turbitt I actually added the profile generation as part of the deploy CICD pipeline. Did it not work?

Ah actually I see we forgot to include that file when we did git add. It should be fixed now! Also made the other changes you had in the monorepo I think. Would you mind taking a final pass @niall-turbitt and see if all the issues you found were fixed? Thank you!

niall-turbitt commented 6 months ago

@arpitjasa-db the above comments I had added but didn't properly publish, so don't think you caught them. I see the refs/heads/feature-branch-pr branch issue has already been resolved.

The only other thing I think needs addressing is this point around documenting the permissions for the build service to be able to run the deploy CICD pipeline successfully

niall-turbitt commented 6 months ago

one other minor nit that I think could be confusing for users is the following in a monorepo setup:

From the default mono-mlops-stacks I would assume this param should be project-2 to set up for the new project. However this param has to be project_2 to map to the folder

arpitjasa-db commented 6 months ago

one other minor nit that I think could be confusing for users is the following in a monorepo setup:

  • I am adding a second project to a monorepo, and I specify the project name as project-2 during initialization
  • This is added under a folder project_2 in the monorepo
  • When running the deploy CICD pipeline to set up the CICD pipelines for project 2 we have to set the name of the project via a param:
parameters:
  - name: ProjectName
    displayName: Name of the project to deploy CI/CD for.
    default: mono-mlops-stacks
    type: string

From the default mono-mlops-stacks I would assume this param should be project-2 to set up for the new project. However this param has to be project_2 to map to the folder

@niall-turbitt good catch! Added a step in GH and ADO to create a underscore version of the project name and use that for the directory instead so they can continue using the normal name