databricks / databricks-asset-bundles-dais2023


Upload jar library without syncing other files/folders #9

Open MrJSmeets opened 1 year ago

MrJSmeets commented 1 year ago

Hi,

I would like to upload an existing JAR as a dependent library to a job/workflow without having to sync any other files/folders. Currently, all files/folders are always synchronized, but I don't want to sync those; I only need the JAR in the target/scala-2.12 folder.

sync:
  include:
    - target/scala-2.12/*.jar

Folder structure:

.
├── README.md
├── build.sbt
├── databricks.yml
├── src
│   └── main
│       ├── resources
│       │   └── ...
│       └── scala
│           └── ...
└── target
    ├── global-logging
    └── scala-2.12
        └── xxxxxxxxx-assembly-x.x.x.jar

With dbx, this was possible by using file references. What is the recommended way to do this via DAB, without syncing other files/folders?

I expected this to be possible via artifacts, but that seems to be (for now?) only intended for Python wheels.

andrewnester commented 1 year ago

By default, DABs excludes files and folders from syncing based on the .gitignore file if you're using Git. If you're not using Git, or don't want to add certain files to .gitignore, you can use the sync.exclude property.

sync:
  exclude:
    - src/**/*
    - databricks.yml
    - build.sbt
    - target/global-logging/*

MrJSmeets commented 1 year ago

Thanks @andrewnester. It seems that uploading a jar via this synchronization method is not really the right approach for Scala projects. I will instead upload my jar to ADLS/S3 and put the databricks.yml file in a subfolder so I don't have to clutter my job definitions with this list of excludes.

Hopefully something similar to the file references in dbx will be available in the future. They made it very convenient to upload a JAR along with the job definition during local development.

mike-smith-bb commented 6 months ago

Have there been any updates on this feature? We are also struggling to manage the deployment of JAR files as part of the DAB deployment. It doesn't seem to work via include, because there is no support for JAR files in artifacts and it complains about not having a relevant artifact specification.

andrewnester commented 6 months ago

@mike-smith-bb DABs already supports building and automatically uploading JARs, so the configuration can look something like this:

artifacts:
  my_java_project:
    path: ./path/to/project
    build: "sbt package"
    type: jar
    files:
      - source: ./path/to/project/target/*.jar

Note that you have to explicitly specify the files source section to point to where the built JARs are located.
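
For completeness, here is a minimal sketch of how such an artifact can be wired into a job task so the built JAR is uploaded and attached as a library on deploy; the project path, job name, main class, and JAR filename are hypothetical:

artifacts:
  my_scala_project:
    path: ./path/to/project
    build: "sbt package"
    type: jar
    files:
      - source: ./path/to/project/target/scala-2.12/*.jar

resources:
  jobs:
    my_jar_job:
      name: my_jar_job
      tasks:
        - task_key: run_jar
          spark_jar_task:
            main_class_name: com.example.Main   # hypothetical main class
          job_cluster_key: job_cluster          # defined under job_clusters elsewhere in the bundle
          libraries:
            # Local path to the built JAR; DABs uploads it during `bundle deploy`
            - jar: ./path/to/project/target/scala-2.12/my-app-assembly-0.1.0.jar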

Also, please make sure you're using the latest CLI version (0.217.1 as of now).

If you still experience any issues, feel free to open an issue in the CLI repo: https://github.com/databricks/cli/issues

mike-smith-bb commented 6 months ago

Thanks, @andrewnester. Your suggestion, I think, assumes that we are building the artifact as part of the DAB deployment. What if the JAR file is built by a different process and we simply want to include it in the job cluster's libraries while storing it in the DAB structure? Is this supported?

andrewnester commented 6 months ago

@mike-smith-bb yes, indeed.

Then just using the sync include section should work. Does it work for you?

sync:
  include:
    - target/scala-2.12/**/*.jar

Paths can be defined in gitignore-like syntax, so it should be flexible enough to match only what you need.
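
For example, patterns along these lines should work (illustrative only):

sync:
  include:
    - target/scala-2.12/*.jar       # all JARs directly under that folder
    - target/**/*-assembly-*.jar    # or match assembly JARs at any depth under target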

jmatias-gilead commented 6 months ago

sync:
  include:
    - target/scala-2.12/**/*.jar

Paths can be defined in gitignore-like syntax, so it should be flexible enough to match only what you need.

@andrewnester — This doesn't seem to work with jar files. Even if I sync the file like you showed, I can't add those jar files as dependencies.

If I do

sync:
  include:
    - resources/lib/*
...
resources:
  jobs:
    my_job:
      name: My Job
      tasks:
        - task_key: mytask
          notebook_task:
            notebook_path: ../src/mymodule/myfile.py
          job_cluster_key: job_cluster
          libraries:
            - jar: /Workspace/${workspace.file_path}/resources/lib/my_custom.jar

I get this error:

[screenshot of the error]

I'm assuming it's because of this:

[screenshot]

Do we have any options available to add a jar dependency from source like we used to do with dbx?

mike-smith-bb commented 6 months ago

We found the same and came to the same conclusion. It seems like we need a pre-deploy step that can push the JAR/dependencies into a Volume or cloud storage, and then manage the location of the dependency through the sync and dependency configs. Interested in other approaches here.
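
A rough sketch of what that could look like on the bundle side, assuming the JAR has already been pushed to a UC Volume by a separate pre-deploy step (the Volume path, job, and class names are hypothetical):

resources:
  jobs:
    my_job:
      name: My Job
      tasks:
        - task_key: mytask
          spark_jar_task:
            main_class_name: com.example.Main   # hypothetical main class
          job_cluster_key: job_cluster
          libraries:
            # JAR uploaded to the Volume outside of the bundle deploy
            - jar: /Volumes/my_catalog/my_schema/my_volume/my_custom.jar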

jmatias commented 6 months ago

We found the same and came to the same conclusion. It seems like we need a pre-deploy step that can push the JAR/dependencies into a Volume or cloud storage, and then manage the location of the dependency through the sync and dependency configs. Interested in other approaches here.

Ok, I will do that in the meantime and see how that goes.

fernanluyano commented 1 month ago

I'm having the same issue. I have a Databricks Volume for JAR libraries. My current workaround is just using the AWS CLI to upload the files before deploying the bundle. However, what if it's an internal/managed Volume? I think the Databricks CLI could include an option to upload a file to a Volume.

andrewnester commented 1 month ago

Since version 0.224.0, DABs supports uploading JARs to UC Volumes; you can find an example here: https://github.com/databricks/bundle-examples/blob/main/knowledge_base/spark_jar_task/databricks.yml

You can omit the whole artifacts section if you don't want the JAR to be rebuilt automatically as part of the deploy, and just deploy the one referenced from the libraries field.
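
A condensed sketch along the lines of that example, assuming CLI 0.224.0 or later; the Volume path, job name, main class, and JAR filename are placeholders:

bundle:
  name: spark_jar_bundle

targets:
  dev:
    workspace:
      # Local libraries referenced in the job below are uploaded here on `bundle deploy`
      artifact_path: /Volumes/my_catalog/my_schema/my_volume

resources:
  jobs:
    my_jar_job:
      name: my_jar_job
      tasks:
        - task_key: run_jar
          spark_jar_task:
            main_class_name: com.example.Main   # placeholder main class
          job_cluster_key: job_cluster
          libraries:
            # Pre-built local JAR; no artifacts section needed if it should not be rebuilt during deploy
            - jar: ./target/scala-2.12/my-app-assembly-0.1.0.jar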

jmatias commented 1 month ago

Since version 0.224.0, DABs supports uploading JARs to UC Volumes; you can find an example here: https://github.com/databricks/bundle-examples/blob/main/knowledge_base/spark_jar_task/databricks.yml

You can omit the whole artifacts section if you don't want the JAR to be rebuilt automatically as part of the deploy, and just deploy the one referenced from the libraries field.

Thanks for sharing @andrewnester.

What happens if I want to upload the JAR and deploy my workflows using the same bundle? If I set the artifact_path to the UC Volume, then the whole bundle will be deployed there, no? Though perhaps that wouldn't be a bad thing...

andrewnester commented 1 month ago

@jmatias no, not really. artifact_path is only the path that local libraries are uploaded to; DABs doesn't yet support deploying the whole bundle to Volumes (that would be the file_path config).
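
In other words, the two settings can point to different locations. A minimal sketch with hypothetical paths:

targets:
  dev:
    workspace:
      # Bundle source files still go to the workspace (file_path)
      file_path: /Workspace/Users/someone@example.com/.bundle/my_bundle/dev/files
      # Only local libraries such as JARs are uploaded to the Volume (artifact_path)
      artifact_path: /Volumes/my_catalog/my_schema/my_volume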

jmatias commented 1 month ago

@andrewnester it worked!