MrJSmeets opened 1 year ago
By default, DABs exclude files and folders from syncing based on the .gitignore file if you're using Git.
If you're not using Git, or don't want to list certain files in .gitignore, you can use the sync.exclude property:
```yaml
sync:
  exclude:
    - src/**/*
    - databricks.yml
    - build.sbt
    - target/global-logging/*
```
Thanks @andrewnester, then it seems that uploading a JAR via this synchronisation method is not really the right way for Scala projects. I will instead upload my JAR to ADLS/S3 and put the databricks.yaml file in a subfolder, so I don't have to clutter my job definitions with this list of excludes.
Hopefully something similar to the file references in dbx will be available in the future. Those made it very easy to upload a JAR together with the job definition during local development.
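For context, a dbx deployment file could point a library at a local file with a file:// reference, which dbx uploaded and rewrote at deploy time. The sketch below is approximate and from memory; the schema details, names, and paths are illustrative rather than verbatim dbx syntax:

```yaml
# dbx deployment.yml (approximate schema; names and paths illustrative)
environments:
  default:
    workflows:
      - name: "my-scala-job"
        spark_jar_task:
          main_class_name: "com.example.Main"
        libraries:
          # dbx resolved file:// references by uploading the local JAR
          # and substituting the remote path at deploy time
          - jar: "file://target/scala-2.12/my_app.jar"
```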
Have there been any updates on this feature? We are also struggling to manage deployment of JAR files as part of the DAB deployment. It doesn't seem to work via include, because there isn't support for JAR files in artifacts, and it complains about not having a relevant artifact specification.
@mike-smith-bb DABs already support building and automatically uploading JARs, so the configuration can look something like this:
```yaml
artifacts:
  my_java_project:
    path: ./path/to/project
    build: "sbt package"
    type: jar
    files:
      - source: ./path/to/project/targets/*.jar
```
Note that you have to explicitly specify the files source section to point to where the built JARs are located.
Also, please make sure you're using the latest CLI version (0.217.1 as of now).
If you still experience any issues, feel free to open an issue in the CLI repo: https://github.com/databricks/cli/issues
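To attach the built JAR to a task, it can be referenced by its local path under libraries; during deploy the bundle uploads it and rewrites the reference to the uploaded location. A minimal sketch, with hypothetical job, task, and class names:

```yaml
resources:
  jobs:
    my_jar_job:                                # hypothetical job
      name: My JAR Job
      tasks:
        - task_key: main                       # hypothetical task
          spark_jar_task:
            main_class_name: com.example.Main  # hypothetical main class
          job_cluster_key: job_cluster         # assumes a matching job_clusters entry elsewhere
          libraries:
            # local path to the JAR produced by the artifacts section above;
            # rewritten to its uploaded location during deploy
            - jar: ./path/to/project/targets/my_project.jar
```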
Thanks, @andrewnester. Your suggestion, I think, assumes that we are building the artifact as part of the DAB deployment. What if the JAR file is built by a different process and we are simply trying to include it as part of the job cluster created, and want to store it in the DAB structure? Is this supported?
@mike-smith-bb yes, indeed. Then just using the sync include section should work; does it for you?
```yaml
sync:
  include:
    - target/scala-2.12/**/*.jar
```
Paths can be defined in gitignore-like syntax, so it should be flexible enough to match only what you need.
@andrewnester This doesn't seem to work with JAR files. Even if I sync the file like you showed, I can't add those JARs as dependencies.
If I do:

```yaml
sync:
  include:
    - resources/lib/*

# ...

resources:
  jobs:
    my_job:
      name: My Job
      tasks:
        - task_key: mytask
          notebook_task:
            notebook_path: ../src/mymodule/myfile.py
          job_cluster_key: job_cluster
          libraries:
            - jar: /Workspace/${workspace.file_path}/resources/lib/my_custom.jar
```
I get this error:
I'm assuming it's because of this:
Do we have any options available to add a JAR dependency from source, like we used to do with dbx?
We found the same and came to the same conclusion. Seems like we need a pre-deploy step that can inject the jar/dependencies into a volume or cloud storage and also manage the location of the dependency through the sync and dependency configs. Interested in other approaches here.
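One variant of that approach: if a separate pre-deploy step has already placed the JAR in a UC Volume (or other storage the cluster can read), the job can reference it directly under libraries, and nothing extra needs to be synced. A minimal sketch; the Volume path, class name, and job layout are hypothetical:

```yaml
resources:
  jobs:
    my_job:
      name: My Job
      tasks:
        - task_key: mytask
          spark_jar_task:
            main_class_name: com.example.Main           # hypothetical main class
          job_cluster_key: job_cluster                   # assumes a matching job_clusters entry elsewhere
          libraries:
            # JAR uploaded by a separate pre-deploy step; path is illustrative
            - jar: /Volumes/main/default/libs/my_custom.jar
```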
Ok, I will do that in the meantime and see how that goes.
I'm having the same issue. I have a Databricks Volume for JAR libraries. My current workaround is just using the AWS CLI to upload the files before deploying the bundle. However, what if it's an internal/managed Volume? I think the Databricks CLI could include an option to upload a file to a Volume.
Since version 0.224.0, DABs support uploading JARs to UC Volumes; you can find an example here: https://github.com/databricks/bundle-examples/blob/main/knowledge_base/spark_jar_task/databricks.yml
You can omit the whole artifacts section if you don't want the JAR to be rebuilt automatically as part of the deploy and just want to deploy the one referenced from the libraries field.
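The key pieces of that setup look roughly like the sketch below; the Volume, class, and file names are illustrative rather than copied from the linked example:

```yaml
workspace:
  # local libraries referenced below are uploaded here during deploy
  artifact_path: /Volumes/main/default/my_volume   # hypothetical UC Volume

resources:
  jobs:
    spark_jar_job:
      name: Spark JAR job
      tasks:
        - task_key: jar_task
          spark_jar_task:
            main_class_name: com.example.Main      # hypothetical main class
          # cluster configuration omitted for brevity
          libraries:
            # prebuilt JAR; no artifacts section is needed for it to be uploaded
            - jar: ./target/scala-2.12/my_app.jar
```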
Thanks for sharing @andrewnester.
What happens if I want to upload the JAR and deploy my workflows using the same bundle? If I set the artifact_path to the UC Volume, then the whole bundle will be deployed there, no? Though perhaps that wouldn't be a bad thing...
@jmatias no, not really. artifact_path is only the path that local libraries are uploaded to; DABs don't yet support deploying the whole bundle to Volumes (that would be the file_path config).
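To make the distinction concrete, a minimal sketch of the two settings with hypothetical values:

```yaml
workspace:
  # where the bundle's synced source files go (a workspace path); hypothetical value
  file_path: /Workspace/Users/someone@example.com/.bundle/my_bundle/dev/files
  # where local libraries (JARs, wheels) referenced under libraries are uploaded;
  # since CLI 0.224.0 this can be a UC Volume
  artifact_path: /Volumes/main/default/my_volume
```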
@andrewnester it worked!
Hi,
I would like to upload an existing JAR as a dependent library to a job/workflow without having to sync any other files/folders. Currently, all files/folders are always synchronized, but I don't want to sync them; I only need the JAR in the target/scala-2.12 folder.
Folder structure:
With dbx, this was possible by using file references. What is the recommended way to do this via DAB, without syncing other files/folders?
I expected this to be possible via artifacts, but that seems to be (for now?) only intended for Python wheels.