databrickslabs / cicd-templates

Manage your Databricks deployments and CI with code.

Deploying a script relying on multiple files #62

Closed hazardsy closed 3 years ago

hazardsy commented 3 years ago

Greetings,

I am starting to play a little with what this repository offers and there is a use case I cannot seem to make work.

Basically, I have a single entrypoint that imports classes from sibling files located in the same directory, like this:

|--projectname/
|----jobs/
|------jobsfolder/
|--------jobA.py
|--------jobB.py
|--------entrypoint.py

With entrypoint being this:

from jobA import jobA
from jobB import jobB

if __name__=="__main__":
    jobA().launch()
    jobB().launch()

The relevant part of the job definition is the following:

    "spark_python_task": {
        "python_file": "projectname/jobs/jobsfolder/entrypoint.py"
    }
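
For context, the full job entry looks roughly like this (the cluster values are placeholders, not my real config):

    {
        "name": "entrypoint-job",
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 1
        },
        "spark_python_task": {
            "python_file": "projectname/jobs/jobsfolder/entrypoint.py"
        }
    }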

When trying to deploy or execute this job, I get a ModuleNotFoundError on jobA. That seems logical enough, since only entrypoint.py was uploaded to MLflow.

Digging into the code I unpacked from the DBX wheel, this looks like the intended behavior, but it feels like a fairly common use case when trying to keep code readable and well documented.

Am I missing something about this use case, or is it just not supported as of now?

Anyway, thank you for your work on this project; it has been very useful and a lot of fun to use so far!

renardeinside commented 3 years ago

You're importing the sibling modules directly instead of using a package-based import:

from jobA import jobA
from jobB import jobB

Please try out the following:

from projectname.jobs.jobsfolder.jobA import jobA
from projectname.jobs.jobsfolder.jobB import jobB
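
For these package-based imports to resolve on the cluster, the project itself has to be importable as a package: every directory on the import path needs an __init__.py, and setup.py has to pick the packages up so they are included in the built wheel. A rough sketch (the exact setup.py depends on what the project template generated for you):

|--projectname/
|----__init__.py
|----jobs/
|------__init__.py
|------jobsfolder/
|--------__init__.py
|--------jobA.py
|--------jobB.py
|--------entrypoint.py

# setup.py (sketch, assuming a standard setuptools build)
from setuptools import find_packages, setup

setup(
    name="projectname",
    version="0.1.0",
    packages=find_packages(exclude=["tests", "tests.*"]),
)
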
hazardsy commented 3 years ago

That makes a lot of sense; I can't believe I did not think of that before.

Thank you very much for your answer!