coursera / dataduct

DataPipeline for humans.

How to use multiple scripts or directories in a transform step? #262

Open EAbis opened 7 years ago

EAbis commented 7 years ago

First of all, thank you for making this project; it has made AWS DataPipeline usable.

My question is: how do you go about passing multiple scripts, or multiple directories, into a pipeline's YAML file? I'm asking because I want to consolidate common functionality without having to ship the entire project directory with every job's pipeline.

For example, we currently have a project structure that looks something like this (a simplified YAML sketch of one job follows the layout):

Jobs
- Job1
  - job1.py
  - job1.yaml
  - duplicated_utility.py
- Job2
  - job2.py
  - job2.yaml
  - duplicated_utility.py
- Job3
  - job3.py
  - job3.yaml
  - duplicated_utility.py
...
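For context, each job's YAML currently points a transform step at that job's own script, something like this. This is a simplified sketch: the only dataduct fields I'm relying on are step_type and script, everything else is trimmed or approximate:

```yaml
# Jobs/Job1/job1.yaml -- simplified sketch
name: job1
frequency: one-time
load_time: 01:00

steps:
-   step_type: transform
    script: Jobs/Job1/job1.py   # job1.py imports duplicated_utility.py from the same directory
```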

What I want to do is consolidate the duplicated_utility.py files into a single file (or a collection of files) in a lib directory, so the structure would look like this:

Jobs
- Job1
  - job1.py
  - job1.yaml
- Job2
  - job2.py
  - job2.yaml
- Job3
  - job3.py
  - job3.yaml
lib
- utility.py
...
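What I'm really after is a way to attach the shared lib directory (or extra scripts) to a step in addition to the job's own script. The sketch below is purely hypothetical; as far as I can tell nothing like additional_directories exists in dataduct today, the field name is made up just to illustrate the ask:

```yaml
# Hypothetical -- NOT a real dataduct option, only illustrating what I'm asking for
steps:
-   step_type: transform
    script: Jobs/Job1/job1.py
    additional_directories:   # made-up field name
    -   lib/                  # ship the shared utility.py alongside job1.py
```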

The problem with this is that, as far as I can tell, you would have to pass in a whole directory and then point at a specific script name for each job (roughly the sketch below), which means shipping a lot of extra files around for no reason. You could also create symlinks from each job directory to the lib directory, but that adds overhead and just isn't ideal.
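For reference, the directory-based approach I'm describing would look roughly like this. I'm guessing at the exact option names (script_directory / script_name), so treat them as placeholders for whatever the directory-style transform options are called in your dataduct version:

```yaml
# Workaround sketch -- option names below are assumptions, not verified against the docs
steps:
-   step_type: transform
    script_directory: .               # uploads the whole project: every Job* dir plus lib/
    script_name: Jobs/Job1/job1.py    # the single script this pipeline actually runs
```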

Is there some functionality I'm not aware of, or a best practice to be used?

Thanks!