coursera / dataduct

DataPipeline for humans.

Issue with importing ec2 resource while running on emr #225

Closed: zerowgravity closed this issue 8 years ago

zerowgravity commented 8 years ago

Traceback (most recent call last):
  File "/usr/local/bin/dataduct", line 347, in <module>
    main()
  File "/usr/local/bin/dataduct", line 337, in main
    pipeline_actions(frequency_override=frequency_override, **arg_vars)
  File "/usr/local/bin/dataduct", line 75, in pipeline_actions
    from dataduct.etl import activate_pipeline
  File "/Library/Python/2.7/site-packages/dataduct/etl/__init__.py", line 1, in <module>
    from .etl_actions import activate_pipeline
  File "/Library/Python/2.7/site-packages/dataduct/etl/etl_actions.py", line 5, in <module>
    from ..pipeline import Activity
  File "/Library/Python/2.7/site-packages/dataduct/pipeline/__init__.py", line 5, in <module>
    from .ec2_resource import Ec2Resource
  File "/Library/Python/2.7/site-packages/dataduct/pipeline/ec2_resource.py", line 16, in <module>
    INSTANCE_TYPE = config.ec2.get('INSTANCE_TYPE', const.M1_LARGE)
AttributeError: 'Config' object has no attribute 'ec2'

When I run a dataduct activate command, pipeline_actions imports activate_pipeline, which in turn imports ec2_resource.py. That module reads config.ec2 at import time, so it expects an ec2 section to be defined in the config file. My current dataduct config is set up to run jobs on an EMR cluster, not on an EC2 instance, so it has no ec2 section.

Am I missing something?

AaronTorgerson commented 8 years ago

You need an EC2 instance (and therefore an ec2 section in your config) to run the pipeline, in addition to your EMR cluster for transformations.
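
For anyone hitting the same traceback, here is a minimal sketch of why a missing section turns into an AttributeError. The Config class below is a hypothetical stand-in, assumed to expose each top-level config section as an attribute; dataduct's real Config may be implemented differently, and the 'm1.large' default is only an assumed value for const.M1_LARGE.

```python
class Config(object):
    """Hypothetical stand-in for dataduct's Config; not the real class."""

    def __init__(self, sections):
        # Assumption: each top-level section of the parsed config file
        # becomes an attribute holding that section's key/value dict.
        for name, values in sections.items():
            setattr(self, name, values)


# A config with only an emr section, as in the report above
# (section contents left empty for illustration).
emr_only = Config({'emr': {}})

try:
    # Mirrors the failing line in ec2_resource.py.
    emr_only.ec2.get('INSTANCE_TYPE', 'm1.large')
    error = None
except AttributeError as exc:
    error = str(exc)

# Once an ec2 section exists, the same lookup falls back to the default
# for any key that is not set.
with_ec2 = Config({
    'emr': {},
    'ec2': {'WORKER_GROUP': 'your-worker-group-name'},
})
instance_type = with_ec2.ec2.get('INSTANCE_TYPE', 'm1.large')
```

The point is that the lookup fails on the section, not on the key: any ec2 section, however small, gets past the import.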

zerowgravity commented 8 years ago

@AaronTorgerson thanks! Since EMR instances run on top of EC2 instances, can they be reused for both running a pipeline and for transformations?

AaronTorgerson commented 8 years ago

Yes, probably - in that case you'll want to configure your EMR EC2 instance as a Task Runner: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-how-task-runner-user-managed.html

Then to configure it for dataduct, use this in your dataduct.cfg file (just did this recently myself):

ec2:
    WORKER_GROUP: your-worker-group-name
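
As a rough illustration of what the worker group changes: in a Data Pipeline definition, an activity either names a resource with runsOn or names a worker group with workerGroup, and a user-managed Task Runner polls for work on that group. The helper below is hypothetical and is not dataduct's code; the activity id, command, and resource id are made-up placeholders.

```python
def activity_target(ec2_config, resource_id='Ec2ResourceId'):
    """Return the Data Pipeline field that says where an activity runs.

    Hypothetical helper for illustration only; not part of dataduct.
    """
    worker_group = ec2_config.get('WORKER_GROUP')
    if worker_group:
        # Picked up by a user-managed Task Runner polling this group.
        return {'workerGroup': worker_group}
    # Otherwise Data Pipeline launches the referenced resource itself.
    return {'runsOn': {'ref': resource_id}}


# The ec2 section from the config above, as a parsed dict.
ec2_section = {'WORKER_GROUP': 'your-worker-group-name'}

# Placeholder activity; id and command are illustrative values.
activity = {
    'id': 'ExampleActivity',
    'type': 'ShellCommandActivity',
    'command': 'echo hello',
}
activity.update(activity_target(ec2_section))
```

With WORKER_GROUP set, the activity carries workerGroup and no runsOn, so no separate EC2 instance needs to be launched for it.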