zerowgravity closed this 8 years ago
You need an EC2 instance (and thus an ec2 section in your config) to run the pipeline, in addition to your EMR cluster for transformations.
@AaronTorgerson thanks! Since EMR instances run on top of EC2 instances, can they be reused for both running a pipeline and for transformations?
Yes, probably - in that case you'll want to configure your EMR EC2 instance as a Task Runner: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-how-task-runner-user-managed.html
Then to configure it for dataduct, use this in your dataduct.cfg file (just did this recently myself):
ec2:
    WORKER_GROUP: your-worker-group-name
Traceback (most recent call last):
  File "/usr/local/bin/dataduct", line 347, in <module>
    main()
  File "/usr/local/bin/dataduct", line 337, in main
    pipeline_actions(frequency_override=frequency_override, **arg_vars)
  File "/usr/local/bin/dataduct", line 75, in pipeline_actions
    from dataduct.etl import activate_pipeline
  File "/Library/Python/2.7/site-packages/dataduct/etl/__init__.py", line 1, in <module>
    from .etl_actions import activate_pipeline
  File "/Library/Python/2.7/site-packages/dataduct/etl/etl_actions.py", line 5, in <module>
    from ..pipeline import Activity
  File "/Library/Python/2.7/site-packages/dataduct/pipeline/__init__.py", line 5, in <module>
    from .ec2_resource import Ec2Resource
  File "/Library/Python/2.7/site-packages/dataduct/pipeline/ec2_resource.py", line 16, in <module>
    INSTANCE_TYPE = config.ec2.get('INSTANCE_TYPE', const.M1_LARGE)
AttributeError: 'Config' object has no attribute 'ec2'
When I run a dataduct activate command, the pipeline action imports activate_pipeline, which in turn imports ec2_resource.py, and that module expects an ec2 section to be defined in the config file. My current dataduct config is set up to run jobs on an EMR cluster and not on an EC2 instance.
Am I missing something?
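For anyone hitting the same traceback, here is a minimal sketch of why the error fires at import time, before any pipeline is built. It assumes the config object exposes each top-level section of dataduct.cfg as an attribute; the `Config` class, the `M1_LARGE` value, and the `EXAMPLE_KEY` entry below are hypothetical stand-ins for illustration, not dataduct's actual code.

```python
class Config(object):
    """Hypothetical stand-in for dataduct's config object (not the real class)."""

    def __init__(self, sections):
        # Assumption: each top-level section of the config file
        # becomes an attribute on the config object.
        self.__dict__.update(sections)


M1_LARGE = 'm1.large'  # assumed default, standing in for const.M1_LARGE


def instance_type(config):
    # Same shape as the failing line in ec2_resource.py: the attribute
    # lookup on `config.ec2` happens before `.get()` can supply a default.
    return config.ec2.get('INSTANCE_TYPE', M1_LARGE)


# A config with only an emr section (the key name is illustrative).
emr_only = Config({'emr': {'EXAMPLE_KEY': 'example-value'}})
try:
    instance_type(emr_only)
    error = None
except AttributeError as exc:
    error = str(exc)

# Adding an ec2 section, even one holding only WORKER_GROUP, lets the
# lookup succeed and fall back to the default instance type.
with_ec2 = Config({
    'emr': {'EXAMPLE_KEY': 'example-value'},
    'ec2': {'WORKER_GROUP': 'your-worker-group-name'},
})
resolved = instance_type(with_ec2)
```

If that assumption holds, the default passed to `.get()` only covers a missing key inside the section, not a missing section, which would explain why an EMR-only config fails and why adding an ec2 section with just WORKER_GROUP is enough to get past the import.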