coursera / dataduct

DataPipeline for humans.

No documentation about necessary EC2 bootstrapping #163

Open warhammerkid opened 8 years ago

warhammerkid commented 8 years ago

The create-load-redshift step requires that the EC2 instance have dataduct installed and its configs synced from S3, but no documentation mentions this requirement. For my purposes, I created a simple Packer script to build an AMI with the necessary dependencies. A tiny config file also needs to be created and placed at .dataduct/dataduct.cfg so that sync_from_s3 will actually run:

etl:
    S3_ETL_BUCKET: your-etl-bucket
    S3_BASE_PATH: your-base-path

logging:
    LOG_DIR: ~/.dataduct
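
One way to get that minimal file onto the instance (for example from an AMI build step) is a short shell snippet like the sketch below. The `CONFIG_DIR` variable is a hypothetical name introduced here for illustration, and the bucket and base path are the same placeholders as above, so substitute your own values.

```shell
# Sketch: write the minimal dataduct config described above.
# CONFIG_DIR is an illustrative variable (not part of dataduct); it defaults to ~/.dataduct.
CONFIG_DIR="${CONFIG_DIR:-$HOME/.dataduct}"
mkdir -p "$CONFIG_DIR"

# Quoted heredoc delimiter so nothing inside is expanded by the shell.
cat > "$CONFIG_DIR/dataduct.cfg" <<'EOF'
etl:
    S3_ETL_BUCKET: your-etl-bucket
    S3_BASE_PATH: your-base-path

logging:
    LOG_DIR: ~/.dataduct
EOF
```

Run it as the same user that will execute the pipeline steps, so that `~/.dataduct` resolves to the home directory the bootstrap command reads from.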

Then you can put something like the following in the config file you use to create the pipeline:

bootstrap:
    ec2:
    -   step_type: transform
        command: dataduct config sync_from_s3 ~/.dataduct/dataduct.cfg
        no_output: true

It would be nice if this was all done automatically, but at a bare minimum it would help to have some documentation pointing people in the right direction.

warhammerkid commented 8 years ago

Here's a Gist of the Packer script I'm using: https://gist.github.com/warhammerkid/35a49f29d15d87765349

seguschin commented 8 years ago

Some other steps require dataduct to be installed as well. The best solution is to create a custom AMI, since that also speeds up instance startup. Another way is to add a bootstrap step with all the commands needed to install everything, for example:

bootstrap:
    ec2:
    -   step_type: transform
        input_node: []
        command: sudo yum update -y; sudo yum install -y gcc gcc-c++ mysql56-devel MySQL-python27 postgresql94-devel graphviz python-devel s3cmd; sudo pip install dataduct; aws s3 cp s3://my_bucket/config/dataduct.cfg ~/.dataduct/dataduct.cfg
        no_output: true
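
The same commands can also be kept in a small script, which is easier to read and stops at the first failure (commands chained with `;` keep running after an error). This is only a sketch: the function name `bootstrap_dataduct` and its single argument are hypothetical, and the package list is copied from the step above without being re-verified.

```shell
# Sketch: the bootstrap commands above as a function that stops at the first failure.
# bootstrap_dataduct is an illustrative name, not something dataduct provides.
bootstrap_dataduct() {
    # $1: S3 URI of the config file, e.g. s3://my_bucket/config/dataduct.cfg (placeholder)
    if [ -z "${1:-}" ]; then
        echo "usage: bootstrap_dataduct <s3-uri-of-dataduct.cfg>" >&2
        return 1
    fi

    # '&&' chaining: each command runs only if the previous one succeeded.
    sudo yum update -y &&
    sudo yum install -y gcc gcc-c++ mysql56-devel MySQL-python27 \
        postgresql94-devel graphviz python-devel s3cmd &&
    sudo pip install dataduct &&
    mkdir -p "$HOME/.dataduct" &&
    aws s3 cp "$1" "$HOME/.dataduct/dataduct.cfg"
}
```

On the instance it would be called as `bootstrap_dataduct s3://my_bucket/config/dataduct.cfg`, with the bucket name replaced by your own.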

On a side note, PyPI only has version 0.4.0 of dataduct.