coursera / dataduct

DataPipeline for humans.

enable compression for load-redshift and s3-node #238

Closed by p5k6 8 years ago

p5k6 commented 8 years ago

tl;dr - I would like to add compression options to load-redshift and s3_node. Relevant AWS documentation: s3-node and RedshiftCopyActivity.

We have a use case at my employer where we have to push some fairly large tables (about 500 GB uncompressed) from MySQL to Redshift. I created a custom step (based on extract-rds) that keeps the data compressed throughout the pipeline. However, this required some modifications to both s3-node and load-redshift, and I would like to contribute those options back to the mainline project. PR forthcoming.

Also - I'd be happy to contribute the custom step (I called it ExtractMysqlGzip, for lack of a better term). The only reason I have not opened a PR for it is that the step is pretty hacky: it has to work around the limitations AWS imposes on S3DataNodes that have compression enabled.
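For readers unfamiliar with the two AWS settings involved, here is a rough sketch of what the requested options map to at the Data Pipeline level. It is not dataduct's API and not the code from the PR: the helper names, object ids, and bucket path are hypothetical, and it only builds plain dicts in the shape of pipeline definition objects. It assumes the documented `compression` field on S3DataNode (`none` or `gzip`) and the `commandOptions` field on RedshiftCopyActivity, which passes options such as `GZIP` through to the Redshift COPY command.

```python
import json


def s3_data_node(node_id, directory_path, compression="none"):
    """Build an S3DataNode definition object (hypothetical helper)."""
    if compression not in ("none", "gzip"):
        raise ValueError("compression must be 'none' or 'gzip'")
    obj = {
        "id": node_id,
        "type": "S3DataNode",
        "directoryPath": directory_path,
    }
    # Only emit the field when compression is actually requested.
    if compression != "none":
        obj["compression"] = compression
    return obj


def redshift_copy_activity(activity_id, input_node, output_node_id,
                           insert_mode="TRUNCATE", command_options=None):
    """Build a RedshiftCopyActivity definition object (hypothetical helper)."""
    obj = {
        "id": activity_id,
        "type": "RedshiftCopyActivity",
        "input": {"ref": input_node["id"]},
        "output": {"ref": output_node_id},
        "insertMode": insert_mode,
    }
    options = list(command_options or [])
    # A gzip-compressed input needs the GZIP option on the COPY command.
    if input_node.get("compression") == "gzip" and "GZIP" not in options:
        options.append("GZIP")
    if options:
        obj["commandOptions"] = options
    return obj


if __name__ == "__main__":
    # Hypothetical ids and bucket path, for illustration only.
    node = s3_data_node("S3Input", "s3://example-bucket/extract/", "gzip")
    activity = redshift_copy_activity("LoadRedshift", node, "RedshiftTable")
    print(json.dumps({"objects": [node, activity]}, indent=2))
```

The sketch derives `GZIP` from the input node's `compression` setting so the two options cannot drift apart. Whether dataduct should couple them this way or expose them independently is a design choice this sketch does not settle.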

p5k6 commented 8 years ago

FYI - here's a link to the custom step I'm using to trick Data Pipeline into compressing the data coming out of MySQL.

p5k6 commented 8 years ago

Merged in PR #239.