coursera / dataduct

DataPipeline for humans.

rds splitting of output files #227

Closed cliu587 closed 8 years ago

cliu587 commented 8 years ago

When outputting the result of an RDS query to S3, it is often useful to split the output into equal-sized files. For example, loading into Redshift is much more efficient when the number of equal-sized files matches the number of slices in the cluster. To support this, we add a `splits` parameter to create-load-redshift that allows the output of the extract-rds step to be split.
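For reference, a pipeline definition using the new parameter might look like the sketch below. This is illustrative only: the `splits` key and the step types are taken from this thread, but the host, database, SQL, and table-definition values are hypothetical, and the exact placement of the parameter should be checked against the merged code.

```yaml
steps:
-   step_type: extract-rds
    sql: SELECT * FROM orders        # hypothetical query
    host_name: my_rds_host           # hypothetical connection settings
    database: my_db

-   step_type: create-load-redshift
    name: load_orders
    table_definition: tables/orders.sql   # hypothetical table definition
    splits: 8    # split the upstream extract-rds output into 8 files
```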

PTAL @sb2nov

sb2nov commented 8 years ago

@cliu587 LGTM

Can you figure out what is going on with the build, though?

cliu587 commented 8 years ago

According to https://travis-ci.org/coursera/dataduct/branches, the develop branch build is broken, and the failures for this build are the same as on develop. I will take a look at them this weekend.

anguswalker commented 7 years ago

@cliu587 @sb2nov In relation to the above ticket, was the issue resolved? I am trying to split the extract_rds TSV file into n parts. I have tried changing the hardcoded `split` variable in extract_rds and adding `split = n` as an additional parameter in the ETL section of the config file, but when I view the s3node0 and s3node1 output, both folders contain only one file. What is the correct way to split files for the create-load-redshift step?
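To illustrate what the splitting is meant to achieve, here is a minimal standalone sketch (not dataduct's actual implementation) of dividing a TSV extract into n roughly equal files by distributing lines round-robin, so that a Redshift COPY can assign one file per slice. The function name and file-naming scheme are my own assumptions.

```python
import os


def split_tsv(input_path, output_dir, n):
    """Split a TSV file into n roughly equal parts (hypothetical helper).

    Lines are distributed round-robin across the output files, which keeps
    the parts within one line of each other in size -- the property that
    lets Redshift load them in parallel, one file per slice.
    """
    outputs = [open(os.path.join(output_dir, "part-%04d.tsv" % i), "w")
               for i in range(n)]
    try:
        with open(input_path) as src:
            for i, line in enumerate(src):
                outputs[i % n].write(line)
    finally:
        for f in outputs:
            f.close()
    return [f.name for f in outputs]
```

Round-robin by line (rather than chunking by byte offset) avoids splitting a record across two files, which matters for delimiter-based loads like TSV.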