I think you can use the extract-s3 step, similar to http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-s3datanode.html. Your directory path should use something like #{format(@scheduledStartTime,'YY-DD-MM-HH')} in it, and then you can merge all the different input nodes for a post-processing step.
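Something like this, for example (the bucket and prefix here are hypothetical placeholders, not from your pipeline):

```yaml
# Hypothetical sketch: read the time-stamped folder matching the scheduled run.
- step_type: extract-s3
  directory_uri: s3://mybucket/incoming/#{format(@scheduledStartTime,'YY-DD-MM-HH')}/
```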
So, just to be sure, would this work?
```yaml
- step_type: extract-s3
  directory_uri: s3://elasticmapreduce/#{format(@scheduledStartTime,'YY-DD-MM-HH')}/wordcount/wordSplitter.py
```
You're pointing a directory_uri at a file path, so something like
```yaml
- step_type: extract-s3
  file_uri: s3://elasticmapreduce/#{format(@scheduledStartTime,'YY-DD-MM-HH')}/wordcount/wordSplitter.py
```
should work. And similarly, you can use directory_uri with a directory path as well.
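The directory form would look something like this (the trailing prefix here is illustrative):

```yaml
# Hypothetical sketch: pull every file under the hour-stamped prefix.
- step_type: extract-s3
  directory_uri: s3://elasticmapreduce/#{format(@scheduledStartTime,'YY-DD-MM-HH')}/wordcount/
```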
I would like to use a similar expression in my sql-command step. I have tried both of the following, but neither works. Any help is appreciated. Thanks.
```yaml
step_type: sql-command
command: |
  unload ('select * from tmp_tbl2') to
  's3://mybucket/data/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}/'
  credentials 'aws_access_key_id=xxxxxxx;aws_secret_access_key=yyyyyy';
```
```yaml
step_type: sql-command
command: |
  unload ('select * from tmp_tbl2') to
  's3://mybucket/data/#{@scheduledStartTime}/'
  credentials 'aws_access_key_id=xxxxxxx;aws_secret_access_key=yyyyyy';
```
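One way I have tried to narrow this down (a hypothetical sanity check, not a documented fix) is a command whose only moving part is the expression itself, to see whether #{} expressions get evaluated in sql-command steps at all:

```yaml
# Hypothetical check: if this returns the literal text '#{@scheduledStartTime}'
# instead of a timestamp, the expression is not being evaluated in this step.
step_type: sql-command
command: |
  select '#{@scheduledStartTime}';
```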
Hi,
I am looking to port my pipelines to dataduct, but I am unable to find in the docs whether there is a way to pass the scheduled start time (load_time in dataduct terminology). Here is my use case: my source application writes S3 files into 15-minute folders, i.e. YY-DD-MM-HH-(00,15,30,45). Then, using Data Pipeline, I load these files by constructing the S3 source path from the schedule start time rounded up to the nearest 15th minute. Please let me know if there is a way in dataduct to create S3 input nodes using load_time as a parameter.
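Roughly, this is what I would hope to write (the bucket and prefix are made up, and I am assuming the pipeline itself runs on a 15-minute schedule, in which case @scheduledStartTime already lands on the :00/:15/:30/:45 boundary and no extra rounding is needed):

```yaml
# Hypothetical sketch: with a 15-minute schedule period, the formatted
# timestamp matches the source folder convention YY-DD-MM-HH-(00,15,30,45).
- step_type: extract-s3
  directory_uri: s3://mybucket/source/#{format(@scheduledStartTime,'YY-DD-MM-HH-mm')}/
```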