How to get load_time or scheduled time ?

coursera / dataduct

DataPipeline for humans.

How to get load_time or scheduled time ? #137

Closed darkcrawler01 closed 9 years ago

darkcrawler01 commented 9 years ago

I am looking to port my pipelines to dataduct, but I can't tell from the docs whether there is a way to pass the scheduled start time, or load_time in dataduct terminology. Here is my use case:

My source application writes S3 files into 15-minute folders, i.e. YY-DD-MM-HH-(00,15,30,45). Then, using Data Pipeline, I load these files by constructing the S3 source path from the schedule start time rounded up to the nearest 15th minute.
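To make the use case concrete, here is a minimal Python sketch of the path construction described above. It is not dataduct or Data Pipeline code: the bucket name `my-bucket` and both function names are hypothetical, and it assumes that DD in the folder layout means day of month and that timestamps already on a 15-minute boundary are left unchanged.

```python
from datetime import datetime, timedelta

QUARTER = timedelta(minutes=15)


def round_up_to_quarter(ts: datetime) -> datetime:
    """Round ts up to the next 15-minute boundary; exact boundaries are kept."""
    floored = ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)
    return floored if floored == ts else floored + QUARTER


def folder_for(ts: datetime, bucket: str = "my-bucket") -> str:
    """Build the S3 folder in the YY-DD-MM-HH-mm layout from the question.

    The bucket name is a placeholder, not something from the thread.
    """
    slot = round_up_to_quarter(ts)
    return f"s3://{bucket}/{slot:%y-%d-%m-%H-%M}/"


print(folder_for(datetime(2015, 3, 4, 10, 7)))
# s3://my-bucket/15-04-03-10-15/
```

Rounding with `timedelta` rather than by editing the minute field directly means the hour and day roll over correctly, e.g. 23:50 maps to 00:00 of the next day.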

Please let me know if there is a way in dataduct to create s3 input nodes using load_time as a parameter.

sb2nov commented 9 years ago

I think you can use the extract-s3 step similar to http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-s3datanode.html

So your directory path should use something like #{format(@scheduledStartTime,'YY-DD-MM-HH')} in it. And then you can merge all the different input nodes for a post processing step.
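As a rough illustration of "different input nodes", the Python sketch below prints one extract-s3 step per 15-minute folder. The expression is copied verbatim from the comment above; the bucket name, the function name, and the `-00` / `-15` / `-30` / `-45` suffixes are assumptions based on the folder layout in the question, and whether dataduct passes the expression through unchanged has not been verified here.

```python
# Expression copied from the comment above; everything else is illustrative.
EXPR = "#{format(@scheduledStartTime,'YY-DD-MM-HH')}"


def quarter_hour_steps(bucket: str = "my-bucket") -> str:
    """Return YAML text with one extract-s3 step per 15-minute folder.

    The bucket name and the minute suffixes are assumptions, not dataduct API.
    """
    steps = []
    for minute in ("00", "15", "30", "45"):
        steps.append(
            "-   step_type: extract-s3\n"
            f"    directory_uri: s3://{bucket}/{EXPR}-{minute}/\n"
        )
    return "\n".join(steps)


print(quarter_hour_steps())
```

Generating the steps keeps the four URIs consistent; the resulting text would still need to be pasted into the pipeline definition by hand.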

darkcrawler01 commented 9 years ago

So just to be sure, this would work?

-   step_type: extract-s3
    directory_uri: s3://elasticmapreduce/#{format(@scheduledStartTime,'YY-DD-MM-HH')}/wordcount/wordSplitter.py

sb2nov commented 9 years ago

You're using directory_uri for a path that points to a file, so something like

-   step_type: extract-s3
    file_uri: s3://elasticmapreduce/#{format(@scheduledStartTime,'YY-DD-MM-HH')}/wordcount/wordSplitter.py

should work. Similarly, you can use directory_uri when the path is a directory.

workanandr commented 9 years ago

I would like to use a similar expression in my sql-command. I have tried both of the following snippets, but neither works. Any help is appreciated. Thanks.

-   step_type: sql-command
    command: |
        unload ('select * from tmp_tbl2') to
        's3://mybucket/data/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}/'
        credentials 'aws_access_key_id=xxxxxxx;aws_secret_access_key=yyyyyy';

-   step_type: sql-command
    command: |
        unload ('select * from tmp_tbl2') to
        's3://mybucket/data/#{@scheduledStartTime}/'
        credentials 'aws_access_key_id=xxxxxxx;aws_secret_access_key=yyyyyy';
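For reference when debugging, the Python sketch below shows the S3 prefix the first snippet appears to be aiming for, were the expression evaluated. It assumes the pattern 'YYYY-MM-dd-HH-mm-ss' corresponds to the strftime pattern `%Y-%m-%d-%H-%M-%S`; the function name and the sample timestamp are made up for illustration, and this does not explain why the expression is not expanded inside sql-command.

```python
from datetime import datetime

# Assumed strftime equivalent of the 'YYYY-MM-dd-HH-mm-ss' pattern above.
STRFTIME_PATTERN = "%Y-%m-%d-%H-%M-%S"


def expected_unload_prefix(scheduled_start: datetime,
                           base: str = "s3://mybucket/data") -> str:
    """Return the prefix the UNLOAD target would have after expansion."""
    return f"{base}/{scheduled_start.strftime(STRFTIME_PATTERN)}/"


# Sample timestamp chosen arbitrarily for the illustration.
print(expected_unload_prefix(datetime(2015, 6, 1, 12, 30, 0)))
# s3://mybucket/data/2015-06-01-12-30-00/
```

If the unloaded files end up under a folder that still contains the literal `#{...}` text, the expression was passed to Redshift unevaluated.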