exptom opened 4 years ago
Hi Tom, the missing steps are 1) adding each GZ partition and 2) running the transformation for each day. For 1), this should work with `msck repair table cf_access_logs.partitioned_gz`. For 2), you could change the `transformPartition` function to override the year, month, day, and hour in the event. Would you like to try that approach and contribute a pull request?
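If that change lands, a backfill driver could pass the partition explicitly in the invocation payload instead of relying on the current time. A minimal sketch, assuming the modified `transformPartition` reads `year`/`month`/`day`/`hour` keys from its event (the key names and the `TransformPartFn` function name are assumptions, not the current API):

```shell
# Build an event payload for an explicit partition instead of "now".
# The key names are assumptions about the modified function.
DT="2020-11-20"
HOUR="00"
payload=$(printf '{"year":"%s","month":"%s","day":"%s","hour":"%s"}' \
  "${DT%%-*}" "$(echo "$DT" | cut -d- -f2)" "${DT##*-}" "$HOUR")
echo "$payload"
# Review, then invoke, e.g.:
#   aws lambda invoke --function-name TransformPartFn \
#     --payload "$payload" --cli-binary-format raw-in-base64-out response.json
```

Looping `HOUR` over 00..23 would then backfill a whole day.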
Glad to see I wasn't alone in this! Thanks for pointing out the two steps here, Steffen. I will have a look at this in the coming days and see if I can sort it out. I'm new to AWS so it may not be efficient, but if I'm able to cobble something together that would be of use to the stack then I'll open a pull.
Hi guys, I found many GZIP log files that were not transformed to Parquet over two days last November. You can see the chart below.
After digging into the problem, I found the root cause: AWS was unstable at that time, which delayed the CloudFront log delivery. Therefore, I added a Lambda function `TransformMissingGzDataDailyFn` to transform the late data daily; it is triggered at 6 a.m. the next day. This Lambda function also filters out logs that are already in the Parquet files. You can see the detailed code in PR #16 (Support historical and delayed log import).
Based on the above changes, we can import the historical data manually as follows:

1) Copy the historical log files into the `new` directory:
$ aws s3 cp --recursive --exclude "*" --include "EFXYZ1234XYZ.2020-11-20*" \
s3://cloudfrontlogs-raw/ s3://cloudfrontlogs-all/new
2) Add the GZ partitions by invoking the `CreatePartFn` lambda:
$ aws lambda invoke --function-name CreatePartFn \
--payload '{"dth":"2020-11-20T00"}' --cli-binary-format raw-in-base64-out response.json
.....
$ aws lambda invoke --function-name CreatePartFn \
--payload '{"dth":"2020-11-20T23"}' --cli-binary-format raw-in-base64-out response.json
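The hourly invocations above (00 through 23) can be generated with a small shell loop instead of being typed out one by one. This sketch only prints the commands so they can be reviewed, or piped to `sh`, before running:

```shell
# Print one CreatePartFn invocation per hour (00..23) of the target day.
DT="2020-11-20"
for h in $(seq -w 0 23); do
  echo "aws lambda invoke --function-name CreatePartFn" \
       "--payload '{\"dth\":\"${DT}T${h}\"}'" \
       "--cli-binary-format raw-in-base64-out response-${h}.json"
done
```

`seq -w` zero-pads the hour so the `dth` value matches the `2020-11-20T00` format shown above.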
3) Transform the GZ logs by invoking the `TransformMissingGzDataDailyFn` lambda:
$ aws lambda invoke --function-name TransformMissingGzDataDailyFn \
--payload '{"dt":"2020-11-20"}' --cli-binary-format raw-in-base64-out response.json \
--cli-read-timeout 300
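Once the transform has run, it is worth spot-checking that the day actually landed in Parquet. A hedged sketch: the `cf_access_logs.partitioned_parquet` table name is my assumption following the `cf_access_logs.partitioned_gz` naming above, and the results bucket is a placeholder; the script prints the Athena command for review rather than running it:

```shell
# Print an Athena query to count the backfilled day's Parquet rows.
# Table name and results bucket are assumptions; adjust to your stack.
DT="2020-11-20"
Y=${DT%%-*}
M=$(echo "$DT" | cut -d- -f2)
D=${DT##*-}
echo "aws athena start-query-execution" \
     "--query-string \"SELECT count(*) FROM cf_access_logs.partitioned_parquet" \
     "WHERE year = '${Y}' AND month = '${M}' AND day = '${D}'\"" \
     "--result-configuration OutputLocation=s3://YOUR-RESULTS-BUCKET/"
```

A count of zero after the transform would indicate the partition was not picked up.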
Hi,
I have implemented this stack and it is working well. However, I dropped my old CloudFront logs into the `new` directory, and they were moved into the `partitioned-gz` directory as expected. I am unsure what the best way is to trigger the create/transformPartition lambdas to process them into the `partitioned-parquet` directory. Only new data is being transformed into Parquet format, because those lambdas work based on the current date/time. Any ideas welcome!