aws-samples / amazon-cloudfront-access-logs-queries

Analyze your Amazon CloudFront Access Logs at Scale with Amazon Athena.
MIT No Attribution

Trigger createPartition / transformPartition lambda for historical data #7

Open exptom opened 4 years ago

exptom commented 4 years ago

Hi,

I have implemented this stack and it is working well. However, when I dropped my old CloudFront logs into the new directory, they were moved into the partitioned-gz directory as expected, but I am unsure of the best way to trigger the createPartition/transformPartition lambdas to process them into the partitioned-parquet directory. Only new data is being transformed into Parquet format, because those lambdas work from the current date/time.

Any ideas welcome!

steffeng commented 4 years ago

Hi Tom, the missing steps are 1) adding each gz partition and 2) running the transformation for each day. For 1), msck repair table cf_access_logs.partitioned_gz should work. For 2), you could change the transformPartition function to override the year, month, day, and hour in the event. Would you like to try that approach and contribute a pull request?
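A hedged sketch of those two steps as AWS CLI calls. The table name comes from the comment above; the Athena results bucket is a placeholder, and the event field names for step 2 are an assumption about what a modified transformPartition could accept, not its current interface:

```shell
# Step 1: register every existing gz partition in the catalog.
# (Results bucket below is a placeholder; use your own query-results location.)
aws athena start-query-execution \
    --query-string "MSCK REPAIR TABLE cf_access_logs.partitioned_gz" \
    --result-configuration "OutputLocation=s3://my-athena-query-results/"

# Step 2: invoke a modified transformPartition once per historical hour,
# overriding the date fields in the event (assumed payload shape).
PAYLOAD='{"year":"2020","month":"11","day":"20","hour":"00"}'
aws lambda invoke --function-name transformPartition \
    --payload "$PAYLOAD" \
    --cli-binary-format raw-in-base64-out response.json
```

Repeating step 2 for each hour and day of the historical range would backfill the parquet table one partition at a time.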

aleksndrr commented 4 years ago

Glad to see I wasn't alone in this! Thanks for pointing out the two steps here, Steffen. I will have a look at this in the coming days and see if I can sort it out. I'm new to AWS so it may not be efficient, but if I'm able to cobble something together that would be of use to the stack then I'll open a pull.

titanjer commented 3 years ago

Hi guys, I found that many GZIP log files were not transformed to Parquet on two days last November. You can see the chart below.

[Screenshot, 2021-02-02: chart showing the untransformed GZIP log files]

After digging into the problem, the root cause was that AWS was unstable at the time and delayed the CloudFront log delivery. Therefore, I added a lambda function, TransformMissingGzDataDailyFn, which is triggered at 6am the next day to transform the late data daily. This lambda function also filters out logs that are already in the Parquet files. You can see the detailed code in the "Support historical and delayed logs import" PR #16.
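As a hedged sketch, such a daily trigger could be wired up with an EventBridge (CloudWatch Events) cron rule. The rule name, the exact 6am UTC expression, and the account/region in the target ARN are placeholders; PR #16 may configure this differently (e.g. via SAM/CloudFormation):

```shell
# Assumed schedule: fire daily at 06:00 UTC to pick up late-delivered logs.
SCHEDULE="cron(0 6 * * ? *)"
aws events put-rule --name TransformMissingGzDataDaily \
    --schedule-expression "$SCHEDULE"
# Point the rule at the lambda (function name from the comment above;
# account id and region in the ARN are placeholders).
aws events put-targets --rule TransformMissingGzDataDaily \
    --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:TransformMissingGzDataDailyFn"
```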

Based on above changes, we can import the historical data manually as follows:

  1. copy the historical log files to the new directory.
    $ aws s3 cp --recursive --exclude "*" --include "EFXYZ1234XYZ.2020-11-20*" \
        s3://cloudfrontlogs-raw/ s3://cloudfrontlogs-all/new
  2. create new GZ partitions by invoking the CreatePartFn lambda
    $ aws lambda invoke --function-name CreatePartFn \
        --payload '{"dth":"2020-11-20T00"}' --cli-binary-format raw-in-base64-out response.json
    .....
    $ aws lambda invoke --function-name CreatePartFn \
        --payload '{"dth":"2020-11-20T23"}' --cli-binary-format raw-in-base64-out response.json
  3. transform the missing GZ logs by invoking the TransformMissingGzDataDailyFn lambda
    $ aws lambda invoke --function-name TransformMissingGzDataDailyFn \
        --payload '{"dt":"2020-11-20"}' --cli-binary-format raw-in-base64-out response.json \
        --cli-read-timeout 300
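Step 2 above elides hours 01 through 22; rather than invoking CreatePartFn 24 times by hand, the same calls can be scripted with a loop (function name and payload shape taken from the steps above; the date is the same example value):

```shell
# Invoke CreatePartFn once per hour of the target day.
# seq -w zero-pads, producing 00, 01, ..., 23 to match the "dth" format.
DT="2020-11-20"
for h in $(seq -w 0 23); do
  aws lambda invoke --function-name CreatePartFn \
      --payload "{\"dth\":\"${DT}T${h}\"}" \
      --cli-binary-format raw-in-base64-out "response-${h}.json"
done
```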