aws-samples / amazon-cloudfront-access-logs-queries

Analyze your Amazon CloudFront Access Logs at Scale with Amazon Athena.
MIT No Attribution

Is it possible that log files are moved to an incorrect partition? #21

Open ensean opened 2 years ago

ensean commented 2 years ago

Since access log files are delivered to S3 asynchronously, a log file such as E271AZ5HG504X.2022-01-20-07.2bd0b06.gz may contain access log entries starting from 2022-01-20 08:00:00. If that log file is moved to the partition year=2022/month=01/day=20/hour=07, is it possible that a SQL query with the clause where year='2022' and month='01' and day='20' and hour='08' will miss this part of the data?
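
For illustration, here is a minimal Python sketch of how a file name like the one above would map to a partition prefix when only the name is considered. It assumes the layout `<distribution-id>.<YYYY-MM-DD-HH>.<unique-id>.gz`; the function name `partition_prefix` is hypothetical and is not taken from this repository's code.

```python
import re

# Assumed file name layout: <distribution-id>.<YYYY-MM-DD-HH>.<unique-id>.gz
LOG_NAME = re.compile(
    r"^(?P<dist>[^.]+)\."
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})-(?P<hour>\d{2})\."
    r"(?P<uid>[^.]+)\.gz$"
)


def partition_prefix(file_name: str) -> str:
    """Derive the partition prefix from the timestamp embedded in the file name."""
    match = LOG_NAME.match(file_name)
    if match is None:
        raise ValueError(f"unexpected log file name: {file_name}")
    # Only the date and hour groups are used; the other groups are ignored.
    return "year={year}/month={month}/day={day}/hour={hour}".format(
        **match.groupdict()
    )


print(partition_prefix("E271AZ5HG504X.2022-01-20-07.2bd0b06.gz"))
```

With this mapping the partition depends only on the hour in the file name, so any entry inside the file with a later timestamp would still land under hour=07.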

steffeng commented 2 years ago

Hi @ensean, from the docs:

CloudFront saves them in a log file for which the file name includes the date and time of the period in which the requests occurred, not the date and time when the file was delivered.

If this differs from what you have observed, can you please follow up with an AWS support request?

The same partitioning pattern can also be applied to other use cases. Depending on the case, the partition values can differ from some or all of the timestamp columns in the data being queried, and you then need to adjust your queries to look at more partitions. I've explained it in this video.
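
As a rough sketch of what "looking at more partitions" could mean, the Python helper below builds a partition predicate that covers the requested time window plus a configurable number of neighboring hours on each side. The helper name `partition_predicate` and the `slack_hours` parameter are hypothetical; the partition columns year, month, day and hour are assumed to be strings, as in the where clause from the question.

```python
from datetime import datetime, timedelta


def partition_predicate(start: datetime, end: datetime, slack_hours: int = 1) -> str:
    """Build a predicate over hourly partitions for [start, end],
    widened by slack_hours on both sides."""
    # Begin at the top of the first hour to scan.
    current = (start - timedelta(hours=slack_hours)).replace(
        minute=0, second=0, microsecond=0
    )
    last = end + timedelta(hours=slack_hours)
    clauses = []
    while current <= last:
        clauses.append(
            f"(year='{current:%Y}' AND month='{current:%m}' "
            f"AND day='{current:%d}' AND hour='{current:%H}')"
        )
        current += timedelta(hours=1)
    return " OR ".join(clauses)


# Requests between 08:00:00 and 08:59:59, also scanning one hour before and after.
print(partition_predicate(datetime(2022, 1, 20, 8), datetime(2022, 1, 20, 8, 59, 59)))
```

The predicate only selects which partitions are scanned. A row-level filter on the timestamp columns of the data is still needed to keep only the requests in the exact window.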