alsmola / cloudtrail-parquet-glue

Glue workflow to convert CloudTrail logs to Athena-friendly Parquet format
MIT License

Question - Data retention and Bucket Lifecycle policies #2

Open dlethin opened 3 years ago

dlethin commented 3 years ago

Thanks for sharing this project and the extensive blog post behind it. I've got a few questions -

While I've manually set up Athena to run ad hoc queries against CloudTrail buckets on an as-needed basis, we're considering our options for automating this so that our CloudTrail logs are searchable in Athena on a daily basis. Your approach sounds interesting, and it's even more appealing because we use Terraform. I'm trying to catch up and learn about AWS Glue, as I have no direct experience with it or with the Parquet format.

I'm curious how this solution is affected by the number of accounts and regions in the org writing CloudTrail logs to the raw source bucket, and by the number of days this data is kept in the bucket before being purged by a lifecycle_rule. For example, what if we keep 6 months of log files for 30+ accounts with activity in 5 different regions? (I'm not exactly sure what our final retention policy will be; we're still working that out.)

Does the crawler only crawl new objects uploaded since the last run, or does it need to crawl the entire bucket every day? What happens when objects expire via the lifecycle rule in the source bucket? Would the objects in the bucket holding the transformed Parquet files get purged automatically, or would a lifecycle rule need to be written on the target bucket as well? Are the table partitions actively pruned to keep up with dates that have expired as a result of the lifecycle rule?

I will try to schedule some time in the coming week or two to try this out and that might shed light on my questions.

BigDataDaddy commented 3 years ago

With respect to data retention and the S3 bucket lifecycle rule removing transformed data that has already been crawled: that's going to depend on how the crawler's behavior is configured. See this doc page for the configuration options: https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html
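
As a rough illustration of the crawler settings that page describes, here is a minimal Python sketch that builds the keyword arguments one might pass to boto3's Glue `update_crawler` call. The helper name `build_crawler_update` and the crawler name `cloudtrail-parquet-crawler` are hypothetical and not taken from this repo. It assumes two modes: an incremental crawl that only visits new folders (which, per the Glue docs, requires both schema change behaviors to be `LOG`), and a full recrawl that removes catalog entries for objects that no longer exist, for example after a lifecycle rule expires them. Whether either mode matches how this project configures its crawler is not confirmed here.

```python
from typing import Any, Dict


def build_crawler_update(crawler_name: str, incremental: bool) -> Dict[str, Any]:
    """Build kwargs for a Glue update_crawler call (hypothetical helper).

    incremental=True  -> crawl only folders added since the last run.
    incremental=False -> recrawl everything and drop catalog entries
                         whose underlying S3 objects are gone.
    """
    if incremental:
        recrawl = "CRAWL_NEW_FOLDERS_ONLY"
        # Incremental crawls require LOG for both behaviors, so expired
        # objects are only logged, not removed from the catalog.
        update_behavior = "LOG"
        delete_behavior = "LOG"
    else:
        recrawl = "CRAWL_EVERYTHING"
        update_behavior = "UPDATE_IN_DATABASE"
        delete_behavior = "DELETE_FROM_DATABASE"

    return {
        "Name": crawler_name,
        "RecrawlPolicy": {"RecrawlBehavior": recrawl},
        "SchemaChangePolicy": {
            "UpdateBehavior": update_behavior,
            "DeleteBehavior": delete_behavior,
        },
    }


if __name__ == "__main__":
    # Assumed crawler name, for illustration only.
    print(build_crawler_update("cloudtrail-parquet-crawler", incremental=True))
```

To apply it, the resulting dict could be passed as `boto3.client("glue").update_crawler(**kwargs)`. Note that the crawler settings only affect the Glue Data Catalog; they do not delete any S3 objects, so expiring the transformed Parquet files would still need its own lifecycle rule on the target bucket.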