alsmola / cloudtrail-parquet-glue

Glue workflow to convert CloudTrail logs to Athena-friendly Parquet format
MIT License

Solution does not scale for large existing AWS accounts. #4

Open BigDataDaddy opened 3 years ago

BigDataDaddy commented 3 years ago

I have a large existing AWS account with a few years of CloudTrail log data already in my source S3 bucket. After deploying this solution and manually starting the first crawl, the CloudTrailRawCrawler crawler ran but did not finish in any reasonable amount of time: after 24 hours it still hadn't completed its first pass over the source CloudTrail bucket. I suspect this is due to the few years' worth of daily partitions and the very large number of small CloudTrail log files. Note that this source S3 bucket contains the CloudTrail logs for only one account, 99% dominated by a single AWS region, so there isn't an unreasonable number of partitions to crawl.

Is there any way to speed up or parallelize the initial crawl?
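One workaround for a huge backlog (not part of this repo, just a sketch): since CloudTrail's S3 layout is fully predictable (`AWSLogs/<account>/CloudTrail/<region>/<year>/<month>/<day>/`), you can skip the initial crawl entirely and register the historical daily partitions yourself with batched `ALTER TABLE ... ADD PARTITION` statements run through Athena, then let the crawler handle only new data. The table name, bucket, account ID, and partition-key names below are hypothetical and would need to match whatever the raw-table crawler actually creates:

```python
from datetime import date, timedelta

def partition_ddl(table, bucket, account_id, region, start, end, batch_size=50):
    """Yield ALTER TABLE statements registering CloudTrail daily partitions
    directly, bypassing the initial Glue crawl of the historical backlog.

    Assumes the raw table is partitioned on (region, year, month, day) and
    that objects live under CloudTrail's standard AWSLogs prefix.
    """
    specs = []
    d = start
    while d <= end:
        specs.append(
            f"PARTITION (region='{region}', year='{d.year}', "
            f"month='{d.month:02d}', day='{d.day:02d}') LOCATION "
            f"'s3://{bucket}/AWSLogs/{account_id}/CloudTrail/{region}/"
            f"{d.year}/{d.month:02d}/{d.day:02d}/'"
        )
        d += timedelta(days=1)
    # Batch many partitions into one statement, staying well under
    # Athena's query-length limit.
    for i in range(0, len(specs), batch_size):
        yield (f"ALTER TABLE {table} ADD IF NOT EXISTS\n  "
               + "\n  ".join(specs[i:i + batch_size]))

# Example: generate the DDL to backfill January 2020 for one account/region.
stmts = list(partition_ddl(
    "cloudtrail_raw", "my-trail-bucket", "111122223333",
    "us-east-1", date(2020, 1, 1), date(2020, 1, 31)))
```

Each yielded statement can be submitted with `boto3`'s `athena.start_query_execution`; with a few years of single-region data this is a few thousand partitions, which registers in minutes rather than the days a crawl over millions of small objects can take.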

BigDataDaddy commented 3 years ago

BTW, I chose this repo over several other CloudTrail partitioners for 2 reasons:

  1. Transforming the data from horrible JSON to Parquet is the absolute right thing to do for query speed, especially in Athena.
  2. I love the use of a Terraform module for deployment compared to cryptic AWS CloudFormation (CF) or Cloud Development Kit (CDK) code.

Thanks to Alex for that!!!

BigDataDaddy commented 3 years ago

Is anyone responding to issues for this repo?

BigDataDaddy commented 2 years ago

Still no response to scaling this solution?