alsmola / cloudtrail-parquet-glue

Glue workflow to convert CloudTrail logs to Athena-friendly Parquet format
MIT License

non-queryable raw logs, inconsistent results with same input, duplicate results #6

Open bcenker opened 3 years ago

bcenker commented 3 years ago

I've tried setting this up to compare it against an existing non-Glue-based solution, and I'm running into a couple of issues I'm hoping someone can help with. Disclaimer: I have little previous experience with Glue Crawlers or Glue ETL - it's possible I'm making one or more simple mistakes.

I have been able to reproduce the behavior by resetting the environment (destroy terraform, empty buckets, delete glue database/tables) and following the process below:

It would take additional testing to be certain, but at least with the static input I tested with, subsequent workflow runs appear to reprocess the same events and convert them to Parquet again. The output of each ETL run also appears to be inconsistent for that same static input (i.e. each run converted a different number of events to Parquet).

Has anyone gotten this to work properly without modifying the Crawlers or the ETL job?
If so, were you able to query the raw log table? Did multiple runs on the same dataset produce duplicate results for you? Has anyone else tried to run a static dataset through this workflow multiple times and compare the output (parquet) to the input (raw) event data to validate the consistency of the ETL process?
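
A comparison like the one described above could be sketched roughly as follows. This is only a minimal sketch, not part of this repo: it assumes each event carries CloudTrail's `eventID` field, that raw log files are gzipped JSON with a top-level `Records` array, and that the Parquet rows have already been loaded into a list of dicts by some other means. The function names `load_raw_events` and `compare_event_ids` are hypothetical, and the ID column in the Parquet output may be named differently (hence the `id_key` parameter).

```python
import gzip
import json
from collections import Counter


def load_raw_events(path):
    """Read one gzipped CloudTrail log file and return its list of records."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh).get("Records", [])


def compare_event_ids(raw_events, converted_events, id_key="eventID"):
    """Compare event IDs between the raw input and the converted output.

    Both arguments are lists of dicts. Returns the IDs missing from the
    output, the IDs present only in the output, and the IDs that appear
    more than once in the output together with their counts.
    """
    raw_ids = {event[id_key] for event in raw_events}
    counts = Counter(event[id_key] for event in converted_events)
    converted_ids = set(counts)
    return {
        "missing": sorted(raw_ids - converted_ids),
        "unexpected": sorted(converted_ids - raw_ids),
        "duplicated": {i: n for i, n in sorted(counts.items()) if n > 1},
    }
```

Running this after each workflow run against the same static input would show whether the set of converted events changes between runs and whether any event was written more than once.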

Thanks in advance - looking forward to hopefully learning a little more about this!