awslabs / athena-glue-service-logs

Glue scripts for converting AWS Service Logs for use in Athena
Apache License 2.0

Investigate Glue Crawlers and Workflows #15


dacort commented 5 years ago

Crawlers can now use existing Data Catalog tables as their source, which may let us deprecate our custom partitioning code that searches S3 for new partitions.

In combination with Workflows, we could easily trigger a Crawler to run after our job is finished.
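
For reference, pointing a crawler at an existing catalog table goes through CatalogTargets. A rough boto3 sketch; every name below is a placeholder, not anything this project defines yet:

import boto3

glue = boto3.client('glue')

# Placeholder names for illustration only.
glue.create_crawler(
    Name='s3_access_partition_crawler',
    Role='AWSGlueServiceRoleForLogConversion',
    Targets={
        'CatalogTargets': [
            {'DatabaseName': 'aws_service_logs', 'Tables': ['s3_access_converted']}
        ]
    },
    # Catalog-source crawlers update the existing table in place; I believe
    # the delete behavior has to be LOG for this source type.
    SchemaChangePolicy={
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'LOG',
    },
)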

davehowell commented 4 years ago

I've done this before and it works well. In CloudFormation or Terraform, define the Glue database, the table, and the crawler that depends on that table; then, at the end of the Glue script after job.commit(), add something like this. Super easy!

import boto3

# ${region} and ${glue_crawler_name} are template variables filled in by
# CloudFormation/Terraform when the script is rendered.
glue_client = boto3.client('glue', region_name='${region}')
glue_client.start_crawler(Name='${glue_crawler_name}')
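
One caveat: start_crawler fails if the crawler is still running from a previous trigger, so if your jobs can run back-to-back it's worth guarding the call (sketch):

try:
    glue_client.start_crawler(Name='${glue_crawler_name}')
except glue_client.exceptions.CrawlerRunningException:
    # Already running; the in-flight crawl will pick up the new partitions.
    pass
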
dacort commented 3 years ago

I'm looking back into this again, as noted in #23.

The part of this project that I was probably least happy with (but also kind of proud of 😆) was the partition management code. We couldn't use Glue Crawlers originally because we wanted to control the table names and already knew the schemas, but now we can pre-create the tables and use Crawlers just to update the partitions.

This, to me, seems like a better approach than managing custom partitioning logic inside the job itself, but it does have the downside of a more complex workflow. Instead of a single job that manages the raw and converted tables and their partitions, we would need the conversion job itself, pre-created tables, a Crawler (or Crawlers) to update partitions on those tables, and a Workflow with triggers to tie it all together, as sketched below.
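
Roughly, the orchestration could be wired up like this with boto3 (a sketch only; every name below is a placeholder):

import boto3

glue = boto3.client('glue')

glue.create_workflow(Name='convert_service_logs')

# An on-demand (or scheduled) trigger kicks off the conversion job.
glue.create_trigger(
    Name='start_conversion',
    WorkflowName='convert_service_logs',
    Type='ON_DEMAND',
    Actions=[{'JobName': 's3_access_log_converter'}],
)

# A conditional trigger starts the partition crawler once the job succeeds.
glue.create_trigger(
    Name='crawl_after_conversion',
    WorkflowName='convert_service_logs',
    Type='CONDITIONAL',
    StartOnCreation=True,
    Predicate={
        'Conditions': [{
            'LogicalOperator': 'EQUALS',
            'JobName': 's3_access_log_converter',
            'State': 'SUCCEEDED',
        }]
    },
    Actions=[{'CrawlerName': 's3_access_partition_crawler'}],
)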

And with the addition of Blueprints, we could essentially package this all up. Blueprints can take a set of parameters (see screenshot below) and then you can create a Workflow from the Blueprint.
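
For example, a Blueprint's layout script could declare the job and the crawler with a dependency between them. This is a very rough sketch based on my reading of the aws-glue-blueprint-libs samples, so treat the exact class and field names as assumptions; every parameter name here is made up:

from awsglue.blueprint.workflow import Workflow, Entities
from awsglue.blueprint.job import Job
from awsglue.blueprint.crawler import Crawler

def generate_layout(user_params, system_params):
    # user_params would carry the parameters shown in the screenshot
    # (service name, source/target locations, etc.) -- all hypothetical here.
    conversion_job = Job(
        Name='{}_conversion'.format(user_params['WorkflowName']),
        Command={'Name': 'glueetl',
                 'ScriptLocation': user_params['ScriptLocation'],
                 'PythonVersion': '3'},
        Role=user_params['PassRole'],
    )
    partition_crawler = Crawler(
        Name='{}_partitions'.format(user_params['WorkflowName']),
        Role=user_params['PassRole'],
        Targets={'CatalogTargets': [{
            'DatabaseName': user_params['DatabaseName'],
            'Tables': [user_params['TableName']],
        }]},
        SchemaChangePolicy={'DeleteBehavior': 'LOG'},
        # Run the crawler only after the conversion job succeeds.
        DependsOn={conversion_job: 'SUCCEEDED'},
        WaitForDependencies='AND',
    )
    return Workflow(
        Name=user_params['WorkflowName'],
        Entities=Entities(Jobs=[conversion_job], Crawlers=[partition_crawler]),
    )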

Combining Blueprints with Workflows and pre-configured Crawlers would probably cut 80% of the code in this project, which would be a fantastic result. The more components of Glue I can successfully leverage, the better.

A couple notes on running Crawlers on existing tables: