awslabs / athena-glue-service-logs

Glue scripts for converting AWS Service Logs for use in Athena
Apache License 2.0

Investigate Glue Crawlers and Workflows #15


dacort commented 5 years ago

Crawlers can now use existing Data Catalog tables as their source, which may let us deprecate our custom partitioning code that searches S3 for new partitions.

In combination with Workflows, we could easily trigger a Crawler to run after our job is finished.
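
For reference, pointing a crawler at an existing catalog table goes through CatalogTargets. A rough boto3 sketch; every name below is a placeholder, not anything this project defines yet:

import boto3

glue = boto3.client('glue')

# Placeholder names for illustration only.
glue.create_crawler(
    Name='s3_access_partition_crawler',
    Role='AWSGlueServiceRoleForLogConversion',
    Targets={
        'CatalogTargets': [
            {'DatabaseName': 'aws_service_logs', 'Tables': ['s3_access_converted']}
        ]
    },
    # Catalog-source crawlers update the existing table in place; I believe
    # the delete behavior has to be LOG for this source type.
    SchemaChangePolicy={
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'LOG',
    },
)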

davehowell commented 4 years ago

I've done this before and it works well. In CloudFormation or Terraform, define the Glue database, the table, and the crawler that depends on that table; then, at the end of the Glue script after job.commit(), add something like this. Super easy!

import boto3

# ${region} and ${glue_crawler_name} are template variables filled in by
# CloudFormation/Terraform when the script is rendered.
glue_client = boto3.client('glue', region_name='${region}')
glue_client.start_crawler(Name='${glue_crawler_name}')
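
One caveat: start_crawler fails if the crawler is still running from a previous trigger, so if your jobs can run back-to-back it's worth guarding the call (sketch):

try:
    glue_client.start_crawler(Name='${glue_crawler_name}')
except glue_client.exceptions.CrawlerRunningException:
    # Already running; the in-flight crawl will pick up the new partitions.
    pass
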
dacort commented 3 years ago

I'm looking back into this again, as noted in #23.

The part of this project that I was probably least happy with (but also kind of proud of 😆) was the partition management code. We couldn't use Glue Crawlers originally because we wanted to control the table names and already knew the schemas, but now we can pre-create the tables and use Crawlers just to update the partitions.

This, to me, seems like a better approach than managing custom partitioning logic inside the job itself, but it does have the downside of a more complex workflow. Instead of a single job that manages the raw and converted tables and their partitions, we would need the conversion job itself, pre-created tables, a Crawler (or Crawlers) to update partitions on those tables, and a Workflow with triggers to tie it all together, as sketched below.
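
Roughly, the orchestration could be wired up like this with boto3 (a sketch only; every name below is a placeholder):

import boto3

glue = boto3.client('glue')

glue.create_workflow(Name='convert_service_logs')

# An on-demand (or scheduled) trigger kicks off the conversion job.
glue.create_trigger(
    Name='start_conversion',
    WorkflowName='convert_service_logs',
    Type='ON_DEMAND',
    Actions=[{'JobName': 's3_access_log_converter'}],
)

# A conditional trigger starts the partition crawler once the job succeeds.
glue.create_trigger(
    Name='crawl_after_conversion',
    WorkflowName='convert_service_logs',
    Type='CONDITIONAL',
    StartOnCreation=True,
    Predicate={
        'Conditions': [{
            'LogicalOperator': 'EQUALS',
            'JobName': 's3_access_log_converter',
            'State': 'SUCCEEDED',
        }]
    },
    Actions=[{'CrawlerName': 's3_access_partition_crawler'}],
)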

And with the addition of Blueprints, we could essentially package this all up. Blueprints can take a set of parameters (see screenshot below) and then you can create a Workflow from the Blueprint.
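
For example, a Blueprint's layout script could declare the job and the crawler with a dependency between them. This is a very rough sketch based on my reading of the aws-glue-blueprint-libs samples, so treat the exact class and field names as assumptions; every parameter name here is made up:

from awsglue.blueprint.workflow import Workflow, Entities
from awsglue.blueprint.job import Job
from awsglue.blueprint.crawler import Crawler

def generate_layout(user_params, system_params):
    # user_params would carry the parameters shown in the screenshot
    # (service name, source/target locations, etc.) -- all hypothetical here.
    conversion_job = Job(
        Name='{}_conversion'.format(user_params['WorkflowName']),
        Command={'Name': 'glueetl',
                 'ScriptLocation': user_params['ScriptLocation'],
                 'PythonVersion': '3'},
        Role=user_params['PassRole'],
    )
    partition_crawler = Crawler(
        Name='{}_partitions'.format(user_params['WorkflowName']),
        Role=user_params['PassRole'],
        Targets={'CatalogTargets': [{
            'DatabaseName': user_params['DatabaseName'],
            'Tables': [user_params['TableName']],
        }]},
        SchemaChangePolicy={'DeleteBehavior': 'LOG'},
        # Run the crawler only after the conversion job succeeds.
        DependsOn={conversion_job: 'SUCCEEDED'},
        WaitForDependencies='AND',
    )
    return Workflow(
        Name=user_params['WorkflowName'],
        Entities=Entities(Jobs=[conversion_job], Crawlers=[partition_crawler]),
    )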

Combining Blueprints with Workflows and pre-configured Crawlers would probably cut 80% of the code in this project, which would be a fantastic result. The more components of Glue I can successfully leverage, the better.

A couple notes on running Crawlers on existing tables: