how to handle aws schema changes

xxxVxxx commented 6 years ago

The problem with using Athena or quicksight is that it , it does not take into consideration the dynamic change in schema of the billing files , that AWS normally does. Due to this the reports starts breaking. Is there a way to handle this using glue or any other method for this tool. Would really be helpful.

concurrencylabs commented 6 years ago

Yes, that's one of the main problems the code in this repo solves. If you run the report_utils.py script as described in the README file: python report_utils.py --action=prepare-athena --source-bucket=<s3-bucket-with-cost-usage-reports> --source-prefix=<folder>/ --dest-bucket=<s3-bucket-athena-will-read-files-from> --dest-prefix=<folder>/ --year=<year-in-4-digits> --month=<month-in-1-or-2-digits>, the script reads the manifest.json file created by AWS (which contains the schema, which as you mention varies regularly) and automatically creates the Athena table with the latest schema.

Alternatively, you can follow the instructions in the section "Serverless Application Model Stack" and deploy the SAM template under cloudformation/process-cur-sam.yml. This triggers a Step Function that creates the Athena table every time a new Cost and Usage report is placed in S3, using an S3 Event.

I hope this answers your question.

marcelboettcher commented 6 years ago

Hey @concurrencylabs,

thanks for your work and explanation. I understand that each billing file can have a different schema and you create a Athena table for it.

Does the schema of aws contain only new columns or do they even delete some? I experiencing both cases right now during two months (October and November 2018).

Do you create an Athena table with a schema containing all ever used columns?

Can QuickSight handle the changing Athena table and reimport the new structure automatically?

Thanks for helping out. AWS is not doing a great job here. Marcel

concurrencylabs commented 6 years ago

Hi Marcel - the AWS Manifest file contains all columns, regardless of what's been added/removed from the previous schema. That's why it's important to delete an existing table in Athena and create a new one with the new schema. The prepare-athena solution does that and it drops any existing Athena table and it creates a new one with the schema from the AWS manifest:

python report_utils.py --action=prepare-athena --source-bucket=<s3-bucket-with-cost-usage-reports> --source-prefix=<folder>/ --dest-bucket=<s3-bucket-athena-will-read-files-from> --dest-prefix=<folder>/ --year=<year-in-4-digits> --month=<month-in-1-or-2-digits>

If you want to visualize your billing data in QuickSight, then you can create a QuiskSight manifest:

python report_utils.py --source-bucket=<s3-bucket-for-quicksight-files> --source-prefix=<folder>/ --action=create-manifest --manifest-type=quicksight --year=<year-in-4-digits> --month=<month-in-1-or-2-digits>

The script outputs the full location of the JSON QuickSight manifest file that you specify in the QuickSight console when you configure an S3 data source. As of now, triggering the creation of the QuickSight manifest is manual, but it should be easy to automate. QuickSight supports an option to automatically refresh data sources, provided the location of the S3 manifest is constant.

concurrencylabs / aws-cost-analysis

how to handle aws schema changes #14