aws-samples / aws-serverless-workshop-innovator-island

Welcome to the Innovator Island serverless workshop! This repo contains all the instructions and code you need to complete the workshop.
MIT No Attribution

Firehose > S3 reaches QuickSight's limits extremely quickly #34

Closed shrkbait-hpe closed 4 years ago

shrkbait-hpe commented 4 years ago

Thank you so much for this amazing workshop. I was in dire need of an easy analytics solution, deployed your day 4 end to end without trouble, and had some great charts. Then I imported all my historical data and again all was well. I recently went to refresh the data and it is failing because the 1,000-file limit (https://docs.aws.amazon.com/quicksight/latest/user/data-source-limits.html) has been reached. I have some enterprise support cases open but am hoping you can provide some advice.

I did a little math: at Firehose's slowest write interval of 15 minutes, you reach 1,000 files within about 11 days of steady traffic.
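For reference, the arithmetic behind that estimate, assuming one S3 object per buffer flush at Firehose's maximum 900-second buffer interval:

```python
# Firehose's maximum buffer interval is 900 seconds (15 minutes),
# so the slowest steady delivery rate is one S3 object per 15 minutes.
objects_per_hour = 60 // 15               # 4
objects_per_day = objects_per_hour * 24   # 96
days_to_limit = 1000 / objects_per_day    # ~10.4 days to hit QuickSight's cap

print(objects_per_day, round(days_to_limit, 1))
```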

To make matters worse, if any of your files are smaller than 5 MB you cannot use some of the common S3 tools to concatenate them via multipart upload, since every part except the last must be at least 5 MB.

To stick with S3 it seems one needs a daily and then monthly job to join files together. Have you seen anyone do this?
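For what it's worth, the planning half of such a compaction job is straightforward, since Firehose's default S3 keys embed a `YYYY/MM/DD/HH/` prefix. This is a minimal sketch that only groups keys into per-day batches; the actual download/concatenate/re-upload/delete steps would use boto3 and are omitted here, and `group_keys_by_day` is a made-up helper name:

```python
from collections import defaultdict

def group_keys_by_day(keys):
    """Group S3 object keys written under Firehose's default
    YYYY/MM/DD/HH/ prefix into one batch per calendar day.
    A daily compaction job would then concatenate each batch
    into a single object and delete the originals."""
    batches = defaultdict(list)
    for key in keys:
        parts = key.split("/")
        if len(parts) >= 4:  # expect YYYY/MM/DD/HH/object-name
            day = "/".join(parts[:3])
            batches[day].append(key)
    return dict(batches)

keys = [
    "2020/06/01/00/stream-1-a",
    "2020/06/01/13/stream-1-b",
    "2020/06/02/07/stream-1-c",
]
# Produces two batches: 2020/06/01 (2 keys) and 2020/06/02 (1 key)
print(group_keys_by_day(keys))
```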

I would kindly suggest you add a little warning to day 4 that, as is, it is really a proof of concept.

Thank you!

jbesw commented 4 years ago

Great to hear you enjoyed the workshop! In practice, when working with high levels of traffic, or over longer time periods, there are a few things you need to do to manage the data. This workshop skips some steps that are common in managing streaming data systems, mainly to keep the workshop manageable and focused on the QuickSight analysis.

It depends on the use case but I would recommend using a data transformation Lambda function at the point of ingestion (https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html). This is an extremely common first step in cleaning and preparing data. You can also use a secondary step (such as a Lambda function invoked by the initial S3 PUT operation) to gzip files and/or concatenate smaller files - this compression step can dramatically reduce the number of files and the overall data size. In larger cases, you can also use AWS Glue Crawlers to scan datasets and create a Glue Catalog that can be used via services like Athena to work with QuickSight.
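As a rough illustration of the transformation step linked above: the record envelope (`recordId`, base64-encoded `data`, and the `Ok`/`Dropped` result values) follows the Firehose data-transformation contract, but the field names inside the payload (`rideId`, `waitTime`) are invented for this sketch:

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose data-transformation handler: decode each record,
    drop malformed payloads, and re-emit cleaned, newline-delimited
    JSON so the delivered S3 files stay query-friendly."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Example cleanup: keep only the fields the dashboard needs
            # ("rideId" and "waitTime" are made-up field names here).
            cleaned = {
                "rideId": payload["rideId"],
                "waitTime": int(payload["waitTime"]),
            }
            data = (json.dumps(cleaned) + "\n").encode()
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data).decode(),
            })
        except (KeyError, ValueError):
            # Malformed record: mark it Dropped so it never reaches S3.
            output.append({
                "recordId": record["recordId"],
                "result": "Dropped",
                "data": record["data"],
            })
    return {"records": output}
```

Because the handler touches only the record payloads, it can be exercised locally with a hand-built event before wiring it into the delivery stream.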

I hope this helps! I might develop this module into a more complex or standalone analytics workshop if there is interest. If you have any questions, feel free to ping me anytime - jbeswick at Amazon.com.

Thanks, James