Open kaipak opened 1 year ago
Still running into Glue permission errors when updating table views. Still working through this with ops.
There has been some progress, but it is still failing at the end.
Moving to icebox. Per discussion today, going to focus on documenting the procedures for loading logs and generating metrics rather than automating them.
I've made progress on this issue and am currently testing. I have a crawler that can successfully update Athena tables when new files appear in S3. However, I've run into significant data quality issues that we should address. I will create a new issue for this, but here is a brief description:
The datetime field is in a non-standard format, so it needs complex pre-processing before it can be used in standard data science tools like SQL or Python, and that pre-processing is error-prone. I've also run into instances where this field has unexpected whitespace or is missing portions of the datetime. This can be handled in code too, but it adds yet another layer of complexity to manage across multiple tools in different codebases. We may want to consider a more robust set of log pre-processing steps suited to data analytics applications.
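To make the pre-processing concern concrete, here is a rough sketch of what defensive parsing could look like in Python. The actual log format isn't spelled out in this thread, so the formats in `CANDIDATE_FORMATS` (an Apache-style timestamp and two truncated variants) and the function name `parse_log_datetime` are assumptions for illustration only; swap in the real format before using anything like this.

```python
from datetime import datetime
from typing import Optional

# ASSUMPTION: these formats are placeholders, not the real log format.
# They are tried in order, most complete first, so that values with
# missing portions (no offset, no time) can still be parsed.
CANDIDATE_FORMATS = (
    "%d/%b/%Y:%H:%M:%S %z",  # full timestamp with UTC offset
    "%d/%b/%Y:%H:%M:%S",     # offset missing
    "%d/%b/%Y",              # time missing
)


def parse_log_datetime(raw: str) -> Optional[datetime]:
    """Parse a raw datetime field, returning None if nothing matches."""
    # Trim the ends and collapse internal runs of whitespace to one space.
    cleaned = " ".join(raw.split())
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    # Unparseable values are returned as None so callers can count or
    # quarantine them instead of failing the whole load.
    return None
```

One caveat with this approach: values parsed without an offset come back as naive datetimes while the full ones are timezone-aware, so a real pipeline would need to decide how to reconcile the two. Doing the cleanup once, upstream of Athena, would avoid repeating this logic in each tool.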
Need additional AWS permissions for the crawler to update the remaining Athena tables. Ticket created with SA team: DSIO-5624.
The crawler will update the Athena tables automatically after logs are synced from the fileserver to S3.
Review example here: https://www.mikulskibartosz.name/start-glue-crawler-using-boto3/#:~:text=AWS%20gives%20us%20a%20few,to%20refresh%20an%20Athena%20table.&text=If%20the%20crawler%20already%20exists%2C%20we%20can%20reuse%20it.
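Along the lines of the linked example, here is a minimal sketch of triggering an existing crawler after the S3 sync and waiting for it to finish. It assumes the crawler already exists and that the caller has the Glue permissions discussed above; the function name `run_crawler`, the crawler name `logs-crawler`, and the polling defaults are all made up for illustration. The client is passed in so the function can be exercised without AWS access; in practice it would be `boto3.client("glue")`.

```python
import time


def run_crawler(glue, crawler_name, poll_seconds=30, timeout_seconds=1800,
                sleep=time.sleep):
    """Start a Glue crawler and wait until it reports READY again.

    `glue` is expected to behave like boto3.client("glue"). Returns the
    number of seconds spent waiting (a multiple of poll_seconds).
    """
    glue.start_crawler(Name=crawler_name)
    waited = 0
    while waited < timeout_seconds:
        # get_crawler reports State as RUNNING, STOPPING, or READY.
        state = glue.get_crawler(Name=crawler_name)["Crawler"]["State"]
        if state == "READY":
            return waited
        sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError(f"crawler {crawler_name!r} not READY after {waited}s")


# Hypothetical usage after the fileserver-to-S3 sync completes:
#   import boto3
#   run_crawler(boto3.client("glue"), "logs-crawler")
```

This is only a sketch: it does not handle the case where the crawler is already running when `start_crawler` is called, and it treats READY as success without checking the outcome of the last crawl.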