NASA-PDS / web-analytics

Implement crawler to refresh Athena table partitions. #19

Open kaipak opened 1 year ago

kaipak commented 1 year ago

The crawler will automate updating the Athena tables after logs are synced from the fileserver to S3.

Review example here: https://www.mikulskibartosz.name/start-glue-crawler-using-boto3/#:~:text=AWS%20gives%20us%20a%20few,to%20refresh%20an%20Athena%20table.&text=If%20the%20crawler%20already%20exists%2C%20we%20can%20reuse%20it.
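For reference, a minimal sketch of the approach described in that example: start an existing Glue crawler with boto3 and wait for it to finish. The helper names (`make_glue_client`, `run_crawler`), the polling interval, and the timeout are assumptions for illustration, not code from this repository; the crawler name must be whatever is configured in the account.

```python
import time


def make_glue_client(region_name=None):
    """Create a Glue client. boto3 is imported lazily so the polling
    logic below can be exercised without AWS credentials."""
    import boto3

    return boto3.client("glue", region_name=region_name)


def run_crawler(glue, crawler_name, poll_seconds=30, timeout_seconds=1800):
    """Start an existing crawler and block until it returns to READY.

    Returns the status of the last crawl (e.g. SUCCEEDED or FAILED).
    The crawler itself is assumed to exist already.
    """
    try:
        glue.start_crawler(Name=crawler_name)
    except glue.exceptions.CrawlerRunningException:
        # A run is already in progress; fall through and wait for it.
        pass

    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        crawler = glue.get_crawler(Name=crawler_name)["Crawler"]
        if crawler["State"] == "READY":
            return crawler.get("LastCrawl", {}).get("Status", "UNKNOWN")
        time.sleep(poll_seconds)

    raise TimeoutError(f"Crawler {crawler_name!r} did not finish in time")
```

Usage would look something like `run_crawler(make_glue_client(), "<crawler-name>")`, run after the log sync to S3 completes. The calling role needs Glue permissions to start and read the crawler.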

kaipak commented 1 year ago

Still running into some Glue permission errors when updating table views. Working through them with ops.

kaipak commented 1 year ago

There has been some progress; however, it is still failing at the end.

jordanpadams commented 9 months ago

Moving to icebox. Per discussion today, we are going to focus on documenting procedures for loading logs and generating metrics rather than automating.

kaipak commented 5 months ago

I've made progress on this issue and am currently testing. I have a crawler that can successfully update Athena tables when new files appear in S3. However, I've run into significant data quality issues that we should address. I will create a new issue for this, but here is a brief description:

The datetime field is in a non-standard format that requires complex pre-processing in standard data science tools such as SQL or Python, which can lead to errors. I've also run into instances where this field has unexpected whitespace or is missing portions of the datetime. This can be handled in code too, but it adds yet another layer of complexity to manage across multiple tools in different codebases. We may want to consider a more robust set of log pre-processing steps suitable for data analytics applications.
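As a rough illustration of what such a pre-processing step might look like, here is a sketch that normalizes a timestamp to ISO 8601. It assumes the field uses the Apache access-log layout (`10/Oct/2023:13:55:36 -0700`), which is a guess; the actual format in these logs is not stated here. The function name `normalize_timestamp` is hypothetical.

```python
import re
from datetime import datetime

# Assumed layout (Apache access-log style); adjust to the real log format.
_ASSUMED_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def normalize_timestamp(raw):
    """Return an ISO 8601 string, or None if the value cannot be parsed.

    Handles stray surrounding brackets and irregular whitespace; values
    with missing portions are rejected rather than guessed at.
    """
    if raw is None:
        return None
    cleaned = raw.strip().strip("[]")
    # Collapse runs of whitespace to a single space.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    try:
        return datetime.strptime(cleaned, _ASSUMED_FORMAT).isoformat()
    except ValueError:
        return None
```

Under that assumption, `normalize_timestamp(" [10/Oct/2023:13:55:36  -0700] ")` returns `"2023-10-10T13:55:36-07:00"`, while a truncated value such as `"10/Oct/2023:13:55"` returns `None` so that bad rows can be counted and reviewed instead of silently loaded.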

kaipak commented 5 months ago

Need additional AWS permissions for the crawler to update the remaining Athena tables. Ticket created with the SA team: DSIO-5624.