ccao-data / data-architecture

Codebase for CCAO data infrastructure construction and management
https://ccao-data.github.io/data-architecture/

Refactor a few ingest scripts from R to dbt Python models #394

Open jeancochrane opened 2 months ago

jeancochrane commented 2 months ago

More detailed description to come. These scripts can be refactored into models:

dfsnow commented 2 months ago

We can only use the libraries that come preinstalled in the Athena PySpark environment: https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark-preinstalled-python-libraries.html

jeancochrane commented 2 months ago

This is unfortunately blocked because none of the pandas Excel engines come preinstalled in the PySpark Athena environment, so we would have to manage them in S3 in order to read Excel files.
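For anyone who wants to confirm the blocker from inside a session, here is a minimal sketch that checks which Excel engine packages are importable. The engine-to-module mapping is an assumption based on the engines that `pandas.read_excel` documents, and `available_excel_engines` is a hypothetical helper name, not something in this repo.

```python
from importlib.util import find_spec

# Assumption: engine names and import names follow the pandas.read_excel docs;
# adjust this mapping for the pandas version in your environment.
EXCEL_ENGINE_MODULES = {
    "openpyxl": "openpyxl",          # .xlsx
    "xlrd": "xlrd",                  # legacy .xls
    "pyxlsb": "pyxlsb",              # .xlsb
    "odf": "odf",                    # .ods (installed as the odfpy package)
    "calamine": "python_calamine",   # multi-format reader
}


def available_excel_engines(modules=EXCEL_ENGINE_MODULES):
    """Return {engine_name: bool} showing which engine modules can be imported.

    Hypothetical helper: it only checks that the top-level module can be found,
    without importing it or reading any file.
    """
    return {
        engine: find_spec(module) is not None
        for engine, module in modules.items()
    }


if __name__ == "__main__":
    for engine, ok in available_excel_engines().items():
        print(f"{engine}: {'available' if ok else 'missing'}")
```

If every engine reports as missing, `pandas.read_excel` has nothing to fall back on, which matches the situation described above.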

Damonamajor commented 2 months ago

@dfsnow recommended I look into spatial.access, which he thinks we should be able to implement without Excel. @jeancochrane

Damonamajor commented 1 month ago

https://github.com/ccao-data/data-architecture/blob/1576932011fac02dd43ae4a1fde6cf07db4f35bd/dbt/models/reporting/reporting.ratio_stats.py

Damonamajor commented 1 month ago

@dfsnow Is this open again, since (I'm presuming) we can use Excel now?