ccao-data / data-architecture

Codebase for CCAO data infrastructure construction and management
https://ccao-data.github.io/data-architecture/

Ingest affordability risk index (ARI) for modeling #471

Closed Damonamajor closed 3 weeks ago

Damonamajor commented 1 month ago

Billy and I are running into some conceptual issues with the ARI/DCI indexes. Originally, the query was built off of IHS, which made sense since both are geographic indexes for housing rankings. But that join is based on 2010 census data and the specific view census_2010. Billy tried removing that view but deemed it necessary. Because of this, the data has been housed in the 'other' bucket. Do we want to keep it there, or shift it to the 'spatial' bucket? Billy was also noncommittal about the need to create a census_2020 view. Is that something we want?

Also, I've been reading through the docs and looking through the different AWS buckets. It appears we have a lot of single-year databases (walkability, economy, etc.). Do we want ARI/DCI to be joined in that structure (based off a single year), or is the expectation that we will download updated datasets on a regular basis?

Related to this, when we last touched on it, we discussed joining the data based on the year of download. When I mentioned that to Billy / Nicole, there was hesitancy, and they suggested joining on the year the data was constructed, which would be 2023 for ARI and 2024 for DCI. If this is something we construct a single time, it would be easy to forward fill. But if we expect to update it annually (or at odd intervals), the SQL join gets a bit wonkier. I've been playing with a query whose goal is to identify the maximum year of upload that is less than the tract year, but it doesn't seem useful to implement until we have a multi-year structure in place. For the moment, a join based on >= year makes sense since the data is only downloaded once. This seems to be our standard approach for sources where there isn't a full expectation of a yearly download (e.g. census).
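
To make that matching rule concrete, here's a rough pandas sketch of "for each tract year, take the most recent upload year that doesn't exceed it." The table and column names are made up, and the real implementation would be a dbt SQL model, not pandas:

```python
import pandas as pd

# Hypothetical stand-ins for the warehouse tables: one row per tract year on
# the left, one row per ARI upload year on the right.
tracts = pd.DataFrame(
    {"tract_year": [2022, 2023, 2024, 2025], "geoid": ["17031010100"] * 4}
)
ari = pd.DataFrame({"upload_year": [2023, 2025], "ari_score": [0.42, 0.47]})

# For each tract_year, take the largest upload_year that does not exceed it.
# (Pass allow_exact_matches=False if the rule should be strictly less than.)
joined = pd.merge_asof(
    tracts.sort_values("tract_year"),
    ari.sort_values("upload_year"),
    left_on="tract_year",
    right_on="upload_year",
    direction="backward",
)
print(joined)  # tract_year 2022 has no prior upload, so it comes back NaN
```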

I also modified the script to directly query the URL rather than the approach I had before of identifying the xlsx file on the page, since that script / process seemed too ambiguous.
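
For reference, the direct-download pattern is roughly the following (the URL and variable names here are placeholders, not the real ARI link or the script's actual code):

```python
import io

import pandas as pd
import requests

# Placeholder URL -- the real script points at the published ARI workbook.
ARI_XLSX_URL = "https://example.org/ari/ari-2023.xlsx"

response = requests.get(ARI_XLSX_URL, timeout=60)
response.raise_for_status()

# Read the workbook straight from the response bytes instead of scraping the
# landing page for a download link.
ari = pd.read_excel(io.BytesIO(response.content))
```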

I included etl/scripts-ccao-data-warehouse-us-east-1/export/export-geojson.R because IHS was included in it, but then ran into issues where the function read_geoparquet_sf was deprecated. Billy said you are aware of it, and I expect these changes will just be reverted.

wrridgeway commented 1 month ago

@jeancochrane We got dbt to build successfully, but we're running into some linter errors related to caching and GitHub Actions that I'm not very familiar with.

jeancochrane commented 1 month ago

@wrridgeway I think the cache permissions warning is a red herring, since it's just a warning. This is the line that I think is the culprit for the failing linter:

etl/scripts-ccao-data-raw-us-east-1/housing/housing-ari.py:72: error: Value of type "Optional[str]" is not indexable  [index]

The issue there is that os.environ.get() can return None if the env var isn't set, in which case attempting to slice it with [5:] would raise a TypeError.
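
A minimal sketch of the guard (the env var name and the reason for the slice are assumptions on my part, not taken from the script):

```python
import os

# os.environ.get() is typed Optional[str]; slicing None blows up at runtime
# and mypy flags it statically. Guard it (or use os.environ[...] to fail loudly):
bucket = os.environ.get("AWS_S3_RAW_DATA_BUCKET")  # env var name is a guess
if bucket is None:
    raise ValueError("AWS_S3_RAW_DATA_BUCKET must be set")
key_prefix = bucket[5:]  # the [5:] slice from the script, now safe to apply
```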

wrridgeway commented 1 month ago

Thanks, I should have realized. I was addressing that same issue elsewhere.