ccao-data / data-architecture

Codebase for CCAO data infrastructure construction and management
https://ccao-data.github.io/data-architecture/

Ingest distressed communities index dci for modeling #472

Open Damonamajor opened 1 month ago

Damonamajor commented 1 month ago

One of the linting errors is that I don't have write privileges to the necessary bucket, which I'm not going to be able to change. The other is for the following line. As I understand it, because we are using a .Renviron file, it's not exactly the same, so we cut out the first 5 digits. This was what Billy recommended rather than creating a new .env file.

wrridgeway commented 3 weeks ago

@Damonamajor I've got one minor request for changes. Once that's done, can we do a row count comparison for vw_pin_shared_input between the master and dev branches, and check how much missingness there is in the dci column for the two input views and whether it matches what we expect?
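
A minimal sketch of the two checks requested above, assuming the rows of vw_pin_shared_input from each branch have already been pulled into lists of dicts. The helper names and the sample rows are hypothetical; in practice these would more likely be queries run against each branch's copy of the view.

```python
def row_count_diff(master_rows, dev_rows):
    """Return (master_count, dev_count, dev_count - master_count)."""
    n_master, n_dev = len(master_rows), len(dev_rows)
    return n_master, n_dev, n_dev - n_master


def missing_share(rows, column="dci"):
    """Share of rows where `column` is absent or None; 0.0 for empty input."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


# Hypothetical stand-ins for the view on each branch (not real data).
master = [{"pin": "01", "year": 2023}, {"pin": "02", "year": 2023}]
dev = [
    {"pin": "01", "year": 2023, "dci": 41.2},
    {"pin": "02", "year": 2023, "dci": None},
]

counts = row_count_diff(master, dev)   # a nonzero difference means the join changed the row count
dci_missing = missing_share(dev)       # compare against the missingness we expect
```

A nonzero difference in the row counts would suggest the new join is duplicating or dropping rows.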

Damonamajor commented 3 weeks ago

@wrridgeway The join that you recommended works, but it gets even wonkier. I have it up and running, but because of how the census data gets uploaded, the process intended to forward fill ends up back filling. It should even out if we keep the data up to date. This is because we have the most recent data for DCI (2024). Because we then join to the max year of census data (2023), we keep data coded for that year. As soon as the census data gets updated to 2024, the max would update, 2023 would become NULL, and the 2024 data would be included in the correct year. To make things slightly more complicated, our most recent value for 'census_data_year' is 2023, but for 'census_acs5_data_year' it is 2022. As far as I know, there's not really a difference between them, but it is something to note.
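
To make the behavior described above concrete, here is a rough sketch of a join that attaches DCI only to the most recent census year. The function name, the `zcta` key, and the sample values are assumptions for illustration and are not taken from the actual view definition.

```python
def join_dci_to_latest_year(census_rows, dci_by_zcta):
    """Attach DCI only to rows from the max census year; other years get None."""
    latest_year = max(row["year"] for row in census_rows)
    joined = []
    for row in census_rows:
        if row["year"] == latest_year:
            dci = dci_by_zcta.get(row["zcta"])
        else:
            dci = None
        joined.append({**row, "dci": dci})
    return joined


# Hypothetical sample data (not real DCI scores).
dci_by_zcta = {"60601": 12.5}
census_now = [
    {"zcta": "60601", "year": 2022},
    {"zcta": "60601", "year": 2023},
]
census_later = census_now + [{"zcta": "60601", "year": 2024}]

before = join_dci_to_latest_year(census_now, dci_by_zcta)
after = join_dci_to_latest_year(census_later, dci_by_zcta)
```

In `before`, DCI lands on 2023; once a 2024 census row exists, `after` carries DCI only on 2024 and the 2023 value becomes None, which is the shift described above.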

wrridgeway commented 3 weeks ago

I think the join is working as anticipated; I only see data for 2023. It's joining onto the most recent census zcta data we have, which is 2023 for now. Once that becomes 2024 it will join onto that and then stay there. Is this the best way to join the data on? Probably not - but I don't know if this is data we anticipate continuing to gather, and it doesn't feel like data we should back fill?

Ultimately, I'm leaning towards including another column when we ingest this that makes it clear it uses 2022 census data and that we should join on 2022 geoids and forward fill only. 2024 is when the data was made available, but the data isn't describing 2024. Thoughts @dfsnow ?
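
A sketch of what "join on 2022 geoids and forward fill only" could look like, assuming geoids are comparable across years and that DCI is keyed by geoid. The function name, the `geoid` key, and the sample values are hypothetical.

```python
def forward_fill_dci(census_rows, dci_by_geoid, source_year=2022):
    """Attach DCI to rows from `source_year` onward; earlier years stay None."""
    joined = []
    for row in census_rows:
        if row["year"] >= source_year:
            # Forward fill: every year at or after the source year reuses
            # the value keyed on the source-year geoid.
            dci = dci_by_geoid.get(row["geoid"])
        else:
            # No back fill: years before the underlying census data get nothing.
            dci = None
        joined.append({**row, "dci": dci})
    return joined


# Hypothetical sample data (not real DCI scores).
dci_2022 = {"60601": 12.5}
census_rows = [{"geoid": "60601", "year": y} for y in (2021, 2022, 2023, 2024)]

filled = forward_fill_dci(census_rows, dci_2022)
```

Unlike a join on the max census year, the result here does not change when a new census year is added: 2021 stays empty and 2022 onward keep the same value.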

dfsnow commented 1 week ago

@wrridgeway @Damonamajor Sorry I missed this. I would do as @wrridgeway suggests and forward fill from 2022.

Damonamajor commented 1 week ago

@dfsnow So to clarify, we want one column for year, which represents the year of the underlying census data, and one column which represents either the year of download (2024) or the year of their data construction (2023)?

dfsnow commented 1 week ago

One column (year) for the underlying census data and another for year of construction, similar to the _data_year column in the proximity tables.
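
As a rough illustration of the two-column layout, assuming the ingest step tags each DCI row before upload. The column name `dci_data_year` is a hypothetical name following the `_data_year` pattern, and the years come from the discussion above.

```python
def add_year_columns(dci_rows, census_year, construction_year):
    """Tag each row with the census year it describes and the year it was built."""
    return [
        {**row, "year": census_year, "dci_data_year": construction_year}
        for row in dci_rows
    ]


# Hypothetical sample row (not a real DCI score).
raw = [{"geoid": "60601", "dci": 12.5}]
tagged = add_year_columns(raw, census_year=2022, construction_year=2023)
```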