ccao-data / model-res-avm

Automated valuation model for all class 200 residential properties in Cook County (except vacant land and condos)
GNU Affero General Public License v3.0
20 stars 3 forks source link

Update `ingest` stage to use `noctua` `unload = TRUE` option #227

Closed dfsnow closed 2 weeks ago

dfsnow commented 2 months ago

Now that https://github.com/DyfanJones/noctua/pull/215 is merged, we should update this repo to use the new option, preferably by setting it globally within noctua_options(unload = TRUE). This will require a bit of work to QC, as I expect it may result in minor changes to the ingested data.

For example, the loc_tax_municipality_name column is stored as an array in Athena. It gets imported as a comma-separated string using the old method (which is then parsed back to a list). The new method using unload = TRUE natively supports arrays, and so imports loc_tax_municipality_name as a list. However, we should double-check that none of the values or treatments have changed.

Damonamajor commented 1 month ago

@dfsnow - Is there a particular output that you want for the value / treatment tests? I ran the ingest script twice, once with the unload = TRUE and once without the unload = TRUE. I then sorted the columns (they were ordered differently which is something to note), and used dataCompareR to test the different outputs. The only differences were related to small decimal disparities.

image

dfsnow commented 1 month ago

@dfsnow - Is there a particular output that you want for the value / treatment tests? I ran the ingest script twice, once with the unload = TRUE and once without the unload = TRUE. I then sorted the columns (they were ordered differently which is something to note), and used dataCompareR to test the different outputs. The only differences were related to small decimal disparities.

image

@Damonamajor As mentioned in the issue, I would specifically focus on the categorical columns that are stored in Athena as arrays. That's mostly anything with a loc_tax_ prefix. I'm worried that the string processing we set up for the previous method may interfere with the new one and give us different results.

Damonamajor commented 1 month ago

@dfsnow - Is there a particular output that you want for the value / treatment tests? I ran the ingest script twice, once with the unload = TRUE and once without the unload = TRUE. I then sorted the columns (they were ordered differently which is something to note), and used dataCompareR to test the different outputs. The only differences were related to small decimal disparities. image

@Damonamajor As mentioned in the issue, I would specifically focus on the categorical columns that are stored in Athena as arrays. That's mostly anything with a loc_tax_ prefix. I'm worried that the string processing we set up for the previous method may interfere with the new one and give us different results.

Those columns didn't ping anything, but I'll take a deeper dive into them.