ccao-data / data-architecture

Codebase for CCAO data infrastructure construction and management
https://ccao-data.github.io/data-architecture/

Remove common area from nonlivable modeling group #452

Closed by wrridgeway 1 month ago

wrridgeway commented 1 month ago

I was going to handle this within the modeling pipeline, but it needs to be addressed here as well.

wrridgeway commented 1 month ago

I ran the ingest stage of the condo model pipeline with both model.vw_pin_condo_input (old) and z_ci_152_determine_parking_spacecommon_area_flag_hierarchy_model.vw_pin_condo_input (new).

training_data

# ingest new and old training data
> new <- read_parquet("input_new/training_data.parquet") %>%
      filter(meta_modeling_group == "NONLIVABLE")
> old <- read_parquet("input/training_data.parquet") %>%
      filter(meta_modeling_group == "NONLIVABLE")

# filter only sales from training data that are no longer considered common area
> changed <- old %>%
      filter(!(meta_pin %in% new$meta_pin))

> nrow(changed)
[1] 9

> summary(changed$meta_sale_price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 100000  163000  315523  305444  350000  723500 

Given their sale prices, it was probably a mistake to treat these as common areas in the first place. Either way, only 9 sales in the training data are affected.

assessment_data

> new <- read_parquet("input_new/assessment_data.parquet") %>%
      filter(meta_modeling_group == "NONLIVABLE")
> old <- read_parquet("input/assessment_data.parquet") %>%
      filter(meta_modeling_group == "NONLIVABLE")

> changed <- old %>%
      filter(!(meta_pin %in% new$meta_pin))

> nrow(changed)
[1] 223

> length(unique(changed$meta_pin10))
[1] 51

So we've got 223 units from 51 different buildings that were previously considered NONLIVABLE that are now treated as normal condo units for assessment. 131 of these 223 units have a non-null value for char_bedroom, which probably should have given us cause for concern in the past either about the characteristics for these units or their status as common areas.

A very small number of parcels is affected by this change.
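The counts above (changed units, distinct buildings, units with a non-null char_bedroom) can be reproduced in one pass. The following is a minimal sketch, not the code actually run for this comment: the toy tibbles and their values are made up for illustration, and only the column names meta_pin, meta_pin10, meta_modeling_group, and char_bedroom come from the thread.

```r
library(dplyr)

# Toy stand-ins for the old and new assessment data (all values are made up)
old <- tibble(
  meta_pin = c("10000000001001", "10000000001002",
               "20000000001001", "30000000001001"),
  meta_pin10 = substr(meta_pin, 1, 10),
  meta_modeling_group = "NONLIVABLE",
  char_bedroom = c(2, NA, 1, NA)
)

# Assume only the last unit is still NONLIVABLE under the new view
new <- old %>%
  filter(meta_pin == "30000000001001")

# Units that were NONLIVABLE in the old input but are not in the new one
changed <- old %>%
  anti_join(new, by = "meta_pin")

# Units, distinct buildings, and units with a recorded bedroom count
summary_tbl <- changed %>%
  summarise(
    n_units = n(),
    n_buildings = n_distinct(meta_pin10),
    n_with_bedrooms = sum(!is.na(char_bedroom))
  )

print(summary_tbl)
```

anti_join() gives the same rows as filter(!(meta_pin %in% new$meta_pin)) here and keeps the comparison key explicit.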

wrridgeway commented 1 month ago

> 131 of these 223 units have a non-null value for char_bedroom, which probably should have given us cause for concern in the past either about the characteristics for these units or their status as common areas.

> @wrridgeway Can we add a dbt data test that throws an error if a parking space or common area also has characteristics?

Done, but it's not particularly encouraging.
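For reference, the row-level condition such a test checks can be sketched in R. This is not the dbt test that was added. The flag columns is_parking_space and is_common_area, the char_unit_sf column, and all values below are assumptions for illustration; only char_bedroom appears in this thread. A dbt data test fails when its query returns rows, so the rows returned here correspond to what would be reported as failures.

```r
library(dplyr)

# Hypothetical unit-level data: flag columns and values are assumed
units <- tibble(
  meta_pin = c("A", "B", "C", "D"),
  is_parking_space = c(TRUE, FALSE, FALSE, TRUE),
  is_common_area = c(FALSE, TRUE, FALSE, FALSE),
  char_bedroom = c(NA, 2, 3, NA),
  char_unit_sf = c(NA, NA, 1200, 150)
)

# A row is a violation if it is flagged as a parking space or common area
# and has at least one non-missing characteristic column
violations <- units %>%
  filter(
    is_parking_space | is_common_area,
    if_any(starts_with("char_"), ~ !is.na(.x))
  )

print(violations)
```

In this toy data, unit B (a common area with a bedroom count) and unit D (a parking space with a unit square footage) are flagged, while A and C pass.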