ccao-data / model-res-avm

Automated valuation model for all class 200 residential properties in Cook County (except vacant land and condos)
GNU Affero General Public License v3.0
20 stars 3 forks source link

Comps check #224

Closed ccao-jardine closed 3 months ago

ccao-jardine commented 3 months ago

In the Rogers Park desk review spreadsheet, occasionally comp1 and comp2 will share the same PIN and sale price but different comp score. Example is for subject PIN 11304200220000. My hypothesis: these are PINs that sold twice, hence the different comp scores, but the sale price printed in the sheet is the PIN's most recent sale price.

Is that expected/desired behavior, and at what stage of the pipeline does it arise (later for the desk review spreadsheet, or earlier in the comps algorithm)?

Let's identify where this occurs and determine whether it's something to correct.

ccao-jardine commented 3 months ago

Thanks @jeancochrane for investigating. Pasting our discussion here for @julia-klauss and @dfsnow and closing for now.

Query to check dupe comps:

WITH unpivoted_comps AS (
    SELECT pin, comp_pin
    FROM (
        SELECT
            pin,
            ARRAY[
                comp_pin_1, comp_pin_2, comp_pin_3, comp_pin_4, comp_pin_5,
                comp_pin_6, comp_pin_7, comp_pin_8, comp_pin_9, comp_pin_10,
                comp_pin_11, comp_pin_12, comp_pin_13, comp_pin_14, comp_pin_15,
                comp_pin_16, comp_pin_17, comp_pin_18, comp_pin_19, comp_pin_20
            ] AS comp_array
        FROM model.comp
        WHERE run_id = '2024-03-17-stupefied-maya'
    ) AS comp_data
    CROSS JOIN UNNEST(comp_array) AS comp_pin(comp_pin)
)
SELECT pin, COUNT(comp_pin) AS num_comps, COUNT(DISTINCT comp_pin) as num_distinct_comps 
FROM unpivoted_comps
GROUP BY pin
HAVING COUNT(DISTINCT comp_pin) < COUNT(comp_pin);

What we learned:

Therefore, when analyzing comps, we should take care to not assume distinct rows for PINs in the training data and in the comps. Such that: