Closed ccao-jardine closed 3 months ago
Thanks @jeancochrane for investigating. Pasting our discussion here for @julia-klauss and @dfsnow and closing for now.
Query to check dupe comps:
WITH unpivoted_comps AS (
SELECT pin, comp_pin
FROM (
SELECT
pin,
ARRAY[
comp_pin_1, comp_pin_2, comp_pin_3, comp_pin_4, comp_pin_5,
comp_pin_6, comp_pin_7, comp_pin_8, comp_pin_9, comp_pin_10,
comp_pin_11, comp_pin_12, comp_pin_13, comp_pin_14, comp_pin_15,
comp_pin_16, comp_pin_17, comp_pin_18, comp_pin_19, comp_pin_20
] AS comp_array
FROM model.comp
WHERE run_id = '2024-03-17-stupefied-maya'
) AS comp_data
CROSS JOIN UNNEST(comp_array) AS comp_pin(comp_pin)
)
SELECT pin, COUNT(comp_pin) AS num_comps, COUNT(DISTINCT comp_pin) as num_distinct_comps
FROM unpivoted_comps
GROUP BY pin
HAVING COUNT(DISTINCT comp_pin) < COUNT(comp_pin);
What we learned:
Therefore, when analyzing comps, we should take care to not assume distinct rows for PINs in the training data and in the comps. Such that:
if a PIN only appears once in a comp set, it's reasonably a safe assumption to select its most recent sale.
if a PIN appears more than once in a comp set, assign the most recent sale to the comp with the higher comp_score, then the next lowest comp_score to the next most recent sale
In the Rogers Park desk review spreadsheet, occasionally comp1 and comp2 will share the same PIN and sale price but different comp score. Example is for subject PIN 11304200220000. My hypothesis: these are PINs that sold twice, hence the different comp scores, but the sale price printed in the sheet is the PIN's most recent sale price.
Is that expected/desired behavior, and at what stage of the pipeline does it arise (later for the desk review spreadsheet, or earlier in the comps algorithm)?
Let's identify where this occurs and determine whether it's something to correct.