e-n-f / housing-inventory

San Francisco housing construction history and associated data
https://experimental-geography.blogspot.com/2016/05/employment-construction-and-cost-of-san.html

Data consistency issues #3

Open kevinburke opened 6 years ago

kevinburke commented 6 years ago

Just scanning the 2016 file, I found the following entries:

May 2 ♥Spacious Home for Rent!♥ $390 / 3br - 1200ft2 - (san jose) pic map 
May 2 ♥Spacious Home for Rent!♥ $390 / 3br - 1200ft2 - (san jose) pic map 
May 1 efficiency studio available now! $99 deposit! $2885 / 450ft2 - (nob hill) pic map 
May 1 jr. 1 BD. Washer & Dryer in unit! $99 deposit $3250 / 1br - 550ft2 - (nob hill) pic map 
May 1 $99 Deposit- Text us for more info!!! $2830 / 405ft2 - (nob hill) pic map 
Apr 29 Exceptional Pacific Heights TIC $799000 / 2br - (Pacific Heights) pic
Apr 29 Awesome 5 Bedroom Available $800 / 5br - 3895ft2 - (2483 N Smiderle, San Bernardino, CA) pic

The first two are in San Jose, and the same listing appears twice. The three Nob Hill entries get listed as $99 by the "extract-craigslist" and "calc-medians" scripts, presumably because the "$99 deposit" text comes before the actual rent. The last one is not in San Francisco.

Do you deduplicate or strip these out anywhere before doing analysis on them? I understand that taking the median works around these issues to some extent, but I worry especially about overreporting at the low end.

Here's a script I used to partially work around these problems. It still needs deduplication.

package main

import (
    "bufio"
    "flag"
    "fmt"
    "log"
    "os"
    "regexp"
    "sort"
    "strconv"

    "github.com/kevinburke/housing-inventory-analysis/stats"
)

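// parseRx matches a dollar sign followed by 2 to 10 digits, e.g. "$99" or "$2885".
var parseRx = regexp.MustCompile(`\$[0-9]{2,10}`)

// getPrice picks a price from the dollar amounts found on one listing line.
// It returns -1 if there are no amounts, or if the first two amounts are both
// under $200. With two or more amounts it returns the larger of the first two,
// so a "$99 deposit" ahead of the rent is not mistaken for the price.
// Amounts after the second are ignored.
func getPrice(linePrices []string) int {
    if len(linePrices) == 0 {
        return -1
    }
    prices := make([]int, len(linePrices))
    for i := range linePrices {
        if len(linePrices[i]) < 2 {
            panic("too short: " + linePrices[i])
        }
        // Strip the leading "$" before converting to an integer.
        price, err := strconv.Atoi(linePrices[i][1:])
        if err != nil {
            panic(err)
        }
        prices[i] = price
    }
    if len(linePrices) == 1 {
        return prices[0]
    }
    // Two small amounts on one line: treat the line as unusable.
    if prices[0] < 200 && prices[1] < 200 {
        return -1
    }
    if prices[1] > prices[0] {
        return prices[1]
    }
    return prices[0]
}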

func main() {
    flag.Parse()
    f, err := os.Open(flag.Arg(0))
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    bs := bufio.NewScanner(f)
    prices := make([]float64, 0)
    for bs.Scan() {
        linePrices := parseRx.FindAllString(bs.Text(), -1)
        if len(linePrices) > 0 {
            price := getPrice(linePrices)
            if price < 0 || price > 100000 {
                // sf is expensive, but not *that* expensive
                continue
            }
            prices = append(prices, float64(price))
        }
    }
    if err := bs.Err(); err != nil {
        log.Fatal(err)
    }
    sort.Float64s(prices)
    vals := stats.Sample{Xs: prices}
    fmt.Printf("Total rows: %d\n", len(prices))
    for i := float64(1); i <= 9; i++ {
        fmt.Printf("%dth %%ile: %v\n", int(i)*10, vals.Percentile(0.1*i))
    }
}
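
The deduplication that the script above does not yet do could look roughly like the sketch below. It assumes that a duplicate means the exact same listing line (ignoring surrounding whitespace) and keeps the first occurrence. The `dedupe` helper is hypothetical and is not part of the repository or of the script above.

```go
package main

import (
	"fmt"
	"strings"
)

// dedupe returns lines with exact repeats removed, keeping the first
// occurrence of each. Lines are compared after trimming surrounding
// whitespace, so copies that differ only by trailing spaces still match.
// This is an illustrative helper, not part of the original script.
func dedupe(lines []string) []string {
	seen := make(map[string]struct{}, len(lines))
	out := make([]string, 0, len(lines))
	for _, line := range lines {
		key := strings.TrimSpace(line)
		if _, ok := seen[key]; ok {
			continue // already saw this listing
		}
		seen[key] = struct{}{}
		out = append(out, line)
	}
	return out
}

func main() {
	lines := []string{
		"May 2 ♥Spacious Home for Rent!♥ $390 / 3br - 1200ft2 - (san jose) pic map ",
		"May 2 ♥Spacious Home for Rent!♥ $390 / 3br - 1200ft2 - (san jose) pic map",
		"Apr 29 Exceptional Pacific Heights TIC $799000 / 2br - (Pacific Heights) pic",
	}
	for _, l := range dedupe(lines) {
		fmt.Println(l)
	}
}
```

Matching on the whole line is a deliberately conservative choice: it would also collapse two genuinely distinct units advertised with identical text on the same day, and it would miss reposts whose date or title changed slightly.
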

e-n-f commented 6 years ago

Thanks for the script! I was not doing any filtering on the files, so you have probably found some errors.
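
One possible form of such filtering is to drop listings whose location is clearly outside San Francisco. The sketch below assumes the location is the last parenthesized group on the line, as in the sample entries in this thread. The `location` and `outsideSF` helpers and the `excluded` list are hypothetical: the list contains only the two cities seen in the examples above and is far from complete.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// locationRx matches one parenthesized group, e.g. "(nob hill)".
var locationRx = regexp.MustCompile(`\(([^()]*)\)`)

// excluded is an illustrative, incomplete list of place names that
// indicate a listing outside San Francisco (assumption, not real data).
var excluded = []string{"san jose", "san bernardino"}

// location returns the lowercased text of the last parenthesized group
// on the line, or "" if the line has none.
func location(line string) string {
	m := locationRx.FindAllStringSubmatch(line, -1)
	if len(m) == 0 {
		return ""
	}
	return strings.ToLower(strings.TrimSpace(m[len(m)-1][1]))
}

// outsideSF reports whether the line's location mentions an excluded place.
func outsideSF(line string) bool {
	loc := location(line)
	for _, name := range excluded {
		if strings.Contains(loc, name) {
			return true
		}
	}
	return false
}

func main() {
	lines := []string{
		"May 2 ♥Spacious Home for Rent!♥ $390 / 3br - 1200ft2 - (san jose) pic map",
		"May 1 efficiency studio available now! $99 deposit! $2885 / 450ft2 - (nob hill) pic map",
		"Apr 29 Awesome 5 Bedroom Available $800 / 5br - 3895ft2 - (2483 N Smiderle, San Bernardino, CA) pic",
	}
	for _, l := range lines {
		if outsideSF(l) {
			continue // skip listings that are clearly not in San Francisco
		}
		fmt.Println(l)
	}
}
```

A denylist like this only catches places someone has thought to add. An allowlist of San Francisco neighborhood names would be stricter, at the cost of dropping listings with unusual or missing location text.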