Just scanning the 2016 file, I found the following entries:
May 2 ♥Spacious Home for Rent!♥ $390 / 3br - 1200ft2 - (san jose) pic map
May 2 ♥Spacious Home for Rent!♥ $390 / 3br - 1200ft2 - (san jose) pic map
May 1 efficiency studio available now! $99 deposit! $2885 / 450ft2 - (nob hill) pic map
May 1 jr. 1 BD. Washer & Dryer in unit! $99 deposit $3250 / 1br - 550ft2 - (nob hill) pic map
May 1 $99 Deposit- Text us for more info!!! $2830 / 405ft2 - (nob hill) pic map
Apr 29 Exceptional Pacific Heights TIC $799000 / 2br - (Pacific Heights) pic
Apr 29 Awesome 5 Bedroom Available $800 / 5br - 3895ft2 - (2483 N Smiderle, San Bernardino, CA) pic
The first two are in San Jose, and the same listing appears twice. The three Nob Hill ones get listed as $99 (the deposit, not the rent) by the "extract-craigslist" and "calc-medians" scripts. The last one is not in San Francisco.
Do you deduplicate or strip these out anywhere before doing analysis on them? I understand you can work around these issues a little bit by taking the median, but I do worry especially about overreporting at the low end.
Here's a script I used to work around these problems a little bit. I need to add deduplication to it.
package main

import (
	"bufio"
	"flag"
	"fmt"
	"log"
	"os"
	"regexp"
	"sort"
	"strconv"

	"github.com/kevinburke/housing-inventory-analysis/stats"
)

// parseRx matches dollar amounts such as "$99" or "$2885".
var parseRx = regexp.MustCompile(`\$[0-9]{2,10}`)

// getPrice picks one price from the dollar amounts found on a line.
// It returns -1 if the line has no usable price.
func getPrice(linePrices []string) int {
	if len(linePrices) == 0 {
		return -1
	}
	prices := make([]int, len(linePrices))
	for i := range linePrices {
		if len(linePrices[i]) < 2 {
			panic("too short: " + linePrices[i])
		}
		// Drop the leading "$" before parsing.
		price, err := strconv.Atoi(linePrices[i][1:])
		if err != nil {
			panic(err)
		}
		prices[i] = price
	}
	if len(linePrices) == 1 {
		return prices[0]
	}
	// Both of the first two amounts are under $200, so neither looks like a rent.
	if prices[0] < 200 && prices[1] < 200 {
		return -1
	}
	// Otherwise take the larger of the first two amounts, so a "$99 deposit"
	// does not win over the actual rent.
	if prices[1] > prices[0] {
		return prices[1]
	}
	return prices[0]
}

func main() {
	flag.Parse()
	f, err := os.Open(flag.Arg(0))
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	bs := bufio.NewScanner(f)
	prices := make([]float64, 0)
	for bs.Scan() {
		linePrices := parseRx.FindAllString(bs.Text(), -1)
		if len(linePrices) > 0 {
			price := getPrice(linePrices)
			if price < 0 || price > 100000 {
				// sf is expensive, but not *that* expensive
				continue
			}
			prices = append(prices, float64(price))
		}
	}
	if err := bs.Err(); err != nil {
		log.Fatal(err)
	}

	sort.Float64s(prices)
	vals := stats.Sample{Xs: prices}
	fmt.Printf("Total rows: %d\n", len(prices))
	// Print the 10th through 90th percentiles.
	for i := float64(1); i <= 9; i++ {
		fmt.Printf("%dth %%ile: %v\n", int(i)*10, vals.Percentile(0.1*i))
	}
}
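For the deduplication I mentioned, something like the following might be enough as a first pass. It is only a sketch: `dedupeLines` is a name I made up, and it assumes that duplicates are byte-for-byte identical lines (after trimming whitespace), like the two San Jose entries above. A listing reposted on a different date or with a tweaked title would not be caught.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// dedupeLines is a hypothetical helper, not part of the existing scripts.
// It returns lines with exact repeats removed, keeping the first occurrence
// of each and preserving order. Lines are compared after trimming
// surrounding whitespace; blank lines are dropped.
func dedupeLines(lines []string) []string {
	seen := make(map[string]struct{}, len(lines))
	out := make([]string, 0, len(lines))
	for _, line := range lines {
		key := strings.TrimSpace(line)
		if key == "" {
			continue
		}
		if _, ok := seen[key]; ok {
			continue
		}
		seen[key] = struct{}{}
		out = append(out, key)
	}
	return out
}

func main() {
	// Read listings from stdin and write the deduplicated lines to stdout.
	var lines []string
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, line := range dedupeLines(lines) {
		fmt.Println(line)
	}
}
```

The output could then be fed to the percentile script above in place of the raw file.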