Chicago / sdp-feature-fetcher

Software to extract and calculate features from various data sources for use in smart data platform (https://chicago.github.io/smart-data-platform/)
MIT License
0 stars 0 forks source link

Density metrics - Window calculations #4

Open geneorama opened 7 years ago

geneorama commented 7 years ago

Originally in the food inspection model we calculated a custom heatmap for each inspection one at a time. The calculation actually only calculates one z value for the single point for one inspection at a time, but it has to calculate a normal density value for each event in the x and y direction, scale the vectors and aggregate the values. (So, although there is only a single numeric outcome there are a lot of costly calculations behind each calculation).

In the app for the Lucky Strike project I was looking at a single sample of inspections over 90 days. I treated those inspections as if they had all happened on a single day, so for the window prior to that single point in time all the event data was summarized into a 2 dimensional KDE that spanned the city. From that KDE calculation any inspection could be looked up within the grid (I also wrote something to interpolate between the four nearest points).

The second approach made it possible to quickly explore a large range of parameters. It was also possible to test the effect of changing the fineness of the grid size. I was surprised by how coarse I could keep the grid and maintain the same statistical significance, but I don't remember the best grid size right now.

From the food inpection model (the first approach) the key steps are here (source):

## Create index values for pages
N <- nrow(inspections)
START_ROWS <- seq(1, N, page_limit)
END_ROWS <- c(seq(1, N, page_limit)[-1] - 1, N)
II <- mapply(`:`, START_ROWS, END_ROWS)

## Within each processing block, for each Inspection_ID, 
## Calculate the KDE for that inspection based on citywide Lat / Lon values
## for whatever event. 
rbindlist(lapply(II, function(ii) {
    foverlaps(    
        x = inspections[i = ii, 
                        j = list(Inspection_ID, 
                                 Latitude, 
                                 Longitude), 
                        keyby = list(start = Inspection_Date - window, 
                                     end = Inspection_Date)],
        y = observed_values[i = TRUE, 
                            j = list(Latitude, Longitude),
                            keyby = list(start = Date,  end = Date)],
        type = "any")[ , kde(new=c(i.Latitude[1], i.Longitude[1]), 
                             x = Latitude, 
                             y = Longitude, 
                             h = c(.01, .01)),
                       keyby = Inspection_ID]}))

The main computation of the KDE calculation is carried out just for the new point (source):

nx <- length(x)
h <- h/4
ax <- (new[1] - x) / h[1L]
ay <- (new[2] - y) / h[2L]
z <- tcrossprod(matrix(dnorm(ax), , nx), 
                matrix(dnorm(ay), , nx)) / (nx * h[1L] * h[2L])

The basis for this code appears to be MASS::kde2d, but there is a key difference. The function kde2d returns the entire density grid.

geneorama commented 7 years ago

I was going to include the details for the second approach, but after talking to @tomschenkjr I'm going focus on the first approach.

geneorama commented 7 years ago

Working example to test calculations (which steps make which calculations)

## Example data
x = c(.5, 1, 1.25, 1.25, 1, 1.5, 2.5, 1 )
y = c(.5, .5, .5, 1, 1.5, 1.5, 1.5, 2)
df_new <- data.frame(x = c(0.5, 1.0, 1.5, 3.0),
                     y = c(1.0, 1.0, 1.0, 2.5))
new <- unlist(df_new[3, ])
h = c(1, 1)
h <- h/4

## kde from food inspection
nx <- length(x)
ax <- (new[1]-x) / h[1L]
ay <- (new[2]-y) / h[2L]
z0 <- tcrossprod(matrix(dnorm(ax), , nx), 
                 matrix(dnorm(ay), , nx)) / (nx * h[1L] * h[2L])
z0
##           [,1]
## [1,] 0.2739752

Second approach:

## kde2d
MASS::kde2d
## function (x, y, h, n = 25, lims = c(range(x), range(y))) 
## {
##     nx <- length(x)
##     if (length(y) != nx) 
##         stop("data vectors must be the same length")
##     if (any(!is.finite(x)) || any(!is.finite(y))) 
##         stop("missing or infinite values in the data are not allowed")
##     if (any(!is.finite(lims))) 
##         stop("only finite values are allowed in 'lims'")
##     n <- rep(n, length.out = 2L)
##     gx <- seq.int(lims[1L], lims[2L], length.out = n[1L])
##     gy <- seq.int(lims[3L], lims[4L], length.out = n[2L])
##     h <- if (missing(h)) 
##         c(bandwidth.nrd(x), bandwidth.nrd(y))
##     else rep(h, length.out = 2L)
##     if (any(h <= 0)) 
##         stop("bandwidths must be strictly positive")
##     h <- h/4
##     ax <- outer(gx, x, "-")/h[1L]
##     ay <- outer(gy, y, "-")/h[2L]
##     z <- tcrossprod(matrix(dnorm(ax), , nx), matrix(dnorm(ay), 
##         , nx))/(nx * h[1L] * h[2L])
##     list(x = gx, y = gy, z = z)
## }
## <bytecode: 0x04710450>
## <environment: namespace:MASS>

n = 25
lims = c(range(x), range(y)) 

nx <- length(x)
n <- rep(n, length.out = 2L)
gx <- seq.int(lims[1L], lims[2L], length.out = n[1L])
gy <- seq.int(lims[3L], lims[4L], length.out = n[2L])
ax <- outer(gx, x, "-") / h[1L]
ay <- outer(gy, y, "-") / h[2L]
z <- tcrossprod(matrix(dnorm(ax), , nx), 
                matrix(dnorm(ay), , nx)) / (nx * h[1L] * h[2L])
z

here z is a 25 by 25 output, and ax and ay are matricies rather than single vectors as in the first approach.

## plot
plot(x,y, xlim=c(0,3), ylim = c(0,3), pch = "X", col = "blue")
points(new["x"], new["y"], pch = "O", col = "red")
contour(x=gx, y=gy, z, nlevels=5, xlim=c(0,3), ylim = c(0,3), add = TRUE)

image