More question on the proximity variable

Working on my final project over this weekend, I realized that I actually needed more information than simply adding a dummy variable for proximity, which I asked about in the previous issue post.

More specifically, for those stations in proximity with any CTA stop (i.e., proximity == 1 from the last issue post), I need to find out with which CTA stop(s) each Divvy Station is in proximity.

I think I may start with getting the data for Divvy stations with proximity == 1, using filter() and calling it DivvyStation_prox. This looks like the following:

>DivvyStation_prox
# A tibble: 466 × 7
      id                       name      lat       lon dpcapacity online_date proximity
   <int>                      <chr>    <dbl>     <dbl>      <int>       <chr>     <dbl>
1    456        2112 W Peterson Ave 41.99118 -87.68359         15   5/12/2015         1
2    109          900 W Harrison St 41.87468 -87.65002         19    8/6/2013         1
3     21 Aberdeen St & Jackson Blvd 41.87773 -87.65479         15   6/21/2013         1
4     80    Aberdeen St & Monroe St 41.88046 -87.65393         19   6/26/2013         1
5    341          Adler Planetarium 41.86610 -87.60727         19   10/9/2013         1
6    444       Albany Ave & 26th St 41.84448 -87.70201         19   4/29/2015         1
7    376  Artesian Ave & Hubbard St 41.88949 -87.68822         35   4/27/2015         1
8    208      Ashland Ave & 21st St 41.85381 -87.66590         15   8/15/2013         1
9    563      Ashland Ave & 63rd St 41.77937 -87.66484         15   6/20/2016         1
10   565      Ashland Ave & 66th St 41.77407 -87.66381         11   6/21/2016         1
# ... with 456 more rows

And, of course, I have the CTAStops data:

> CTAStops
# A tibble: 11,520 × 9
   stop_id stop_code                 stop_name                                               stop_desc stop_lat  stop_lon
     <int>     <int>                     <chr>                                                   <chr>    <dbl>     <dbl>
1        1         1 Jackson & Austin Terminal Jackson & Austin Terminal, Northeastbound, Bus Terminal 41.87632 -87.77410
2        2         2            5900 W Jackson      5900 W Jackson, Eastbound, Southside of the Street 41.87707 -87.77132
3        3         3          Jackson & Menard    Jackson & Menard, Eastbound, Southside of the Street 41.87696 -87.76975
4        4         4            5700 W Jackson      5700 W Jackson, Eastbound, Southside of the Street 41.87702 -87.76745
5        6         6           Jackson & Lotus            Jackson & Lotus, Eastbound, Southeast Corner 41.87651 -87.76145
6        7         7            5351 W Jackson      5351 W Jackson, Eastbound, Southside of the Street 41.87655 -87.75893
7        8         8        Jackson & Lockwood         Jackson & Lockwood, Eastbound, Southeast Corner 41.87656 -87.75731
8        9         9         Jackson & Laramie          Jackson & Laramie, Eastbound, Southeast Corner 41.87659 -87.75462
9       10        10      Jackson & Leamington       Jackson & Leamington, Eastbound, Southeast Corner 41.87663 -87.75317
10      11        11            5047 W Jackson      5047 W Jackson, Eastbound, Southside of the Street 41.87665 -87.75128
# ... with 11,510 more rows, and 3 more variables: location_type <int>, parent_station <chr>, wheelchair_boarding <int>

How can I proceed from here?

After creating distance_m, you don't want to convert this to a logical matrix. Instead, create a data frame with three columns:

The id for the Divvy station (id from DivvyStation)
The id for the CTA station that is closest to the Divvy station. Basically find which column has the minimum value for each row of distance_m. This could be a for loop, or some type of apply function that iterates over each row (you cannot just use rowSums for this). This code tells you which column in distance_m is the minimum distance for each row.
```
apply(distance_m, 1, function(x) which(min(x)))
```
From here you can use this to get the actual stop_id from CTAStops.
The actual distance between Divvy and the closest CTA station (i.e. the actual value in the cell from distance_m

With these three pieces of information you could join DivvyStation and CTAStops using the new data frame as the link between the two (use some sort of _join statement(s) like we learned last week). That gives you the joined table, then you could filter or create a new column that identifies if the closest CTA station is within 150 feet

@bensoltoff Thank you for your insight. The following is what I tried:

Divvy_m <- cbind(DivvyStation$lon, DivvyStation$lat)
CTA_m <- cbind(CTAStops$stop_lon, CTAStops$stop_lat)
distance_m <- distm(Divvy_m, CTA_m, fun = distHaversine) # a 535 by 11520 matrix for distance

distance150 <- distance_m <= 150 # check if the distance is <=150 meters or approximately 0.1 mile
proximity150 <- (rowSums(distance150) > 0)*1 # a Divvy station is <=150m from any CTA stop, 1; otherwise, 0 
DivvyStation$proximity <- proximity150

index150 <- which(distance_m <= 150, arr.ind = T) # matrix of indices where the distance is <= 150
for (i in range(1, ncol(index150))){ # switching the index number to id numbers
  Divvyindex <- index150[i,1]
  CTAindex <- index150[i,2]
  DivvyId <- DivvyStation$id[Divvyindex]
  CTAId <- CTAStops$stop_id[CTAindex]
  index150[i,1] <- DivvyId
  index150[i,2] <- CTAId
}
colnames(index150) <- c('id', 'stop_id') # matching the column names to those in `DivvyStation` and `CTAStops`
index150 <- index150 %>% as_data_frame()
test <- left_join(DivvyStation, index150)

It seems to work to some extent, although I am seeing mismatches: where the proximity is 1 but not getting any CTA stops matched, or 'proximity' is 0 but getting CTA stops matched. The following is the first 30 rows of the joined df:

> print(test, n = 30)
# A tibble: 1,990 × 8
      id                       name      lat       lon dpcapacity online_date proximity stop_id
   <int>                      <chr>    <dbl>     <dbl>      <int>       <chr>     <dbl>   <int>
1    456        2112 W Peterson Ave 41.99118 -87.68359         15   5/12/2015         1     748
2    456        2112 W Peterson Ave 41.99118 -87.68359         15   5/12/2015         1     811
3    101              63rd St Beach 41.78102 -87.57612         23   4/20/2015         0    1318
4    101              63rd St Beach 41.78102 -87.57612         23   4/20/2015         0    1319
5    101              63rd St Beach 41.78102 -87.57612         23   4/20/2015         0    1322
6    101              63rd St Beach 41.78102 -87.57612         23   4/20/2015         0   10581
7    101              63rd St Beach 41.78102 -87.57612         23   4/20/2015         0   11110
8    101              63rd St Beach 41.78102 -87.57612         23   4/20/2015         0   11111
9    101              63rd St Beach 41.78102 -87.57612         23   4/20/2015         0   11337
10   101              63rd St Beach 41.78102 -87.57612         23   4/20/2015         0   11338
11   101              63rd St Beach 41.78102 -87.57612         23   4/20/2015         0   11392
12   101              63rd St Beach 41.78102 -87.57612         23   4/20/2015         0   11502
13   109          900 W Harrison St 41.87468 -87.65002         19    8/6/2013         1    1283
14   109          900 W Harrison St 41.87468 -87.65002         19    8/6/2013         1    9937
15    21 Aberdeen St & Jackson Blvd 41.87773 -87.65479         15   6/21/2013         1     384
16    21 Aberdeen St & Jackson Blvd 41.87773 -87.65479         15   6/21/2013         1     419
17    21 Aberdeen St & Jackson Blvd 41.87773 -87.65479         15   6/21/2013         1    9710
18    21 Aberdeen St & Jackson Blvd 41.87773 -87.65479         15   6/21/2013         1   10084
19    21 Aberdeen St & Jackson Blvd 41.87773 -87.65479         15   6/21/2013         1   10085
20    80    Aberdeen St & Monroe St 41.88046 -87.65393         19   6/26/2013         1     116
21    80    Aberdeen St & Monroe St 41.88046 -87.65393         19   6/26/2013         1     200
22    80    Aberdeen St & Monroe St 41.88046 -87.65393         19   6/26/2013         1    7876
23   346   Ada St & Washington Blvd 41.88283 -87.66121         15  10/10/2013         0   11370
24   346   Ada St & Washington Blvd 41.88283 -87.66121         15  10/10/2013         0   11371
25   346   Ada St & Washington Blvd 41.88283 -87.66121         15  10/10/2013         0   11516
26   341          Adler Planetarium 41.86610 -87.60727         19   10/9/2013         1    3979
27   341          Adler Planetarium 41.86610 -87.60727         19   10/9/2013         1    3980
28   341          Adler Planetarium 41.86610 -87.60727         19   10/9/2013         1    4033
29   444       Albany Ave & 26th St 41.84448 -87.70201         19   4/29/2015         1    1649
30   444       Albany Ave & 26th St 41.84448 -87.70201         19   4/29/2015         1    1680
# ... with 1,960 more rows

I am not exactly sure where it went wrong. I will keep working on this. Thank you very much!

I don't think you're properly isolating the proximate stop. I don't have an API token to reproduce this. Can you push the data and your code as part of this repo? If so I can clone it and test some code.

@bensoltoff Dr. Soltoff. I have added the data and script for this issue. Please take a look at the fp-issue02.R script for reproducing what I have provided above in this issue thread. If I managed it correctly, you should not need any other script or data to make fp-issue02 to run. The code for creating an object DivvyStation_prox is not in the script, but what I did was a simple filtering, which should look like the following:

DivvyStation_prox <- DivvyStation %>% filter(proximity == 1)

Thank you very much for your help. I will be looking forward to your comment.

##---------------------------------------------------------------##
## This script reads Divvy station and CTA stop data into R, in  ##
## order to creat a tidy data frame for Divvy Station data frame ##
## with spatial variable. This is a mini script for the Issue 02 ##
##---------------------------------------------------------------##

library(tidyverse)
library(feather)
library(geosphere)

# Read the CTA stop file
CTAStops <- read_csv("data/stops.txt")

# Read the Divvy station file
DivvyStation <- read_csv("data/Divvy_Trips_2016_Q1Q2/Divvy_Stations_2016_Q1Q2.csv")
colnames(DivvyStation) <- c("id", "name", "lat", "lon", "dpcapacity", "online_date")

# adding a proximity variable to station data
Divvy_m <- cbind(DivvyStation$lon, DivvyStation$lat)
CTA_m <- cbind(CTAStops$stop_lon, CTAStops$stop_lat)
distance_m <- distm(Divvy_m, CTA_m, fun = distHaversine) # a 535 by 11520 matrix for distance

# find which CTA stations are the closest - note that for some Divvy stations,
# there may be multiple CTA stations the same minimum distance away
closest <- apply(distance_m, 1, function(x) which(x == min(x))) %>%
  # convert to data frame
  map_df(as_data_frame, .id = "id") %>%
  rename(Divvyrow = id,
         CTArow = value) %>%
  # obtain distance from distance_m
  mutate(Divvyrow = as.numeric(Divvyrow),
         distance = map2_dbl(Divvyrow, CTArow, function(x, y) distance_m[x,y]),
         close = if_else(distance <= 150, TRUE, FALSE))

# join DivvyStation and closest
## need to create a new DivvyStation column using row_number()
## to enable the join
DivvyCTA <- DivvyStation %>%
  mutate(Divvyrow = row_number()) %>%
  left_join(closest) %>%
  # now join with CTAStops - also create row_number() columns
  left_join(CTAStops %>%
              mutate(CTArow = row_number()))
str(DivvyCTA)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   607 obs. of  19 variables:
 $ id                 : int  456 101 109 21 80 346 341 444 511 376 ...
 $ name               : chr  "2112 W Peterson Ave" "63rd St Beach" "900 W Harrison St" "Aberdeen St & Jackson Blvd" ...
 $ lat                : num  42 41.8 41.9 41.9 41.9 ...
 $ lon                : num  -87.7 -87.6 -87.7 -87.7 -87.7 ...
 $ dpcapacity         : int  15 23 19 15 19 15 19 19 15 35 ...
 $ online_date        : chr  "5/12/2015" "4/20/2015" "8/6/2013" "6/21/2013" ...
 $ Divvyrow           : num  1 2 3 4 5 6 7 8 9 10 ...
 $ CTArow             : int  7774 6854 162 49 324 326 3440 4400 2848 9819 ...
 $ distance           : num  76.1 633.6 25.5 37.3 142.6 ...
 $ close              : logi  TRUE FALSE TRUE TRUE TRUE FALSE ...
 $ stop_id            : int  11458 10198 207 60 460 463 4877 6397 4051 15344 ...
 $ stop_code          : int  11458 10198 207 60 460 463 4877 6397 4051 15344 ...
 $ stop_name          : chr  "Peterson & Target (2100 W)" "Larabida Hospital" "900 W Harrison" "Jackson & Aberdeen" ...
 $ stop_desc          : chr  "Peterson & Target (2100 W), Eastbound, Southside of the Street" "Larabida Hospital, Northbound, Eastside of the Street" "900 W Harrison, Westbound, Northside of the Street" "Jackson & Aberdeen, Eastbound, Southeast Corner" ...
 $ stop_lat           : num  42 41.8 41.9 41.9 41.9 ...
 $ stop_lon           : num  -87.7 -87.6 -87.6 -87.7 -87.7 ...
 $ location_type      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ parent_station     : chr  NA NA NA NA ...
 $ wheelchair_boarding: int  1 1 1 1 1 1 1 1 1 1 ...

Note the use of row_number() to create columns in CTAStops and DivvyStations that enables the matching using closest as the bridge data frame. You cannot use the original id columns because they do not match to the row and column ids in distance_m.

@bensoltoff Thank you for pointing out the mismatch between the original ids and the row/column ids in distance_m. Because I wanted to keep all CTA stops instead of only minimum distance ones, I have tried a different approach. Here is my code:

# adding two proximity variables to station data
Divvy_m <- cbind(DivvyStation$lon, DivvyStation$lat)
CTA_m <- cbind(CTAStops$stop_lon, CTAStops$stop_lat)
distance_m <- distm(Divvy_m, CTA_m, fun = distHaversine) # a 535 by 11520 matrix for distance

distance150 <- distance_m <= 150 # check if the distance is <=150 meters or approximately 0.1 mile

proximity150 <- (rowSums(distance150) > 0)*1 # binary; a Divvy station is <=150m from any CTA stop, 1; otherwise, 0 
DivvyStation$proximity <- proximity150
proximity150_2 <- rowSums(distance150) # non-binary; number of close CTA stops
DivvyStation$prox_num <- proximity150_2

# create a data frame linking Divvy stataions and CTA stops close to each other
index150 <- which(distance150 == TRUE, arr.ind = T) # matrix of indices where the distance is <= 150
index150_df <- as_data_frame(index150)
index150_df <- index150_df[order(index150_df$row, index150_df$col),]
colnames(index150_df) <- c('Divvyindex', 'CTAindex')

DivvyStation$Divvyindex <- sequence(nrow(DivvyStation))
CTAStops$CTAindex <- sequence(nrow(CTAStops))
CTAindex <- CTAStops %>% select(stop_id, stop_name, CTAindex)

DivvyCTAProx <- DivvyStation %>%
  left_join(index150_df) %>% 
  left_join(CTAindex) %>%
  select(id, name, proximity, prox_num, stop_id, stop_name)

I added above the column for the number of close CTA stops (prox_num) to see to verify the result of linking Divvy stations and CTA stops. Fortunately, the result appears to be what I expected:

> print(DivvyCTAProx, n = 20)
# A tibble: 2,094 × 6
      id                          name proximity prox_num stop_id                   stop_name
   <int>                         <chr>     <dbl>    <dbl>   <int>                       <chr>
1    456           2112 W Peterson Ave         1        4   11457           Rosehill Cemetery
2    456           2112 W Peterson Ave         1        4   11458  Peterson & Target (2100 W)
3    456           2112 W Peterson Ave         1        4   11483             2100 W Peterson
4    456           2112 W Peterson Ave         1        4   11484           Rosehill Cemetery
5    101                 63rd St Beach         0        0      NA                        <NA>
6    109             900 W Harrison St         1        6     201              900 W Harrison
7    109             900 W Harrison St         1        6     207              900 W Harrison
8    109             900 W Harrison St         1        6     208           Harrison & Morgan
9    109             900 W Harrison St         1        6   30068                 UIC-Halsted
10   109             900 W Harrison St         1        6   30069                 UIC-Halsted
11   109             900 W Harrison St         1        6   40350                 UIC-Halsted
12    21    Aberdeen St & Jackson Blvd         1        2      60          Jackson & Aberdeen
13    21    Aberdeen St & Jackson Blvd         1        2      92        Van Buren & Aberdeen
14    80       Aberdeen St & Monroe St         1        1     460          Madison & Aberdeen
15   346      Ada St & Washington Blvd         0        0      NA                        <NA>
16   341             Adler Planetarium         1        1    4877 Solidarity Dr & Planetarium
17   444          Albany Ave & 26th St         1        3    6317        26th Street & Albany
18   444          Albany Ave & 26th St         1        3    6318    26th Street & Sacramento
19   444          Albany Ave & 26th St         1        3    6397        26th Street & Albany
20   511 Albany Ave & Bloomingdale Ave         0        0      NA                        <NA>
# ... with 2,074 more rows

Again, thank you very much for your help! I couldn't have found a satisfactory solution without your comments.

cfss-old / fp-bobaekang

More question on the proximity variable #2