Open bobaekang opened 7 years ago
After creating distance_m
, you don't want to convert this to a logical matrix. Instead, create a data frame with three columns:
id
from DivvyStation
)The id for the CTA station that is closest to the Divvy station. Basically find which column has the minimum value for each row of distance_m
. This could be a for
loop, or some type of apply
function that iterates over each row (you cannot just use rowSums
for this). This code tells you which column in distance_m
is the minimum distance for each row.
apply(distance_m, 1, function(x) which(min(x)))
From here you can use this to get the actual stop_id
from CTAStops
.
distance_m
With these three pieces of information you could join DivvyStation
and CTAStops
using the new data frame as the link between the two (use some sort of _join
statement(s) like we learned last week). That gives you the joined table, then you could filter or create a new column that identifies if the closest CTA station is within 150 feet
@bensoltoff Thank you for your insight. The following is what I tried:
Divvy_m <- cbind(DivvyStation$lon, DivvyStation$lat)
CTA_m <- cbind(CTAStops$stop_lon, CTAStops$stop_lat)
distance_m <- distm(Divvy_m, CTA_m, fun = distHaversine) # a 535 by 11520 matrix for distance
distance150 <- distance_m <= 150 # check if the distance is <=150 meters or approximately 0.1 mile
proximity150 <- (rowSums(distance150) > 0)*1 # a Divvy station is <=150m from any CTA stop, 1; otherwise, 0
DivvyStation$proximity <- proximity150
index150 <- which(distance_m <= 150, arr.ind = T) # matrix of indices where the distance is <= 150
for (i in range(1, ncol(index150))){ # switching the index number to id numbers
Divvyindex <- index150[i,1]
CTAindex <- index150[i,2]
DivvyId <- DivvyStation$id[Divvyindex]
CTAId <- CTAStops$stop_id[CTAindex]
index150[i,1] <- DivvyId
index150[i,2] <- CTAId
}
colnames(index150) <- c('id', 'stop_id') # matching the column names to those in `DivvyStation` and `CTAStops`
index150 <- index150 %>% as_data_frame()
test <- left_join(DivvyStation, index150)
It seems to work to some extent, although I am seeing mismatches: where the proximity
is 1 but not getting any CTA stops matched, or 'proximity' is 0 but getting CTA stops matched. The following is the first 30 rows of the joined df:
> print(test, n = 30)
# A tibble: 1,990 × 8
id name lat lon dpcapacity online_date proximity stop_id
<int> <chr> <dbl> <dbl> <int> <chr> <dbl> <int>
1 456 2112 W Peterson Ave 41.99118 -87.68359 15 5/12/2015 1 748
2 456 2112 W Peterson Ave 41.99118 -87.68359 15 5/12/2015 1 811
3 101 63rd St Beach 41.78102 -87.57612 23 4/20/2015 0 1318
4 101 63rd St Beach 41.78102 -87.57612 23 4/20/2015 0 1319
5 101 63rd St Beach 41.78102 -87.57612 23 4/20/2015 0 1322
6 101 63rd St Beach 41.78102 -87.57612 23 4/20/2015 0 10581
7 101 63rd St Beach 41.78102 -87.57612 23 4/20/2015 0 11110
8 101 63rd St Beach 41.78102 -87.57612 23 4/20/2015 0 11111
9 101 63rd St Beach 41.78102 -87.57612 23 4/20/2015 0 11337
10 101 63rd St Beach 41.78102 -87.57612 23 4/20/2015 0 11338
11 101 63rd St Beach 41.78102 -87.57612 23 4/20/2015 0 11392
12 101 63rd St Beach 41.78102 -87.57612 23 4/20/2015 0 11502
13 109 900 W Harrison St 41.87468 -87.65002 19 8/6/2013 1 1283
14 109 900 W Harrison St 41.87468 -87.65002 19 8/6/2013 1 9937
15 21 Aberdeen St & Jackson Blvd 41.87773 -87.65479 15 6/21/2013 1 384
16 21 Aberdeen St & Jackson Blvd 41.87773 -87.65479 15 6/21/2013 1 419
17 21 Aberdeen St & Jackson Blvd 41.87773 -87.65479 15 6/21/2013 1 9710
18 21 Aberdeen St & Jackson Blvd 41.87773 -87.65479 15 6/21/2013 1 10084
19 21 Aberdeen St & Jackson Blvd 41.87773 -87.65479 15 6/21/2013 1 10085
20 80 Aberdeen St & Monroe St 41.88046 -87.65393 19 6/26/2013 1 116
21 80 Aberdeen St & Monroe St 41.88046 -87.65393 19 6/26/2013 1 200
22 80 Aberdeen St & Monroe St 41.88046 -87.65393 19 6/26/2013 1 7876
23 346 Ada St & Washington Blvd 41.88283 -87.66121 15 10/10/2013 0 11370
24 346 Ada St & Washington Blvd 41.88283 -87.66121 15 10/10/2013 0 11371
25 346 Ada St & Washington Blvd 41.88283 -87.66121 15 10/10/2013 0 11516
26 341 Adler Planetarium 41.86610 -87.60727 19 10/9/2013 1 3979
27 341 Adler Planetarium 41.86610 -87.60727 19 10/9/2013 1 3980
28 341 Adler Planetarium 41.86610 -87.60727 19 10/9/2013 1 4033
29 444 Albany Ave & 26th St 41.84448 -87.70201 19 4/29/2015 1 1649
30 444 Albany Ave & 26th St 41.84448 -87.70201 19 4/29/2015 1 1680
# ... with 1,960 more rows
I am not exactly sure where it went wrong. I will keep working on this. Thank you very much!
I don't think you're properly isolating the proximate stop. I don't have an API token to reproduce this. Can you push the data and your code as part of this repo? If so I can clone it and test some code.
@bensoltoff Dr. Soltoff. I have added the data and script for this issue. Please take a look at the fp-issue02.R script for reproducing what I have provided above in this issue thread. If I managed it correctly, you should not need any other script or data to make fp-issue02 to run. The code for creating an object DivvyStation_prox
is not in the script, but what I did was a simple filtering, which should look like the following:
DivvyStation_prox <- DivvyStation %>% filter(proximity == 1)
Thank you very much for your help. I will be looking forward to your comment.
##---------------------------------------------------------------##
## This script reads Divvy station and CTA stop data into R, in ##
## order to creat a tidy data frame for Divvy Station data frame ##
## with spatial variable. This is a mini script for the Issue 02 ##
##---------------------------------------------------------------##
library(tidyverse)
library(feather)
library(geosphere)
# Read the CTA stop file
CTAStops <- read_csv("data/stops.txt")
# Read the Divvy station file
DivvyStation <- read_csv("data/Divvy_Trips_2016_Q1Q2/Divvy_Stations_2016_Q1Q2.csv")
colnames(DivvyStation) <- c("id", "name", "lat", "lon", "dpcapacity", "online_date")
# adding a proximity variable to station data
Divvy_m <- cbind(DivvyStation$lon, DivvyStation$lat)
CTA_m <- cbind(CTAStops$stop_lon, CTAStops$stop_lat)
distance_m <- distm(Divvy_m, CTA_m, fun = distHaversine) # a 535 by 11520 matrix for distance
# find which CTA stations are the closest - note that for some Divvy stations,
# there may be multiple CTA stations the same minimum distance away
closest <- apply(distance_m, 1, function(x) which(x == min(x))) %>%
# convert to data frame
map_df(as_data_frame, .id = "id") %>%
rename(Divvyrow = id,
CTArow = value) %>%
# obtain distance from distance_m
mutate(Divvyrow = as.numeric(Divvyrow),
distance = map2_dbl(Divvyrow, CTArow, function(x, y) distance_m[x,y]),
close = if_else(distance <= 150, TRUE, FALSE))
# join DivvyStation and closest
## need to create a new DivvyStation column using row_number()
## to enable the join
DivvyCTA <- DivvyStation %>%
mutate(Divvyrow = row_number()) %>%
left_join(closest) %>%
# now join with CTAStops - also create row_number() columns
left_join(CTAStops %>%
mutate(CTArow = row_number()))
str(DivvyCTA)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 607 obs. of 19 variables:
$ id : int 456 101 109 21 80 346 341 444 511 376 ...
$ name : chr "2112 W Peterson Ave" "63rd St Beach" "900 W Harrison St" "Aberdeen St & Jackson Blvd" ...
$ lat : num 42 41.8 41.9 41.9 41.9 ...
$ lon : num -87.7 -87.6 -87.7 -87.7 -87.7 ...
$ dpcapacity : int 15 23 19 15 19 15 19 19 15 35 ...
$ online_date : chr "5/12/2015" "4/20/2015" "8/6/2013" "6/21/2013" ...
$ Divvyrow : num 1 2 3 4 5 6 7 8 9 10 ...
$ CTArow : int 7774 6854 162 49 324 326 3440 4400 2848 9819 ...
$ distance : num 76.1 633.6 25.5 37.3 142.6 ...
$ close : logi TRUE FALSE TRUE TRUE TRUE FALSE ...
$ stop_id : int 11458 10198 207 60 460 463 4877 6397 4051 15344 ...
$ stop_code : int 11458 10198 207 60 460 463 4877 6397 4051 15344 ...
$ stop_name : chr "Peterson & Target (2100 W)" "Larabida Hospital" "900 W Harrison" "Jackson & Aberdeen" ...
$ stop_desc : chr "Peterson & Target (2100 W), Eastbound, Southside of the Street" "Larabida Hospital, Northbound, Eastside of the Street" "900 W Harrison, Westbound, Northside of the Street" "Jackson & Aberdeen, Eastbound, Southeast Corner" ...
$ stop_lat : num 42 41.8 41.9 41.9 41.9 ...
$ stop_lon : num -87.7 -87.6 -87.6 -87.7 -87.7 ...
$ location_type : int 0 0 0 0 0 0 0 0 0 0 ...
$ parent_station : chr NA NA NA NA ...
$ wheelchair_boarding: int 1 1 1 1 1 1 1 1 1 1 ...
Note the use of row_number()
to create columns in CTAStops
and DivvyStations
that enables the matching using closest
as the bridge data frame. You cannot use the original id columns because they do not match to the row and column ids in distance_m
.
@bensoltoff Thank you for pointing out the mismatch between the original ids and the row/column ids in distance_m
. Because I wanted to keep all CTA stops instead of only minimum distance ones, I have tried a different approach. Here is my code:
# adding two proximity variables to station data
Divvy_m <- cbind(DivvyStation$lon, DivvyStation$lat)
CTA_m <- cbind(CTAStops$stop_lon, CTAStops$stop_lat)
distance_m <- distm(Divvy_m, CTA_m, fun = distHaversine) # a 535 by 11520 matrix for distance
distance150 <- distance_m <= 150 # check if the distance is <=150 meters or approximately 0.1 mile
proximity150 <- (rowSums(distance150) > 0)*1 # binary; a Divvy station is <=150m from any CTA stop, 1; otherwise, 0
DivvyStation$proximity <- proximity150
proximity150_2 <- rowSums(distance150) # non-binary; number of close CTA stops
DivvyStation$prox_num <- proximity150_2
# create a data frame linking Divvy stataions and CTA stops close to each other
index150 <- which(distance150 == TRUE, arr.ind = T) # matrix of indices where the distance is <= 150
index150_df <- as_data_frame(index150)
index150_df <- index150_df[order(index150_df$row, index150_df$col),]
colnames(index150_df) <- c('Divvyindex', 'CTAindex')
DivvyStation$Divvyindex <- sequence(nrow(DivvyStation))
CTAStops$CTAindex <- sequence(nrow(CTAStops))
CTAindex <- CTAStops %>% select(stop_id, stop_name, CTAindex)
DivvyCTAProx <- DivvyStation %>%
left_join(index150_df) %>%
left_join(CTAindex) %>%
select(id, name, proximity, prox_num, stop_id, stop_name)
I added above the column for the number of close CTA stops (prox_num
) to see to verify the result of linking Divvy stations and CTA stops. Fortunately, the result appears to be what I expected:
> print(DivvyCTAProx, n = 20)
# A tibble: 2,094 × 6
id name proximity prox_num stop_id stop_name
<int> <chr> <dbl> <dbl> <int> <chr>
1 456 2112 W Peterson Ave 1 4 11457 Rosehill Cemetery
2 456 2112 W Peterson Ave 1 4 11458 Peterson & Target (2100 W)
3 456 2112 W Peterson Ave 1 4 11483 2100 W Peterson
4 456 2112 W Peterson Ave 1 4 11484 Rosehill Cemetery
5 101 63rd St Beach 0 0 NA <NA>
6 109 900 W Harrison St 1 6 201 900 W Harrison
7 109 900 W Harrison St 1 6 207 900 W Harrison
8 109 900 W Harrison St 1 6 208 Harrison & Morgan
9 109 900 W Harrison St 1 6 30068 UIC-Halsted
10 109 900 W Harrison St 1 6 30069 UIC-Halsted
11 109 900 W Harrison St 1 6 40350 UIC-Halsted
12 21 Aberdeen St & Jackson Blvd 1 2 60 Jackson & Aberdeen
13 21 Aberdeen St & Jackson Blvd 1 2 92 Van Buren & Aberdeen
14 80 Aberdeen St & Monroe St 1 1 460 Madison & Aberdeen
15 346 Ada St & Washington Blvd 0 0 NA <NA>
16 341 Adler Planetarium 1 1 4877 Solidarity Dr & Planetarium
17 444 Albany Ave & 26th St 1 3 6317 26th Street & Albany
18 444 Albany Ave & 26th St 1 3 6318 26th Street & Sacramento
19 444 Albany Ave & 26th St 1 3 6397 26th Street & Albany
20 511 Albany Ave & Bloomingdale Ave 0 0 NA <NA>
# ... with 2,074 more rows
Again, thank you very much for your help! I couldn't have found a satisfactory solution without your comments.
Working on my final project over this weekend, I realized that I actually needed more information than simply adding a dummy variable for proximity, which I asked about in the previous issue post.
More specifically, for those stations in proximity with any CTA stop (i.e.,
proximity == 1
from the last issue post), I need to find out with which CTA stop(s) each Divvy Station is in proximity.I think I may start with getting the data for Divvy stations with
proximity == 1
, usingfilter()
and calling itDivvyStation_prox
. This looks like the following:And, of course, I have the
CTAStops
data:How can I proceed from here?