cfss-old / fp-bobaekang

fp-bobaekang created by GitHub Classroom

Code too slow #4

Open bobaekang opened 7 years ago

bobaekang commented 7 years ago

I wrote some code to tell me whether each Divvy trip is likely to be a multi-modal trip. The input is either A) a dataframe of the information on the start of each Divvy trip or B) a dataframe of the information on the end of each Divvy trip.

For A), a trip counts as multi-modal when: 1) it starts at a Divvy station in proximity to at least one CTA stop, and 2) it starts 3 minutes or less after a public transit vehicle arrives at any of those nearby CTA stops.

For B), a trip counts as multi-modal when: 1) it finishes at a Divvy station in proximity to at least one CTA stop (<=50m, or one quarter of a block), and 2) it finishes 3 minutes or less before a public transit vehicle departs from any of those nearby CTA stops.

My code does the job, adding two columns (multimode and multimode_num) to the input. multimode is a binary variable: 1 for trips that meet both criteria and 0 for the others. multimode_num is the number of possible connections for the trip.
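To make the two output columns concrete, here is a minimal sketch of the flagging rule on made-up data. The table names (`trips`, `schedule`), the `start_sec` and `depart_sec` columns (seconds since midnight), and all values are assumptions for illustration, not part of the original data. Like the functions further down, it treats a transit time within 3 minutes on either side of the trip time as a connection.

```r
library(dplyr)

# Toy data; every value here is invented for illustration
trips <- tibble(
  trip_id         = c(1L, 2L, 3L),
  from_station_id = c(109L, 109L, 56L),
  start_sec       = c(6 * 3600 + 20 * 60, 12 * 3600, 8 * 3600)
)
schedule <- tibble(
  from_station_id = c(109L, 109L, 109L),
  depart_sec      = c(6 * 3600 + 21 * 60 + 9, 6 * 3600 + 22 * 60, 16 * 3600 + 4 * 60 + 6)
)

flagged <- trips %>%
  # Every trip is paired with every scheduled time at its station
  left_join(schedule, by = "from_station_id", relationship = "many-to-many") %>%
  group_by(trip_id) %>%
  summarise(
    # Number of scheduled times within 180 seconds of the trip time
    multimode_num = sum(abs(start_sec - depart_sec) <= 180, na.rm = TRUE),
    # 1 if there is at least one such connection, 0 otherwise
    multimode     = as.integer(multimode_num > 0)
  )

print(flagged)
```

Trip 3 starts at a station with no schedule rows, so its count falls back to 0 through `na.rm = TRUE`. The `relationship` argument only silences the many-to-many warning that dplyr 1.1.0 and later emit.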

My problem is that the code is too slow. I tried it on only 5,000 observations for one of the directions, and it took minutes on my machine. I have a total of 1.6 million observations across both directions.

Is there any way for me to do this more efficiently, or are there any resources for distributed computing? I tried multidplyr, but my function does not work with it.

Here are the two key functions I use:

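# Packages the two functions rely on
library(dplyr)
library(lubridate)
library(stringr)

multiToFunc <- function(toInput){
  # With no `by`, left_join() matches on every column name the two tables share
  data <- left_join(toInput, Arrival)
  # Combine each trip's end date with the scheduled arrival time into a full timestamp
  arr <- ymd_hms(str_c(as_date(data$stoptime), data$arrival_time, sep = " "), tz = "America/Chicago")
  # 1 if the two times are at most 3 minutes apart; abs() accepts either order
  data$close <- (3 >= abs(difftime(data$stoptime, arr, tz = "America/Chicago", units = c("mins"))))*1
  # Count the close connections for each trip
  close_var <- data %>%
    group_by(trip_id) %>%
    summarise(multimode_num = sum(close == 1, na.rm = TRUE))
  # 1 if the trip has at least one close connection, 0 otherwise
  close_var$multimode <- as.logical(close_var$multimode_num)*1
  output <- toInput %>% left_join(close_var)
  return(output)
}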

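multiFromFunc <- function(fromInput){
  # With no `by`, left_join() matches on every column name the two tables share
  data <- left_join(fromInput, Departure)
  # Combine each trip's start date with the scheduled departure time into a full timestamp
  dep <- ymd_hms(str_c(as_date(data$starttime), data$depart_time, sep = " "), tz = "America/Chicago")
  # 1 if the two times are at most 3 minutes apart; abs() accepts either order
  data$close <- (3 >= abs(difftime(data$starttime, dep, tz = "America/Chicago", units = c("mins"))))*1
  # Count the close connections for each trip
  close_var <- data %>%
    group_by(trip_id) %>%
    summarise(multimode_num = sum(close == 1, na.rm = TRUE))
  # 1 if the trip has at least one close connection, 0 otherwise
  close_var$multimode <- as.logical(close_var$multimode_num)*1
  output <- fromInput %>% left_join(close_var)
  return(output)
}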
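One possible speed-up, offered as a sketch and not a confirmed fix: both functions paste a date and a time into a string and parse it with ymd_hms() for every joined row, which may well be the expensive step. Comparing seconds since midnight avoids the string round trip. The function name `multiFromFast`, the `schedule` argument (standing in for Departure), and `window_sec` are hypothetical. The sketch assumes `depart_time` is a difftime-based time column and, like the original, it does not match connections across midnight.

```r
library(dplyr)
library(lubridate)

# Hypothetical variant of multiFromFunc(); `schedule` stands in for Departure
multiFromFast <- function(fromInput, schedule, window_sec = 180) {
  sched <- schedule %>%
    # Assumes depart_time is difftime-based, so this yields seconds since midnight
    transmute(from_station_id, depart_sec = as.numeric(depart_time, units = "secs"))

  counts <- fromInput %>%
    transmute(
      trip_id, from_station_id,
      # Seconds since local midnight, computed without building any strings
      start_sec = hour(starttime) * 3600 + minute(starttime) * 60 + second(starttime)
    ) %>%
    inner_join(sched, by = "from_station_id", relationship = "many-to-many") %>%
    # Keep only pairs inside the window, then count them per trip
    filter(abs(start_sec - depart_sec) <= window_sec) %>%
    count(trip_id, name = "multimode_num")

  fromInput %>%
    left_join(counts, by = "trip_id") %>%
    mutate(
      # Trips with no pair inside the window get a count of 0
      multimode_num = coalesce(multimode_num, 0L),
      multimode     = as.integer(multimode_num > 0)
    )
}

# Toy input; the second trip's station has no schedule rows
trips <- tibble(
  trip_id         = c(1L, 2L),
  starttime       = ymd_hms(c("2016-04-30 06:20:00", "2016-04-30 23:59:00"),
                            tz = "America/Chicago"),
  from_station_id = c(109L, 123L)
)
schedule <- tibble(
  from_station_id = c(109L, 109L),
  depart_time     = as.difftime(c(22869, 57846), units = "secs")  # 06:21:09 and 16:04:06
)

result <- multiFromFast(trips, schedule)
print(result)
```

Filtering before counting also keeps the intermediate table small, since only the pairs inside the window survive past the join. Naming the join key explicitly is a deliberate difference from the original, which joins on every shared column.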

And here is an input example:

  ##DivvyData_fromtest
# A tibble: 100 × 12
   trip_id           starttime tripduration from_station_id              from_station_name   usertype gender birthyear
     <int>              <dttm>        <int>           <int>                          <chr>      <chr>  <chr>     <int>
1  9379901 2016-04-30 23:59:00          733             123 California Ave & Milwaukee Ave Subscriber   Male      1982
2  9379900 2016-04-30 23:58:00          556             349    Halsted St & Wrightwood Ave Subscriber   Male      1991
3  9379897 2016-04-30 23:52:00         1146             239       Western Ave & Leland Ave   Customer   <NA>        NA
4  9379896 2016-04-30 23:49:00         1291             239       Western Ave & Leland Ave   Customer   <NA>        NA
5  9379895 2016-04-30 23:46:00          451              56      Desplaines St & Kinzie St Subscriber   Male      1988
6  9379894 2016-04-30 23:45:00         1954             129      Blue Island Ave & 18th St Subscriber   Male      1992
7  9379893 2016-04-30 23:38:00          226             300           Broadway & Barry Ave Subscriber   Male      1984
8  9379891 2016-04-30 23:36:00          369             131      Lincoln Ave & Belmont Ave Subscriber   Male      1991
9  9379890 2016-04-30 23:35:00          201             318 Southport Ave & Irving Park Rd Subscriber   Male      1974
10 9379889 2016-04-30 23:32:00          559             301         Clark St & Schiller St Subscriber Female      1986
# ... with 90 more rows, and 4 more variables: from_lon <dbl>, from_lat <dbl>, from_prox <dbl>, from_prox_num <dbl>

The Departure object, which is used by `multiFromFunc()`, looks like this:

Departure
# A tibble: 339,112 × 6
   from_station_id fro_prox from_prox_num stop_id      stop_name depart_time
             <int>    <dbl>         <dbl>   <int>          <chr>      <time>
1              109        1             1     207 900 W Harrison    06:21:09
2              109        1             1     207 900 W Harrison    06:21:09
3              109        1             1     207 900 W Harrison    06:43:39
4              109        1             1     207 900 W Harrison    06:43:39
5              109        1             1     207 900 W Harrison    07:06:09
6              109        1             1     207 900 W Harrison    07:06:09
7              109        1             1     207 900 W Harrison    16:04:06
8              109        1             1     207 900 W Harrison    16:04:06
9              109        1             1     207 900 W Harrison    20:52:39
10             109        1             1     207 900 W Harrison    20:52:39
# ... with 339,102 more rows
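
The ten rows printed above come in identical pairs. If that reflects real duplication in the table and not distinct transit runs that differ in a column that was dropped, each duplicate doubles the work of the join and inflates multimode_num. A quick check, sketched on a toy copy of the first four rows (the `schedule` name is an assumption, and depart_time is stored as text here for simplicity):

```r
library(dplyr)

# Toy stand-in for Departure, copied from the first printed rows
schedule <- tibble(
  from_station_id = rep(109L, 4),
  stop_id         = rep(207L, 4),
  depart_time     = c("06:21:09", "06:21:09", "06:43:39", "06:43:39")
)

# Rows that repeat an earlier row exactly
n_dupes <- sum(duplicated(schedule))

# Same table with exact duplicates removed
schedule_unique <- distinct(schedule)

print(n_dupes)
print(nrow(schedule_unique))
```

Whether dropping the duplicates is appropriate depends on how Departure was built, which the thread does not show.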

bobaekang commented 7 years ago

On second thought, I wonder whether I could run this on a random sample of trips drawn with sample_n() instead of the entire dataset. Would this sampling approach be justifiable?
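A minimal sketch of what the sampling route could look like, on simulated data. A simple random sample gives an unbiased estimate of the share of multi-modal trips, and the standard error shows how much precision is lost. Here `population`, the 20% rate, and the sample size of 1,000 are invented for illustration. In practice the sample would be drawn first and the flagging function run only on the sampled trips.

```r
library(dplyr)

set.seed(2016)  # makes the draw reproducible

# Simulated trips that already carry a flag; the 20% rate is made up
population <- tibble(
  trip_id   = 1:10000,
  multimode = rbinom(10000, 1, 0.2)
)

# Simple random sample without replacement
trip_sample <- sample_n(population, 1000)

# Estimated share of multi-modal trips and its approximate 95% interval
p_hat <- mean(trip_sample$multimode)
se    <- sqrt(p_hat * (1 - p_hat) / nrow(trip_sample))
ci    <- p_hat + c(-1.96, 1.96) * se

print(c(estimate = p_hat, lower = ci[1], upper = ci[2]))
```

Whether this precision is enough depends on the analysis: an estimate for all trips is well served by a sample, while results broken down by station or hour leave far fewer sampled trips per group.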

Thank you very much.

bensoltoff commented 7 years ago

A random sample could work, though I wonder which part of the code is slowing the operation down. Have you run profvis() on your function with a sample of the input observations? Which part takes the longest?
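
profvis() gives a line-by-line breakdown. A cruder check that needs no profiling package is to wrap individual steps in system.time(). The sketch below times the string-building and parsing steps on synthetic input. The vector names and the size `n` are assumptions, and the timings will differ by machine.

```r
library(lubridate)

set.seed(1)
n <- 20000

# Synthetic date and time strings, similar in shape to what the functions build
dates <- rep("2016-04-30", n)
times <- sprintf("%02d:%02d:%02d",
                 sample(0:23, n, replace = TRUE),
                 sample(0:59, n, replace = TRUE),
                 sample(0:59, n, replace = TRUE))

# Time each step separately to see where the seconds go
t_paste <- system.time(stamps <- paste(dates, times))
t_parse <- system.time(parsed <- ymd_hms(stamps, tz = "America/Chicago"))

timings <- c(paste = t_paste[["elapsed"]], parse = t_parse[["elapsed"]])
print(timings)
```

Timing the join and the group_by() and summarise() steps the same way would show whether parsing or the size of the joined table dominates.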

bensoltoff commented 7 years ago

If you push your latest commit to GitHub, I can also help profile your code. This is why keeping your commits synced with GitHub is useful.