bobaekang opened this issue 7 years ago
On second thought, I came to wonder if I could use only a random sample of trips, obtained with `sample_n()`, rather than the entire dataset. Would this sampling approach be justifiable?
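Concretely, I have something like this in mind (a sketch; `trips` is a stand-in name for my full trip dataframe):

```r
library(dplyr)

# Draw a reproducible random sample of trips instead of using all 1.6M rows.
set.seed(2017)
trips_sample <- sample_n(trips, 5000)
```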
Thank you very much.
A random sample could work, though I wonder what in the code is slowing the operation down. Have you run `profvis()` on your function with a sample of input/output observations? Which part is taking the longest?
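For example, wrapping the call in `profvis()` produces a flame graph showing where the time goes (a sketch; `multiFuncFrom()` and `trips_sample` are stand-ins based on this thread):

```r
library(profvis)

# Profile the function on a small sample; the flame graph shows which
# step dominates the run time.
profvis({
  result <- multiFuncFrom(trips_sample)
})
```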
If you push your latest commit to GitHub, I can also help profile your code. This is why keeping your commits synced with GitHub is useful.
I wrote some code to tell me whether each Divvy trip is likely to be a multi-modal trip. The input is either A) a dataframe of information on the start of each Divvy trip or B) a dataframe of information on the end of each Divvy trip. A trip is flagged when both of the following conditions hold (a rough sketch of the check follows the criteria):
For A): 1) the trip starts at a Divvy station in proximity to a CTA stop, and 2) the trip starts 3 minutes or less after a public transit vehicle arrives at any nearby CTA stop.

For B): 1) the trip ends at a Divvy station in proximity to a CTA stop (<=50m, or about a quarter of a block), and 2) the trip ends 3 minutes or less before a public transit vehicle departs from any nearby CTA stop.
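To make the rule concrete, here is a rough sketch of the check for case A) applied to a single trip (all argument and column names are hypothetical stand-ins, not my actual code):

```r
# nearby_stops: CTA stops within 50m of the trip's start station
# arrivals: CTA arrival times at those nearby stops
is_multimodal_start <- function(trip_start_time, nearby_stops, arrivals) {
  if (nrow(nearby_stops) == 0) return(FALSE)  # condition 1: proximity
  gap <- difftime(trip_start_time, arrivals$arrival_time, units = "mins")
  any(gap >= 0 & gap <= 3)  # condition 2: starts <= 3 min after an arrival
}
```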
My code does the job, adding two columns (`multimode` and `multimode_num`) to the input: `multimode` is a binary variable, 1 for trips that meet the standard and 0 for the others, and `multimode_num` is the number of possible connections.

My problem is that the code is too slow. I tried it on only 5,000 observations for one of the directions, and it takes minutes on my machine to do the job. I have 1.6 million observations in total across both directions.
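For illustration, the output shape I am after looks like this toy example (made-up values; `connections` stands in for a precomputed per-trip count, not a column in my real data):

```r
library(dplyr)

# Toy illustration of the two added columns.
trips_toy <- tibble(trip_id = 1:4, connections = c(0L, 2L, 1L, 0L))
trips_toy %>%
  mutate(
    multimode_num = connections,                   # number of possible connections
    multimode     = as.integer(multimode_num > 0)  # 1 if the trip meets the standard
  )
```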
Is there any way for me to do this more efficiently? Are there any resources on distributed computing? I tried `multidplyr`, but my function does not work with it.

Here are the two key functions I use:
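(For context, the `multidplyr` pattern I attempted follows the usual partition/collect workflow, roughly like the sketch below; the names here are placeholders rather than my actual functions, and custom functions have to be copied to the workers:)

```r
library(dplyr)
library(multidplyr)

# Sketch of the partition/collect pattern (current multidplyr API).
cluster <- new_cluster(parallel::detectCores() - 1)
cluster_library(cluster, "dplyr")
# cluster_copy(cluster, "multiFuncFrom")  # custom functions must be shipped to workers

result <- trips_sample %>%
  group_by(trip_id) %>%   # groups are distributed across workers
  partition(cluster) %>%
  mutate(flag = 1L) %>%   # stand-in for the real per-trip computation
  collect()
```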
And here is an input example:
The `Departure` object, which is used by `multiFuncFrom()`, looks like this: