Duane321 / reinforcement_learning_for_rideshare_pricing

MIT License

The Simulation #1

Open Duane321 opened 2 months ago

Duane321 commented 2 months ago

What we need is a simulation, just like an RL environment. This will be an episodic environment, and it will represent 1 week. It will not take the form of an MDP, as you saw in the video. Rather, we'll be simulating something where the MDP is only implicit. Here are some things to know about the simulation:

Learning

The goal is to optimize reward. Over many simulated weeks, we should see Lyft's reward improve. So f should depend on some parameters theta, and then we should do stochastic gradient descent, a la 'policy gradient methods'. Since there's a lot of sampling in this simulation, gradient-descenting through that might not be easy. To do this, we could 1) use the reparameterization trick (which I can explain) or 2) allow the model to know the drivers and rider elasticity parameters. This way we could construct a differentiable log likelihood that is equivalent in expectation to the reward of the simulation. This is totally cheating, but it's akin to something you might do in reality (e.g. estimate elasticity parameters via experiments and then use this routine, accepting that the routine is just a model of the real environment).
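A minimal sketch of option 2, assuming the model is allowed to know the elasticity parameters: if acceptance probability is a known differentiable function of price, the expected reward becomes differentiable in theta and autograd gives the gradient directly. The sigmoid acceptance model, the elasticity value, and all variable names here are illustrative assumptions, not the project's actual pricing function.

```python
import torch

theta = torch.tensor([1.0], requires_grad=True)  # pricing parameter (price per unit distance)
elasticity = 0.5                                  # assumed known rider elasticity

distances = torch.rand(1000) * 10                 # simulated trip distances
prices = theta * distances                        # toy pricing function f
# toy acceptance model: riders accept less often as price exceeds a baseline
accept_prob = torch.sigmoid(-elasticity * (prices - distances))
expected_revenue = (accept_prob * prices).sum()   # differentiable surrogate reward

expected_revenue.backward()                       # gradient flows through accept_prob
```

An SGD step on `theta` using `theta.grad` would then play the role of the policy-gradient update; the real simulation would replace the toy acceptance model with the estimated elasticity model.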

Other considerations

The simulation needs to be as fast as possible. So try to compute everything you can in large arrays. This might be complicated by some things being necessarily sequential/dependent on the past, like the movement of the driver after a rider is accepted. I could see that being something we remove, for the sake of speed. Also, we should be using PyTorch for autograd.

Bonus stuff

To Do

@liux3372 What would help is for you to tell me how you would do this. Tell me about the Python classes you'd create, what they'd do, and how everything would come together to create a learned policy. I'd like us to converge on an implementation plan before you start executing.

liux3372 commented 2 months ago

Thank you for providing the initial requirements; here are several questions in terms of the setup.

Let’s align on the system design then we will worry about how to do SGD like you mentioned in the Learning section.

Duane321 commented 2 months ago

Hey @liux3372, answering your Q's.

'What are the common practices for the distributions ...?'

We don't need to reference papers for this. Just use your intuition. We could say there are R = 1000 riders. 50% are commuters, 20% are party goers and 30% are sporadic. Let's say rider 'r_abc' is a commuter. Then we have a few random values to determine. First, how many requests did 'r_abc' make in the week? That could be a sample of Poisson(lambda). Then, for each ride request, what time did the ride request appear? Since they are commuters, they should be more likely to take rides between 6-10AM and 4PM-6PM during the week. You could do this by mixing a uniform distribution across all minutes of the week with Gaussians that have humps centered on the commuting hours. Another thing we need a distribution over is the destination. You could make it uniform, but it might be nice to have a circular region in the grid represent 'the city', so commuters are more likely to go toward the city. If the grid has some irregularity (e.g. city and airport), that will make for interesting plots at the end. We are trying to create complexity for the RL agent to learn around.
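The mixture described above can be sketched as follows. All mixture weights, means, and standard deviations here are illustrative assumptions, not tuned values.

```python
import torch

# Sample commuter request times: mix a uniform over the 10,080 minutes of
# the week with Gaussians centered on the two commute windows.
n = 1000
minutes_per_week = 7 * 24 * 60

# component 0 = uniform background, 1 = morning hump, 2 = evening hump
component = torch.multinomial(torch.tensor([0.2, 0.4, 0.4]), n, replacement=True)
uniform = torch.rand(n) * minutes_per_week
morning = torch.normal(8 * 60.0, 60.0, size=(n,))    # hump centered near 8 AM
evening = torch.normal(17 * 60.0, 60.0, size=(n,))   # hump centered near 5 PM
day_offset = torch.randint(0, 7, (n,)) * 24 * 60     # place the hump on a random day

times = torch.where(component == 0, uniform,
        torch.where(component == 1, morning + day_offset, evening + day_offset))
times = times.clamp(0, minutes_per_week - 1)         # keep everything inside the week
```

The same pattern works for party goers (humps on weekend evenings) by swapping the means and weights.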

For drivers, there are no buckets of drivers. Drivers sit in a subgrid (e.g. x=[.1, .2], y=[.4,.5]) and wait for ride requests. They can have a random but not necessarily uniform distribution over the subgrids. Also, their idle-waiting time should be random, like an exponential random variable.
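A small sketch of that driver setup, assuming a 4x4 grid of 16 subgrids; the weights and the 180-minute mean idle time are placeholder values.

```python
import torch

n_drivers = 100
# random but non-uniform distribution over the 16 subgrids
subgrid_weights = torch.rand(16)
subgrid = torch.multinomial(subgrid_weights, n_drivers, replacement=True)

# exponential idle-waiting time with an assumed mean of 180 minutes
mean_idle_minutes = 180.0
idle_duration = torch.distributions.Exponential(1.0 / mean_idle_minutes).sample((n_drivers,))
```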

Shall I sample delta R and delta D every 15 minutes, or ... ?

I see where you're going, but I'd say no. R and D are the number of riders and drivers, so I'm happy to consider those fixed. If prices are too high, that can make riders less likely to produce ride requests (via some distribution parameter mentioned above). If drivers make too little per hour, maybe they will idle for less time? Idk exactly at the moment.

What would be a good number for max_R and max_D at any given time?

I'd say R = 1000 and D = 100 to start is fine. I like how you are thinking in terms of vector ops up front. We will want to pre-compute the simulation as much as possible in batches of random variables.

We may add the current distance between ..

Smart! I hadn't thought about this. We may factor this in later, but I wouldn't worry about it yet. If matches are far apart, drivers will spend time driving without getting paid, which means their earnings per hour will be low. And maybe that will have some feedback on the simulation (e.g. short idle times later in the week)

Note: Any long term effect actually happens within the same week. Riders/Drivers have no memory between weeks, which makes our life easier.

If a rider’s request was rejected by a driver, shall we...

No, I fear that'll be a slow matching operation. Just assume the driver stays idle and the ride request disappears (e.g. they take an Uber ride instead). Whatever you build, the match action should be fast. That's why I'm thinking 'any driver in the subgrid of the ride request'.

If a rider rejects a price, will the rider disappear?

No, the rider doesn't disappear. But maybe the rider will be less likely to produce a ride request later in the week?

Add the rush hour surcharge to encourage drivers?

That could be something we add later, but no need at this point.

Do we use databases (to store trips, riders, drivers etc.), or is that too complicated?

No, the data wouldn't get that big I don't think. Because we will simulate a week, update the policy, throw the data away, re-simulate, update the policy, etc... At the end, we will have a learned policy and then we'll run a few extra simulations to evaluate/inspect its performance.

Some other comments

liux3372 commented 2 months ago

a few things to be mentioned for our catchup:

Things I have done(may do more testing and visualization):

Things to be done next:

Things to be done later:

Duane321 commented 2 months ago

@liux3372 Some follow up things from our conversation:

Duane321 commented 2 months ago

@liux3372 Also, can you create a PR with your code? No need to merge it yet but I'd like to have a look

liux3372 commented 2 months ago

sure, let me create a PR.

I think I now have a better understanding of how to simulate riders and drivers using vectors in a day.

@Duane321 How often should we do the matching within one day, btw? I assume during each time_frame of the day, we will do the matching for each of the 16 squares. In each square, we get the riders and drivers within that square, then we check whether the request_time is within [idle_start_time, idle_start_time + idle_duration]. In addition, we use a set or a binary label to indicate whether the driver is busy. And we do the random assignment of drivers and the trip acceptance test based on the pricing function.

For the next time_frame of the day, we need to update the busy_drivers set or the driver's busy label based on the last trip_creation_time and the trip_duration for each busy driver. Then we continue the matching and pricing accordingly.

We should be prepared for the next day's distributional parameters to be updated:

Yes, I'll have a Poisson parameter for each rider based on their ride_type (determined at the start of each week) and num_rejects (determined at the end of each day). On each day I'll sample whether the rider will make the request based on the daily lambda, and update the lambda based on num_rejects at the end of each day.

Duane321 commented 2 months ago

I see what you're saying. Do the match every 30 minutes.

We can at least sample the demand side in one day. Say D is a pytorch tensor representing all the demand samples of one day. It could have shape (RR, 5), where RR is a random variable giving the number of ride requests for the day. D[:,0] gives all the times of the ride_requests, D[:,1] gives the x_start, D[:,2] gives the y_start, D[:,3] gives the x_end and D[:,4] gives the y_end. We should be able to generate D very quickly, in either one or five calls of a random sampler.
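The demand tensor described above can be generated in a handful of vectorized calls; the Poisson mean for the daily request count is a placeholder assumption.

```python
import torch

# D has shape (RR, 5): columns are (time, x_start, y_start, x_end, y_end).
RR = int(torch.poisson(torch.tensor(2000.0)))   # random number of daily ride requests
minutes_per_day = 24 * 60

D = torch.empty(RR, 5)
D[:, 0] = torch.rand(RR) * minutes_per_day      # request times (uniform placeholder)
D[:, 1:] = torch.rand(RR, 4)                    # start/end positions on the unit grid
D = D[D[:, 0].argsort()]                        # sort by time for half-hour slicing
```

Sorting by time means each half-hour interval is a contiguous slice of D, which keeps the per-interval matching step cheap.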

Similarly, say S0 gives the supply information at the start of the 0-th half hour interval. So we have S0, ..., S47 for the day. These S's, as you were anticipating, need to be produced in sequence. Also, S0's first dimension should have length 100, since that's how many drivers there are. S1 will have a shorter first dimension, since some drivers will be busy. S2 will have fewer than 100 rows, but it could be more than S1, because some drivers become available again.

Matching should happen at the start of each half hour interval, like M0 = match(D, S0). For this, I'd discretize all ride request starting positions to 0, ..., 15 to represent their grid cell. Do the same for all driver idling positions. Then I think we'd have to loop over each grid cell and then perform the match. The match operations should all be vectorized. That's what was afforded to us by doing the discretization.
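The discretization step can be sketched like this; the toy D and S0 tensors and their column layout are assumptions for illustration only.

```python
import torch

def cell_id(x, y, k=4):
    # map unit-square positions to cell ids 0..15 on a k x k grid;
    # clamp so that a coordinate of exactly 1.0 lands in the last column
    cx = (x * k).long().clamp(max=k - 1)
    cy = (y * k).long().clamp(max=k - 1)
    return cy * k + cx

# toy stand-ins for the demand and supply tensors; layouts are illustrative
D = torch.rand(500, 5)    # (time, x_start, y_start, x_end, y_end)
S0 = torch.rand(100, 3)   # (driver_id, x_idle, y_idle)

req_cells = cell_id(D[:, 1], D[:, 2])     # grid cell of each ride request
drv_cells = cell_id(S0[:, 1], S0[:, 2])   # grid cell of each idle driver

for c in range(16):
    req_idx = (req_cells == c).nonzero(as_tuple=True)[0]
    drv_idx = (drv_cells == c).nonzero(as_tuple=True)[0]
    # the vectorized match for cell c would pair req_idx with drv_idx here
```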

I'll leave it to you to figure out how M0 impacts S1. That is, how matches in the first half hour impact the available drivers in the second half hour. It's important that M0 impacts where the drivers are idle in S1. This is where we get some cross-cell dependency.

We should be aware of a few loops we have:

Other thoughts:

Hopefully this makes sense. It's always possible there's some crucial detail of the simulation I'm not anticipating.

liux3372 commented 2 months ago

@Duane321 Apologies if this is too long.

To give a mid-week update and see if I'm doing the right things, here is how I simulate riders and driver and do the matching. Feel free to review and comment.

Riders are simulated as a torch vector of (num_riders) × (request_times, start_locations, end_locations). Each start_location is an (x, y) position sampled uniformly on the 0-1 grid. (Let's do the cities and airports later.)

Each rider will have a lambda for a Poisson distribution to sample the number of requests every day. Those lambdas will be updated based on the number of rejects riders experienced during the week.

Each request_time is sampled either from a normal distribution with mean at the morning peak hours, afternoon peak hours, or evening party hours, or uniformly, to represent the different rider types. For simplicity, I take 250 riders per type (i.e. the first 250 rows in the vector are the riders whose ride times are normally distributed with mean 8 * 60 and std of 60 minutes).

Drivers are simulated as a torch vector of (num_drivers) × (idle_start_time, idle_duration, driver_positions). Each idle_start_time is sampled with 80% probability on daytime (7 AM to 10 PM) and 20% on the rest, with each hour within a window picked equally likely. Each idle_duration is sampled from an exponential distribution with 1.0 / mean_idle_time as the parameter, where mean_idle_time is initially set to 180 minutes. Those mean_idle_times will be updated based on the number of rejects drivers experienced during the week. Each driver is placed on a sub-block (0 to 15) uniformly, and each driver's initial position is randomly assigned within its sub-block.

In terms of matching (work in progress), for every 30-min interval, we iterate through each sub-block. For rider requests, we mask on the request timestamp, which has to stay within [idle_start_time, idle_start_time + idle_duration], and on the sub-block id; for drivers, we randomly pick a valid driver (if no valid driver, cancel the request right away), do the matching (based on acceptance probs by both parties), and for accepted trips update their idle_start_time to request_time + trip_duration and their sub-block id to the destination's sub-block id. To update each driver's location every day, I have an extra vector of (num_drivers) × (sub_blocks) to indicate a driver's location. We do matching by taking masks on this vector to get drivers in the same sub-block. And there is another vector of (num_drivers) × (status) to show if a driver is valid (i.e. not on a trip).

**If we have to loop over these half-hour intervals, then the days don't afford us anything. We might as well loop over 48 × 7 time windows and update everything.** We sample the number of requests and the drivers' idle_start_time and idle_duration every day based on their rejects during the week, and we do the matching every 30 mins to handle each request and set driver status (idle or busy). I think by doing it this way, we simulate reality quite well and we are able to handle the computing complexity.

I'll keep working on how to parallelize on the sub-blocks.

Note: if many requests are not able to find even one valid driver, we may consider adding more total drivers.

Duane321 commented 2 months ago

Hey, good work! Just some comments:

Riders are simulated as a torch vector of (num_riders) * (request_times, start_locations, end_locations)

Do you mean ride requests? We need to keep separate the idea of the pool of riders and the requests they produce.

Each start_locations is a (x, y) position sampled uniformly on the 0-1 grid. (Let's do the cities and airports later)

Yes!

Each rider will have a lambda for a Poisson distribution to sample the number of requests every day. Those lambdas will be updated based on the number of rejects happened during the week by riders.

Yes!

Each request_time is sampled either from a normal distribution with mean of the morning peak hours, afternoon peak hours, evening party hours or uniformly to represent different rider types. For simplicity, I take 250 riders each type (i.e. the first 250 rows in the vector are the riders who will take rides on a normal distributed time with mean of 8 * 60 and std of 60 minutes).

This sounds right. It'll help me to see this in a plot. For a type of rider, show me their distribution of requests over the day. That is, if this rider makes a request, what time is it likely to be? Note that this is conditional: given that a request is made, when does it happen? How many requests happen is dictated by that Poisson, as you mentioned.

Drivers are simulated as a torch vector of (num_drivers) * (idle_start_time, idle_duration, driver_positions) Each idle_start_time is sampled with 80% on daytime (7 AM to 10 PM) and 20% the rest of time, with each hour will be picked equally likely. Each idle_duration is sampled from an exp distribution with 1.0 / mean_idle_time as the parameter where mean_idle_time is initially set to 180 minutes. Those mean_idle_times will be updated based on the number of rejects happened during the week by drivers. Each driver_positions are placed on the sub-block 0 to 15 uniformly and each driver's initial position will be randomly assigned within each sub-block.

Great, yes!

In terms of matching(work-in-progress), for every 30-min interval, we iterate through each sub-block. For rider requests, we mask on the request timestamp, which has to stay within [idle_start_time, idle_start_time+idle_duration] and the sub-block id; for drivers, we randomly pick a valid driver(if no valid driver, cancel the request right away), do the matching(based on acceptance probs by both parties), and update their idle_start_time to request_time+trip_duration and sub-block id to the destination's sub_block id for those accepted trips. To update each driver's locations every day, I have an extra vector of (num_driver) (sub_blocks) to indicate a driver's location. We do matching by taking masks on this vector to get drivers in the same sub-block. And there is another vector of (num_driver) (status) to show if a driver is valid(i.e. not on a trip).

This sounds right. I haven't thought about the matching algo at a granularity where I could debug this, so just give it a shot and let me know if anything trips up.

We sample number of requests and drivers idle_start_time and idle_duration every day based on their rejects during the week, and we do the matching every 30mins to handle each request and set drivers status(idle or busy). I think by doing this way, we simulate the reality quite well and we are able to handle the computing complexity.

Sure, sounds reasonable.

note, if there are so many requests not able to find even a valid driver, we may consider adding more total drivers.

Yes, we will discover that many of these arbitrary values need to be tuned later. We need to anticipate that tuning by making sure things aren't super slow. (We can do a very slow run at the end, when things are properly set.)

liux3372 commented 2 months ago

see the plots below regarding ride requests. Yes, I mean (num_rider_requests) * (request_times, start_locations, end_locations).

rider type

{0: 'commuters_morning',
 1: 'commuters_afternoon',
 2: 'party_goers',
 3: 'sporadic'}

[plots: request-time distributions per rider type]

A rider could have two rides during the morning rush hours (not sure if that makes sense), but it's easier for vector initialization.

liux3372 commented 2 months ago

@Duane321 Let me make a few small fixes as we discussed then create a PR.

I figured out I can make the lambdas of the Poisson sampled from Gamma distributions as conjugate priors for Bayesian updates. We initialize alphas and betas to 1.5 or something based on rider types, and after every day we increment alpha by the daily number of accepts and increment beta by the daily number of total requests.
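A minimal sketch of that Gamma-Poisson scheme; the initial alpha/beta of 1.5, the toy 80% accept rate, and the variable names are illustrative assumptions.

```python
import torch

n_riders = 1000
alpha = torch.full((n_riders,), 1.5)   # per-rider Gamma shape
beta = torch.full((n_riders,), 1.5)    # per-rider Gamma rate

# sample each rider's daily lambda from its Gamma prior, then request counts
lam = torch.distributions.Gamma(alpha, beta).sample()
daily_requests = torch.poisson(lam)
# toy stand-in for the accept/reject outcome of matching
daily_accepts = (torch.rand(n_riders) < 0.8).float() * daily_requests

# end-of-day conjugate-style update per the scheme described above
alpha = alpha + daily_accepts
beta = beta + daily_requests
```

(Strictly, the conjugate Gamma-Poisson update adds the observed counts to alpha and the number of observations to beta; the accepts/requests increments here follow the scheme in the comment rather than the textbook posterior.)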

One thing I notice is that during the matching, for any given time interval and the sub-block, after we filter the requests based on timestamp and location index, we need to iterate through every such request. It will create another loop but usually there aren't many such requests after filtering, just FYI.

Besides, let's say we do a matching from time 0 to 29. If a driver starts in sub-block 0 at t=0, then finishes a trip and is now in sub-block 10 at t=10, it will be counted as a valid driver only in sub-block 0, since we do the matching every 30 minutes. We could use a shorter time interval, but that will increase runtime. Such an issue will always exist unless we check every minute, as a trip could take just one minute. I might tune the interval_time later to balance runtime and supply.

Regarding the parallelization: since matching is a CPU-bound task (rather than an I/O-bound task), multi-processing works better than multi-threading here. However, as processes don't share memory by default, we need to create two vectors as shared tensors (one to record each rider's rejects, another to record each driver's rejects and to update each driver's idle timestamp and new location as we finish every trip). We have to create a lock for synchronizing access to the shared tensors, in something like

import torch
import torch.multiprocessing as mp

def process_function(shared_tensor, lock):
    with lock:                  # synchronize access to the shared tensor
        shared_tensor += 1

if __name__ == "__main__":
    # tensors must be moved to shared memory before being passed to workers
    shared_tensor = torch.zeros(16).share_memory_()
    lock = mp.Lock()

    processes = []
    for i in range(5):
        process = mp.Process(target=process_function, args=(shared_tensor, lock))
        processes.append(process)
        process.start()
    for process in processes:
        process.join()

Not sure if such a lock will slow down the parallelization. I'll start this way, but feel free to advise anything better we can optimize.

Duane321 commented 2 months ago

Thanks @liux3372

This all makes sense. I have some opinions, but I'm not interacting with the simulation, so some of them could be uninformed. I'd rather get to seeing the simulated run first, record the data and make plots. At that moment, it'll be easier for me to spot issues and make suggestions. If it's slow, we can save speedups for later. (E.g. maybe we hold off on parallelization until we understand the bottlenecks better?)

liux3372 commented 2 months ago

@Duane321 pls review #3 when you are available.

At the moment, I created a simulation to record the matched trips happening every day.

What I am going to do next:

What we can do later:

Please lmk if that makes sense, feel free to make comments.