Duane321 / reinforcement_learning_for_rideshare_pricing

MIT License

The Simulation #1

Open Duane321 opened 2 months ago

Duane321 commented 2 months ago

What we need is a simulation, just like an RL environment. This will be an episodic environment, and it will represent 1 week. It will not take the form of an MDP, as you saw in the video. Rather, we'll be simulating something where the MDP is only implicit. Here are some things to know about the simulation:

Learning

The goal is to optimize reward. Over many simulated weeks, we should see Lyft's reward improve. So f should depend on some parameters theta, and then we should do stochastic gradient descent, a la 'policy gradient methods'. Since there's a lot of sampling in this simulation, gradient-descenting through that might not be easy. To do this, we could 1) use the reparameterization trick (which I can explain) or 2) allow the model to know the drivers and rider elasticity parameters. This way we could construct a differentiable log likelihood that is equivalent in expectation to the reward of the simulation. This is totally cheating, but it's akin to something you might do in reality (e.g. estimate elasticity parameters via experiments and then use this routine, accepting that the routine is just a model of the real environment).
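A minimal sketch of option 2, assuming the model is allowed to know the elasticity parameters: if acceptance probability is a known differentiable function of price, the expected reward becomes differentiable in theta and autograd gives the gradient directly. The sigmoid acceptance model, the elasticity value, and all variable names here are illustrative assumptions, not the project's actual pricing function.

```python
import torch

theta = torch.tensor([1.0], requires_grad=True)  # pricing parameter (price per unit distance)
elasticity = 0.5                                  # assumed known rider elasticity

distances = torch.rand(1000) * 10                 # simulated trip distances
prices = theta * distances                        # toy pricing function f
# toy acceptance model: riders accept less often as price exceeds a baseline
accept_prob = torch.sigmoid(-elasticity * (prices - distances))
expected_revenue = (accept_prob * prices).sum()   # differentiable surrogate reward

expected_revenue.backward()                       # gradient flows through accept_prob
```

An SGD step on `theta` using `theta.grad` would then play the role of the policy-gradient update; the real simulation would replace the toy acceptance model with the estimated elasticity model.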

Other considerations

The simulation needs to be as fast as possible. So try to compute everything you can in large arrays. This might be complicated by some things being necessarily sequential/dependent on the past, like the movement of the driver after a rider is accepted. I could see that being something we remove, for the sake of speed. Also, we should be using PyTorch for autograd.

Bonus stuff

To Do

@liux3372 What would help is for you to tell me how you would do this. Tell me about the Python classes you'd create, what they'd do, and how everything would come together to create a learned policy. I'd like us to converge on an implementation plan before you start executing.

liux3372 commented 2 months ago

Thank you for providing the initial requirements; here are several questions in terms of the setup.

Let’s align on the system design then we will worry about how to do SGD like you mentioned in the Learning section.

Duane321 commented 2 months ago

Hey @liux3372, answering your Q's.

'What are the common practices for the distributions ...?'

We don't need to reference papers for this. Just use your intuition. We could say there are R = 1000 riders. 50% are commuters, 20% are party goers and 30% are sporadic. Let's say rider 'r_abc' is a commuter. Then we have a few random values to determine. First, how many requests did 'r_abc' make in the week? That could be a sample of Poisson(lambda). Then, for each ride request, what time did the ride request appear? Since they are commuters, they should be more likely to take rides between 6-10AM and 4PM-6PM during the week. You could do this by mixing a uniform distribution across all minutes of the week with Gaussians that have humps centered on the commuting hours. Another thing we need a distribution over is the destination. You could make it uniform, but it might be nice to have a circular region in the grid represent 'the city', so commuters are more likely to go toward the city. If the grid has some irregularity (e.g. city and airport), that will make for interesting plots at the end. We are trying to create complexity for the RL agent to learn around.
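The mixture described above can be sketched as follows. All mixture weights, means, and standard deviations here are illustrative assumptions, not tuned values.

```python
import torch

# Sample commuter request times: mix a uniform over the 10,080 minutes of
# the week with Gaussians centered on the two commute windows.
n = 1000
minutes_per_week = 7 * 24 * 60

# component 0 = uniform background, 1 = morning hump, 2 = evening hump
component = torch.multinomial(torch.tensor([0.2, 0.4, 0.4]), n, replacement=True)
uniform = torch.rand(n) * minutes_per_week
morning = torch.normal(8 * 60.0, 60.0, size=(n,))    # hump centered near 8 AM
evening = torch.normal(17 * 60.0, 60.0, size=(n,))   # hump centered near 5 PM
day_offset = torch.randint(0, 7, (n,)) * 24 * 60     # place the hump on a random day

times = torch.where(component == 0, uniform,
        torch.where(component == 1, morning + day_offset, evening + day_offset))
times = times.clamp(0, minutes_per_week - 1)         # keep everything inside the week
```

The same pattern works for party goers (humps on weekend evenings) by swapping the means and weights.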

For drivers, there are no buckets of drivers. Drivers sit in a subgrid (e.g. x=[.1, .2], y=[.4,.5]) and wait for ride requests. They can have a random but not necessarily uniform distribution over the subgrids. Also, their idle-waiting time should be random, like an exponential random variable.
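A small sketch of that driver setup, assuming a 4x4 grid of 16 subgrids; the weights and the 180-minute mean idle time are placeholder values.

```python
import torch

n_drivers = 100
# random but non-uniform distribution over the 16 subgrids
subgrid_weights = torch.rand(16)
subgrid = torch.multinomial(subgrid_weights, n_drivers, replacement=True)

# exponential idle-waiting time with an assumed mean of 180 minutes
mean_idle_minutes = 180.0
idle_duration = torch.distributions.Exponential(1.0 / mean_idle_minutes).sample((n_drivers,))
```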

Shall I sample delta R and delta D every 15 minutes, or ... ?

I see where you're going, but I'd say no. R and D are the number of riders and drivers, so I'm happy to consider those fixed. If prices are too high, that can make riders less likely to produce ride requests (via some distribution parameter mentioned above). If drivers make too little per hour, maybe they will idle for less time? Idk exactly at the moment.

What would be a good number for max_R and max_D at any given time?

I'd say R = 1000 and D = 100 to start is fine. I like how you are thinking in terms of vector ops up front. We will want to pre-compute the simulation as much as possible in batches of random variables.

We may add the current distance between ..

Smart! I hadn't thought about this. We may factor this in later, but I wouldn't worry about it yet. If matches are far apart, drivers will spend time driving without getting paid, which means their earnings per hour will be low. And maybe that will have some feedback on the simulation (e.g. short idle times later in the week)

Note: Any long term effect actually happens within the same week. Riders/Drivers have no memory between weeks, which makes our life easier.

If a rider’s request was rejected by a driver, shall we...

No, I fear that'll be a slow matching operation. Just assume the driver stays idle and the ride request disappears (e.g. they take an Uber ride instead). Whatever you build, the match action should be fast. That's why I'm thinking 'any driver in the subgrid of the ride request'.

If a rider rejects a price, will the rider disappear?

No, the rider doesn't disappear. But maybe the rider will be less likely to produce a ride request later in the week?

Add the rush hour surcharge to encourage drivers?

That could be something we add later, but no need at this point.

Do we use databases (to store trips, riders, drivers etc.), or is that too complicated?

No, the data wouldn't get that big I don't think. Because we will simulate a week, update the policy, throw the data away, re-simulate, update the policy, etc... At the end, we will have a learned policy and then we'll run a few extra simulations to evaluate/inspect its performance.

Some other comments

liux3372 commented 2 months ago

a few things to be mentioned for our catchup:

Things I have done(may do more testing and visualization):

Things to be done next:

Things to be done later:

Duane321 commented 2 months ago

@liux3372 Some follow up things from our conversation:

Duane321 commented 2 months ago

@liux3372 Also, can you create a PR with your code? No need to merge it yet but I'd like to have a look

liux3372 commented 2 months ago

sure, let me create a PR.

I think I now have a better understanding of how to simulate riders and drivers using vectors in a day.

@Duane321 How often should we do the matching within one day, btw? I assume during each time_frame of the day, we will do the matching for each of the 16 squares. In each square, we get the riders and drivers within that square, then we check whether the request_time is within [idle_start_time, idle_start_time + idle_duration]. In addition, we use a set or a binary label to indicate whether the driver is busy. And we do the random assignment of drivers and the trip acceptance test based on the pricing function.

For the next time_frame of the day, we need to update the busy_drivers set or the driver's busy label based on the last trip_creation_time and the trip_duration for each busy driver. Then we continue the matching and pricing accordingly.

We should be prepared for the next day's distributional parameters to be updated:

Yes, I'll have a Poisson parameter for each rider based on their ride_type (determined at the start of each week) and num_rejects (determined at the end of each day). On each day I'll sample whether the rider will make the request based on the daily lambda, and update the lambda based on num_rejects at the end of each day.

Duane321 commented 2 months ago

I see what you're saying. Do the match every 30 minutes.

We can at least sample the demand side in one day. Say D is a pytorch tensor representing all the demand samples of one day. It could have shape (RR, 5), where RR is a random variable giving the number of ride requests for the day. D[:,0] gives all the times of the ride_requests, D[:,1] gives the x_start, D[:,2] gives the y_start, D[:,3] gives the x_end and D[:,4] gives the y_end. We should be able to generate D very quickly, in either one or five calls of a random sampler.
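The demand tensor described above can be generated in a handful of vectorized calls; the Poisson mean for the daily request count is a placeholder assumption.

```python
import torch

# D has shape (RR, 5): columns are (time, x_start, y_start, x_end, y_end).
RR = int(torch.poisson(torch.tensor(2000.0)))   # random number of daily ride requests
minutes_per_day = 24 * 60

D = torch.empty(RR, 5)
D[:, 0] = torch.rand(RR) * minutes_per_day      # request times (uniform placeholder)
D[:, 1:] = torch.rand(RR, 4)                    # start/end positions on the unit grid
D = D[D[:, 0].argsort()]                        # sort by time for half-hour slicing
```

Sorting by time means each half-hour interval is a contiguous slice of D, which keeps the per-interval matching step cheap.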

Similarly, say S0 gives the supply information at the start of the 0-th half hour interval. So we have S0, ..., S47 for the day. These S's, as you were anticipating, need to be produced in sequence. Also, S0's first dimension should have length 100, since that's how many drivers there are. S1 will have a shorter first dimension, since some drivers will be busy. S2 will have fewer than 100 rows, but it could be more than S1, because some drivers become available again.

Matching should happen at the start of each half hour interval, like M0 = match(D, S0). For this, I'd discretize all ride request starting positions to 0, ..., 15 to represent their grid cell. Do the same for all driver idling positions. Then I think we'd have to loop over each grid cell and then perform the match. The match operations should all be vectorized. That's what was afforded to us by doing the discretization.
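The discretization step can be sketched like this; the toy D and S0 tensors and their column layout are assumptions for illustration only.

```python
import torch

def cell_id(x, y, k=4):
    # map unit-square positions to cell ids 0..15 on a k x k grid;
    # clamp so that a coordinate of exactly 1.0 lands in the last column
    cx = (x * k).long().clamp(max=k - 1)
    cy = (y * k).long().clamp(max=k - 1)
    return cy * k + cx

# toy stand-ins for the demand and supply tensors; layouts are illustrative
D = torch.rand(500, 5)    # (time, x_start, y_start, x_end, y_end)
S0 = torch.rand(100, 3)   # (driver_id, x_idle, y_idle)

req_cells = cell_id(D[:, 1], D[:, 2])     # grid cell of each ride request
drv_cells = cell_id(S0[:, 1], S0[:, 2])   # grid cell of each idle driver

for c in range(16):
    req_idx = (req_cells == c).nonzero(as_tuple=True)[0]
    drv_idx = (drv_cells == c).nonzero(as_tuple=True)[0]
    # the vectorized match for cell c would pair req_idx with drv_idx here
```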

I'll leave it to you to figure out how M0 impacts S1. That is, how matches in the first half hour impact the available drivers in the second half hour. It's important that M0 impacts where the drivers are idle in S1. This is where we get some cross-cell dependency.

We should be aware of a few loops we have:

Other thoughts:

Hopefully this makes sense. It's always possible there's some crucial detail of the simulation I'm not anticipating.

liux3372 commented 2 months ago

@Duane321 Apologies if this is too long.

To give a mid-week update and see if I'm doing the right things, here is how I simulate riders and driver and do the matching. Feel free to review and comment.

Riders are simulated as a torch vector of (num_riders) × (request_times, start_locations, end_locations). Each start_location is an (x, y) position sampled uniformly on the 0-1 grid. (Let's do the cities and airports later.)

Each rider will have a lambda for a Poisson distribution to sample the number of requests every day. Those lambdas will be updated based on the number of rejects riders experienced during the week.

Each request_time is sampled either from a normal distribution with mean at the morning peak hours, afternoon peak hours, or evening party hours, or uniformly, to represent the different rider types. For simplicity, I take 250 riders per type (i.e. the first 250 rows in the vector are the riders whose ride times are normally distributed with mean 8 * 60 and std of 60 minutes).

Drivers are simulated as a torch vector of (num_drivers) × (idle_start_time, idle_duration, driver_positions). Each idle_start_time is sampled with 80% probability on daytime (7 AM to 10 PM) and 20% on the rest, with each hour within a window picked equally likely. Each idle_duration is sampled from an exponential distribution with 1.0 / mean_idle_time as the parameter, where mean_idle_time is initially set to 180 minutes. Those mean_idle_times will be updated based on the number of rejects drivers experienced during the week. Each driver is placed on a sub-block (0 to 15) uniformly, and each driver's initial position is randomly assigned within its sub-block.

In terms of matching (work in progress), for every 30-min interval, we iterate through each sub-block. For rider requests, we mask on the request timestamp, which has to stay within [idle_start_time, idle_start_time + idle_duration], and on the sub-block id; for drivers, we randomly pick a valid driver (if no valid driver, cancel the request right away), do the matching (based on acceptance probs by both parties), and for accepted trips update their idle_start_time to request_time + trip_duration and their sub-block id to the destination's sub-block id. To update each driver's location every day, I have an extra vector of (num_drivers) × (sub_blocks) to indicate a driver's location. We do matching by taking masks on this vector to get drivers in the same sub-block. And there is another vector of (num_drivers) × (status) to show if a driver is valid (i.e. not on a trip).

**If we have to loop over these half-hour intervals, then the days don't afford us anything. We might as well loop over 48 × 7 time windows and update everything.** We sample the number of requests and the drivers' idle_start_time and idle_duration every day based on their rejects during the week, and we do the matching every 30 mins to handle each request and set driver status (idle or busy). I think by doing it this way, we simulate reality quite well and we are able to handle the computing complexity.

I'll keep working on how to parallelize on the sub-blocks.

Note: if many requests are not able to find even one valid driver, we may consider adding more total drivers.

Duane321 commented 2 months ago

Hey, good work! Just some comments:

Riders are simulated as a torch vector of (num_riders) * (request_times, start_locations, end_locations)

Do you mean ride requests? We need to keep separate the idea of the pool of riders and the requests they produce.

Each start_locations is a (x, y) position sampled uniformly on the 0-1 grid. (Let's do the cities and airports later)

Yes!

Each rider will have a lambda for a Poisson distribution to sample the number of requests every day. Those lambdas will be updated based on the number of rejects happened during the week by riders.

Yes!

Each request_time is sampled either from a normal distribution with mean of the morning peak hours, afternoon peak hours, evening party hours or uniformly to represent different rider types. For simplicity, I take 250 riders each type (i.e. the first 250 rows in the vector are the riders who will take rides on a normal distributed time with mean of 8 * 60 and std of 60 minutes).

This sounds right. It'll help me to see this in a plot. For a type of rider, show me their distribution of requests over the day. That is, if this rider makes a request, what time is it likely to be? Note that this is conditional: given that a request is made, when does it happen? How many requests happen is dictated by that Poisson, as you mentioned.

Drivers are simulated as a torch vector of (num_drivers) * (idle_start_time, idle_duration, driver_positions) Each idle_start_time is sampled with 80% on daytime (7 AM to 10 PM) and 20% the rest of time, with each hour will be picked equally likely. Each idle_duration is sampled from an exp distribution with 1.0 / mean_idle_time as the parameter where mean_idle_time is initially set to 180 minutes. Those mean_idle_times will be updated based on the number of rejects happened during the week by drivers. Each driver_positions are placed on the sub-block 0 to 15 uniformly and each driver's initial position will be randomly assigned within each sub-block.

Great, yes!

In terms of matching(work-in-progress), for every 30-min interval, we iterate through each sub-block. For rider requests, we mask on the request timestamp, which has to stay within [idle_start_time, idle_start_time+idle_duration] and the sub-block id; for drivers, we randomly pick a valid driver(if no valid driver, cancel the request right away), do the matching(based on acceptance probs by both parties), and update their idle_start_time to request_time+trip_duration and sub-block id to the destination's sub_block id for those accepted trips. To update each driver's locations every day, I have an extra vector of (num_driver) (sub_blocks) to indicate a driver's location. We do matching by taking masks on this vector to get drivers in the same sub-block. And there is another vector of (num_driver) (status) to show if a driver is valid(i.e. not on a trip).

This sounds right. I haven't thought about the matching algo at a granularity where I could debug this, so just give it a shot and let me know if anything trips up.

We sample number of requests and drivers idle_start_time and idle_duration every day based on their rejects during the week, and we do the matching every 30mins to handle each request and set drivers status(idle or busy). I think by doing this way, we simulate the reality quite well and we are able to handle the computing complexity.

Sure, sounds reasonable.

note, if there are so many requests not able to find even a valid driver, we may consider adding more total drivers.

Yes, we will discover that many of these arbitrary values need to be tuned later. We need to anticipate that tuning by making sure things aren't super slow. (We can do a very slow run at the end, when things are properly set.)

liux3372 commented 2 months ago

see the plots below regarding ride requests. Yes, I mean (num_rider_requests) * (request_times, start_locations, end_locations).

rider type

{0: 'commuters_morning',
 1: 'commuters_afternoon',
 2: 'party_goers',
 3: 'sporadic'}

[plots: request-time distributions per rider type]

A rider could have two rides during the morning rush hours (not sure if that makes sense), but it's easier for vector initialization.

liux3372 commented 2 months ago

@Duane321 Let me make a few small fixes as we discussed then create a PR.

I figured out I can make the lambdas of the Poisson sampled from Gamma distributions as conjugate priors for Bayesian updates. We initialize alphas and betas to 1.5 or something based on rider types, and after every day we increment alpha by the daily number of accepts and increment beta by the daily number of total requests.
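A minimal sketch of that Gamma-Poisson scheme; the initial alpha/beta of 1.5, the toy 80% accept rate, and the variable names are illustrative assumptions.

```python
import torch

n_riders = 1000
alpha = torch.full((n_riders,), 1.5)   # per-rider Gamma shape
beta = torch.full((n_riders,), 1.5)    # per-rider Gamma rate

# sample each rider's daily lambda from its Gamma prior, then request counts
lam = torch.distributions.Gamma(alpha, beta).sample()
daily_requests = torch.poisson(lam)
# toy stand-in for the accept/reject outcome of matching
daily_accepts = (torch.rand(n_riders) < 0.8).float() * daily_requests

# end-of-day conjugate-style update per the scheme described above
alpha = alpha + daily_accepts
beta = beta + daily_requests
```

(Strictly, the conjugate Gamma-Poisson update adds the observed counts to alpha and the number of observations to beta; the accepts/requests increments here follow the scheme in the comment rather than the textbook posterior.)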

One thing I notice is that during the matching, for any given time interval and the sub-block, after we filter the requests based on timestamp and location index, we need to iterate through every such request. It will create another loop but usually there aren't many such requests after filtering, just FYI.

Besides, let's say we do a matching from time 0 to 29. If a driver starts in sub-block 0 at t=0, then finishes a trip and is now in sub-block 10 at t=10, it will be counted as a valid driver only in sub-block 0, since we do the matching every 30 minutes. We could use a shorter time interval, but that will increase runtime. Such an issue will always exist unless we check every minute, as a trip could take just one minute. I might tune the interval_time later to balance runtime and supply.

Regarding the parallelization: since matching is a CPU-bound task (rather than an I/O-bound task), multi-processing works better than multi-threading here. However, as processes don't share memory by default, we need to create two vectors as shared tensors (one to record each rider's rejects, another to record each driver's rejects and to update each driver's idle timestamp and new location as we finish every trip). We have to create a lock for synchronizing access to the shared tensors, in something like

import torch
import torch.multiprocessing as mp

def process_function(shared_tensor, lock):
    with lock:                  # synchronize access to the shared tensor
        shared_tensor += 1

if __name__ == "__main__":
    # tensors must be moved to shared memory before being passed to workers
    shared_tensor = torch.zeros(16).share_memory_()
    lock = mp.Lock()

    processes = []
    for i in range(5):
        process = mp.Process(target=process_function, args=(shared_tensor, lock))
        processes.append(process)
        process.start()
    for process in processes:
        process.join()

Not sure if such a lock will slow down the parallelization. I'll start this way, but feel free to advise anything better we can optimize.

Duane321 commented 2 months ago

Thanks @liux3372

This all makes sense. I have some opinions, but I'm not interacting with the simulation, so some of them could be uninformed. I'd rather get to seeing the simulated run first, record the data and make plots. At that moment, it'll be easier for me to spot issues and make suggestions. If it's slow, we can save speedups for later. (E.g. maybe we hold off on parallelization until we understand the bottlenecks better?)

liux3372 commented 2 months ago

@Duane321 pls review #3 when you are available.

At the moment, I created a simulation to record the matched trips happening every day.

What I am going to do next:

What we can do later:

Please lmk if that makes sense, feel free to make comments.