Duane321 / reinforcement_learning_for_rideshare_pricing

MIT License

The Reinforcement Learning #4

Open liux3372 opened 1 month ago

liux3372 commented 1 month ago

The pricing policy has parameters $\theta$, and our goal is to optimize the simulation to produce maximum profit. To do so, we need to calculate the gradient of the objective function (profit) w.r.t. the $\theta$s, then use SGD to update them.

Coding in NumPy to compute gradients:

  1. We need a function that takes the $\theta$s and returns the gradient;
  2. The current parameters are: intercept, ride_time_coeff, ride_distance_coeff;
  3. Later, for a more elaborate pricing policy: vary by region, time of day, day of week, etc.
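As a sketch, the current three-parameter policy is just a linear function of ride features (the parameter values below are made up for illustration):

```python
import numpy as np

def price(theta, ride_time, ride_distance):
    # theta = [intercept, ride_time_coeff, ride_distance_coeff]
    intercept, time_coeff, dist_coeff = theta
    return intercept + time_coeff * ride_time + dist_coeff * ride_distance

theta = np.array([2.0, 0.3, 1.2])  # made-up values
p = price(theta, ride_time=10.0, ride_distance=3.0)
```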

TODOS:

  1. Come up with expressions for the gradients of the pricing policy, then optimize: verify them by freezing all the random seeds so the simulation is deterministic; then do SGD (updates $\Delta\theta$) to move the policy from week to week.
  2. Reparameterize the Poisson distribution via the standard normal distribution: take scatter plots and stretch them to approximate; then check the gradient against finite differences.
  3. Run a simulation and move the coefficients (like online learning); over 100 weeks, weekly average profit should increase on average.
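The loop in item 3 could be sketched as follows. This is a minimal sketch, not the final method: `simulate_week` is a hypothetical stand-in for the real weekly simulation, and the finite-difference gradient is the verification device from item 1, not the analytic gradient.

```python
import numpy as np

def sgd_over_weeks(simulate_week, theta, n_weeks=100, lr=1e-3, eps=1e-4):
    """Gradient ascent on weekly profit.

    simulate_week(theta, rng) -> profit. The rng seed is frozen within
    each gradient estimate so the only thing that changes between the
    two evaluations is theta (central finite differences).
    """
    profits = []
    for week in range(n_weeks):
        rng_seed = week
        grad = np.zeros_like(theta)
        for i in range(len(theta)):
            up, down = theta.copy(), theta.copy()
            up[i] += eps
            down[i] -= eps
            profit_up = simulate_week(up, np.random.default_rng(rng_seed))
            profit_down = simulate_week(down, np.random.default_rng(rng_seed))
            grad[i] = (profit_up - profit_down) / (2 * eps)
        theta = theta + lr * grad  # ascent: we maximize profit
        profits.append(simulate_week(theta, np.random.default_rng(rng_seed)))
    return theta, profits
```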

@Duane321 Please lmk if this makes sense, my questions are:

Duane321 commented 1 month ago

Hey @liux3372, some comments:

do re-parameterization with Poisson by standard normal distribution

I was just giving some intuition there. You can't use 0-mean, 1-variance normal samples to generate Poissons (as far as I know). Here's what you need. The Poisson has a parameter lambda. Given a lambda, we can produce a sample of N values distributed according to Poisson(lambda). What we want is a deterministic function that, when given samples of Poisson(lambda = 1), gives us samples of a Poisson for whatever lambda value is passed to the function. If we have this, then we can compute the derivative of some function of the samples with respect to lambda. This will be necessary because, as we calculate the gradient of profit with respect to the pricing parameters, these values will appear as intermediate values. Make sense?

To answer your other questions: we have 3 parameters. The gradient has three elements; each is the derivative of total profit with respect to one of the 3 pricing parameters. If we call the vector of the 3 pricing parameters theta, then total profit = f(theta, <a bunch of random values that don't depend on the simulation parameters>). For the <a bunch of ...>, you'll want something like a large batch of 0-mean, 1-variance normals and whatever else you need.

The challenge is this: compute d f(theta, ...) / d theta_1, d f(theta, ...) / d theta_2, and d f(theta, ...) / d theta_3. This is tricky because f() represents the big simulation you created. You have to convert it into an analytic expression so that you can calculate the gradient.
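One way to sanity-check an analytic gradient against finite differences, using a small made-up stand-in for f() (the real f() would be the whole simulation; the function below is purely illustrative):

```python
import numpy as np

def f(theta, z):
    # Hypothetical stand-in for the simulation: a differentiable
    # function of theta and frozen random draws z.
    return np.sum(np.exp(theta[0] + theta[1] * z) - theta[2] * z ** 2)

def analytic_grad(theta, z):
    # Hand-derived gradient of f w.r.t. theta
    e = np.exp(theta[0] + theta[1] * z)
    return np.array([np.sum(e), np.sum(e * z), -np.sum(z ** 2)])

def fd_grad(theta, z, eps=1e-6):
    # Central finite differences, one coordinate at a time
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        up, dn = theta.copy(), theta.copy()
        up[i] += eps
        dn[i] -= eps
        g[i] = (f(up, z) - f(dn, z)) / (2 * eps)
    return g
```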

One clue is, the derivative with respect to the intercept (theta_1) is not zero. Think about it: if you increase the flat rate charged on every ride, don't you expect total profit to change?

An alternative is to use PyTorch. That actually might be easier, but you'd have to re-write your code for that.
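A rough sketch of what the PyTorch route looks like. All names and the demand model here are hypothetical; the point is only that autograd differentiates through whatever torch computation produces profit, so no manual derivation is needed:

```python
import torch

torch.manual_seed(0)

def weekly_profit(theta, z):
    # Toy stand-in for the simulation, written in torch ops so that
    # autograd can differentiate profit w.r.t. theta.
    price = theta[0] + theta[1] * z      # z: frozen standard normals
    demand = torch.sigmoid(2.0 - price)  # hypothetical price elasticity
    return (price * demand).sum()

theta = torch.tensor([1.0, 0.5], requires_grad=True)
z = torch.randn(1000)
profit = weekly_profit(theta, z)
profit.backward()  # fills theta.grad with d profit / d theta
```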

Let me know the problems you run into, as you run into them.

Elasticity parameterizes the simulation. To change elasticity is to change f(). We are using elasticities to make sure our simulation is "intuitive" and "realistic".

liux3372 commented 4 weeks ago

@Duane321 "What we want is a deterministic function that, when given samples of Poisson(Lambda = 1), we will get samples of a Poisson for whatever lambda value is given to the function." Do you mean I need a deterministic function to get lambda from samples of a Poisson distribution?

I am working on using finite differences and gradient ascent (since we're trying to maximize total profit) to update the three thetas of the pricing function we discussed above. A few questions came up while I worked on the problem:

here is how I do the theta update, lmk if that makes sense:

```python
prev_week_profit = None
if prev_week_profit:
    gradient = (weekly_profit_sum - prev_week_profit) / (epsilon + lr * gradient)
    print(f'gradient:{gradient}')
    # gradient ascent as we maximize the profit
    T0_pricing_params[0] = T0_pricing_params[0] + lr * gradient
T0_pricing_params[0] += epsilon
prev_week_profit = weekly_profit_sum
```
Duane321 commented 3 weeks ago

I have fixed the random seed in PyTorch and NumPy; is that enough to control the randomness, or do I have to tie the random variables to standard normal variables?

The seed isn't enough. I'll do an example with the normal distribution. Let's say we want to approximate the derivative of E[x^3] with respect to the standard deviation parameter s. (There is an exact analytical way to do this, but we want something that will generalize.) This is called the reparameterization trick. What we do is generate something like 1000 standard normals. Call them z_i's. Now, how do we make x_i's that are distributed with a given mean mu and standard deviation s? We use the relationship x_i = mu + z_i*s.

So how do we approximate d E[x^3] / d s? Let f(mu, s) = (1/N) sum_i (mu + z_i*s)^3. The z_i's are frozen, pre-sampled standard normals; they don't depend on mu or s. Now we just compute d f(mu, s) / d s, and that approximates d E[x^3] / d s.
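The normal example above, as code. This is a sketch; the closed-form value d/ds E[x^3] = 6*mu*s (from E[x^3] = mu^3 + 3*mu*s^2) is used only as a check:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)  # frozen, pre-sampled standard normals

def f(mu, s):
    # Monte Carlo estimate of E[x^3] with x_i = mu + z_i * s
    return np.mean((mu + z * s) ** 3)

def df_ds(mu, s):
    # Differentiate inside the mean; the z_i's don't depend on mu or s
    return np.mean(3 * (mu + z * s) ** 2 * z)

# Exact values for comparison: E[x^3] = mu^3 + 3*mu*s^2, d/ds = 6*mu*s
```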

What I need you to do is figure out how to do this equivalently for the Poisson. I'd start by googling "Reparameterization trick for the Poisson"

Do we compute gradients every week or every day?

Every week. It's after each run of the simulation. Run simulation -> take some gradient steps -> run simulations -> take some gradient steps -> ...

Do we pick one theta to update for every week while keeping other thetas the same?

No, we update the whole theta vector in a gradient step.

Do we do batch gradient ascent every week, or do we do SGD on each ride one at a time?

The gradient step should reference the entire simulation.

here is how I do the theta update, lmk if that makes sense:

The gradient is of the empirical average of profit with respect to theta. You need to differentiate through the entire simulation, either manually with numpy (which will involve some thinking) or by using PyTorch, which will involve re-writing much of the simulation.

You should be able to tell whether this is working or not by running it and seeing profits improving. I might start with that, as it'll be a good test of whether you've derived the gradient correctly.