Duane321 / reinforcement_learning_for_rideshare_pricing

MIT License

The Reinforcement Learning #4

Open liux3372 opened 1 month ago

liux3372 commented 1 month ago

The pricing policy has parameters $\theta$, and our goal is to optimize the simulation to produce maximum profit. To do so, we need to calculate the gradient of the objective function (profit) w.r.t. the $\theta$s, then use SGD to update them.

Coding in NumPy to compute gradients:

  1. We need a function that takes the $\theta$s and returns the gradient;
  2. The current parameters are: intercept, ride_time_coeff, ride_distance_coeff;
  3. Later, for a more elaborate pricing policy: vary by region, time of day, day of week, etc.
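As a sketch, the current three-parameter policy is just a linear function of ride features (the parameter values below are made up for illustration):

```python
import numpy as np

def price(theta, ride_time, ride_distance):
    # theta = [intercept, ride_time_coeff, ride_distance_coeff]
    intercept, time_coeff, dist_coeff = theta
    return intercept + time_coeff * ride_time + dist_coeff * ride_distance

theta = np.array([2.0, 0.3, 1.2])  # made-up values
p = price(theta, ride_time=10.0, ride_distance=3.0)
```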

TODOS:

  1. Come up with expressions for the gradients of the pricing policy, then optimize: verify them by freezing all the random seeds so the simulation is deterministic; then do SGD (updates $\Delta\theta$) to move the policy from week to week.
  2. Reparameterize the Poisson distribution via the standard normal distribution: take scatter plots and stretch them to approximate; then check the gradient against finite differences.
  3. Run a simulation and move the coefficients (like online learning); over 100 weeks, weekly average profit should increase on average.
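The loop in item 3 could be sketched as follows. This is a minimal sketch, not the final method: `simulate_week` is a hypothetical stand-in for the real weekly simulation, and the finite-difference gradient is the verification device from item 1, not the analytic gradient.

```python
import numpy as np

def sgd_over_weeks(simulate_week, theta, n_weeks=100, lr=1e-3, eps=1e-4):
    """Gradient ascent on weekly profit.

    simulate_week(theta, rng) -> profit. The rng seed is frozen within
    each gradient estimate so the only thing that changes between the
    two evaluations is theta (central finite differences).
    """
    profits = []
    for week in range(n_weeks):
        rng_seed = week
        grad = np.zeros_like(theta)
        for i in range(len(theta)):
            up, down = theta.copy(), theta.copy()
            up[i] += eps
            down[i] -= eps
            profit_up = simulate_week(up, np.random.default_rng(rng_seed))
            profit_down = simulate_week(down, np.random.default_rng(rng_seed))
            grad[i] = (profit_up - profit_down) / (2 * eps)
        theta = theta + lr * grad  # ascent: we maximize profit
        profits.append(simulate_week(theta, np.random.default_rng(rng_seed)))
    return theta, profits
```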

@Duane321 Please lmk if this makes sense, my questions are:

Duane321 commented 1 month ago

Hey @liux3372, some comments:

do re-parameterization with Poisson by standard normal distribution

I was just giving some intuition there. You can't use 0-mean, 1-variance normal samples to generate Poissons (as far as I know). Here's what you need. The Poisson has a parameter lambda. Given a lambda, we can produce a sample of N values distributed according to Poisson(lambda). What we want is a deterministic function that, when given samples of Poisson(lambda = 1), gives us samples of a Poisson for whatever lambda value is passed to the function. If we have this, then we can compute the derivative of some function of the samples with respect to lambda. This will be necessary because, as we calculate the gradient of profit with respect to the pricing parameters, these values will appear as intermediate values. Make sense?

To answer your other questions: we have 3 parameters. The gradient has three elements; each is the derivative of total profit with respect to one of the 3 pricing parameters. If we call the vector of the 3 pricing parameters theta, then total profit = f(theta, <a bunch of random values that don't depend on the simulation parameters>). For the <a bunch of ...>, you'll want something like a large batch of 0-mean, 1-variance normals and whatever else you need.

The challenge is this: compute d f(theta, ...) / d theta_1, d f(theta, ...) / d theta_2, and d f(theta, ...) / d theta_3. This is tricky because f() represents the big simulation you created. You have to convert it into an analytic expression so that you can calculate the gradient.
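One way to sanity-check an analytic gradient against finite differences, using a small made-up stand-in for f() (the real f() would be the whole simulation; the function below is purely illustrative):

```python
import numpy as np

def f(theta, z):
    # Hypothetical stand-in for the simulation: a differentiable
    # function of theta and frozen random draws z.
    return np.sum(np.exp(theta[0] + theta[1] * z) - theta[2] * z ** 2)

def analytic_grad(theta, z):
    # Hand-derived gradient of f w.r.t. theta
    e = np.exp(theta[0] + theta[1] * z)
    return np.array([np.sum(e), np.sum(e * z), -np.sum(z ** 2)])

def fd_grad(theta, z, eps=1e-6):
    # Central finite differences, one coordinate at a time
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        up, dn = theta.copy(), theta.copy()
        up[i] += eps
        dn[i] -= eps
        g[i] = (f(up, z) - f(dn, z)) / (2 * eps)
    return g
```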

One clue is, the derivative with respect to the intercept (theta_1) is not zero. Think about it: if you increase the flat rate charged on every ride, don't you expect total profit to change?

An alternative is to use PyTorch. That actually might be easier, but you'd have to re-write your code for that.
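A rough sketch of what the PyTorch route looks like. All names and the demand model here are hypothetical; the point is only that autograd differentiates through whatever torch computation produces profit, so no manual derivation is needed:

```python
import torch

torch.manual_seed(0)

def weekly_profit(theta, z):
    # Toy stand-in for the simulation, written in torch ops so that
    # autograd can differentiate profit w.r.t. theta.
    price = theta[0] + theta[1] * z      # z: frozen standard normals
    demand = torch.sigmoid(2.0 - price)  # hypothetical price elasticity
    return (price * demand).sum()

theta = torch.tensor([1.0, 0.5], requires_grad=True)
z = torch.randn(1000)
profit = weekly_profit(theta, z)
profit.backward()  # fills theta.grad with d profit / d theta
```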

Let me know the problems you run into, as you run into them.

Elasticity parameterizes the simulation. To change elasticity is to change f(). We are using elasticities to make sure our simulation is "intuitive" and "realistic".

liux3372 commented 4 weeks ago

@Duane321 "What we want is a deterministic function that, when given samples of Poisson(Lambda = 1), we will get samples of a Poisson for whatever lambda value is given to the function." Do you mean I need a deterministic function to get lambda from samples of a Poisson distribution?

I am working on using finite differences and gradient ascent (since we're trying to maximize total profit) to update the three thetas of the pricing function we discussed above. A few questions came up while I worked on the problem:

here is how I do the theta update, lmk if that makes sense:

```python
prev_week_profit = None
if prev_week_profit:
    gradient = (weekly_profit_sum - prev_week_profit) / (epsilon + lr * gradient)
    print(f'gradient:{gradient}')
    # gradient ascent as we maximize the profit
    T0_pricing_params[0] = T0_pricing_params[0] + lr * gradient
T0_pricing_params[0] += epsilon
prev_week_profit = weekly_profit_sum
```
Duane321 commented 3 weeks ago

I have fixed the random seed in PyTorch and NumPy; is that enough to control the randomness, or do I have to tie the random variables to standard normal variables?

The seed isn't enough. I'll do an example with the normal distribution. Let's say we want to approximate the derivative of E[x^3] with respect to the standard deviation parameter s. (There is an exact analytical way to do this, but we want something that will generalize.) This is called the reparameterization trick. What we do is generate something like 1000 standard normals. Call them z_i's. Now, how do we make x_i's that are distributed with a given mean mu and standard deviation s? We use the relationship x_i = mu + z_i*s.

So how do we approximate d E[x^3] / d s? Let f(mu, s) = (1/N) sum_i (mu + z_i*s)^3. The z_i's are frozen, pre-sampled standard normals; they don't depend on mu or s. Now we just compute d f(mu, s) / d s, and that approximates d E[x^3] / d s.
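The normal example above, as code. This is a sketch; the closed-form value d/ds E[x^3] = 6*mu*s (from E[x^3] = mu^3 + 3*mu*s^2) is used only as a check:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)  # frozen, pre-sampled standard normals

def f(mu, s):
    # Monte Carlo estimate of E[x^3] with x_i = mu + z_i * s
    return np.mean((mu + z * s) ** 3)

def df_ds(mu, s):
    # Differentiate inside the mean; the z_i's don't depend on mu or s
    return np.mean(3 * (mu + z * s) ** 2 * z)

# Exact values for comparison: E[x^3] = mu^3 + 3*mu*s^2, d/ds = 6*mu*s
```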

What I need you to do is figure out how to do this equivalently for the Poisson. I'd start by googling "Reparameterization trick for the Poisson"

Do we compute gradients every week or every day?

Every week. It's after each run of the simulation. Run simulation -> take some gradient steps -> run simulations -> take some gradient steps -> ...

Do we pick one theta to update for every week while keeping other thetas the same?

No, we update the whole theta vector in a gradient step.

Do we do batch gradient ascent every week, or do we do SGD on each ride one at a time?

The gradient step should reference the entire simulation.

here is how I do the theta update, lmk if that makes sense:

The gradient is of the empirical average of profit with respect to theta. You need to differentiate through the entire simulation, either manually with numpy (which will involve some thinking) or by using PyTorch, which will involve re-writing much of the simulation.

You should be able to tell whether this is working or not by running it and seeing profits improving. I might start with that, as it'll be a good test of whether you've derived the gradient correctly.