Open Alessiobrini opened 3 years ago
Following up on this paper. At first glance, we should:
Possible issues:
I noticed that the log loss, as described in the paper, assumes that you use your actor again to compute the actions for the states in the batch. I can do the same by doing another forward pass over the Qnet.
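A minimal sketch of what that extra forward pass could look like, assuming a hypothetical `q_net` callable that maps a batch of states to per-action Q-values (the names and shapes here are illustrative, not the repo's actual API):

```python
import numpy as np

def greedy_actions(q_net, states):
    """Recompute the batch actions with an extra forward pass over
    the Q-network: take the argmax over the per-action Q-values."""
    q_values = q_net(states)            # shape: (batch, n_actions)
    return np.argmax(q_values, axis=1)  # greedy action per state

# toy Q-net: a fixed linear map over 3 actions, just for illustration
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
states = rng.normal(size=(8, 4))
actions = greedy_actions(lambda s: s @ W, states)
print(actions.shape)  # (8,)
```

These recomputed actions can then be fed into the log loss in place of the actions stored in the replay batch.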
There are two ways to inject this knowledge into the training process:
Then the choice of the loss determines what we want to do exactly:
Currently implemented: the MSE loss is added to the total loss with a rescaling factor. Now in the testing phase.
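As a sketch of the currently implemented variant, the total objective could combine the algorithm's own loss with a rescaled MSE imitation term toward the expert's actions (here `beta` stands in for the rescaling factor, which is an assumed name, not necessarily the one used in the code):

```python
import numpy as np

def combined_loss(rl_loss, agent_actions, expert_actions, beta=0.5):
    """Total loss = RL loss + beta * MSE(agent actions, expert actions).
    `beta` rescales the imitation term relative to the RL objective."""
    mse = np.mean((agent_actions - expert_actions) ** 2)
    return rl_loss + beta * mse

# toy example: RL loss of 1.0, small deviation from the expert
total = combined_loss(1.0, np.array([0.2, 0.4]), np.array([0.0, 0.0]))
print(total)  # 1.0 + 0.5 * 0.1 = 1.05
```

Swapping the MSE term for the paper's log loss would only change the imitation term, not the overall structure.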
Also added the implementation to the MisspecDQN case.
Still needs to be added to the PPO algorithm.
Insert a module after the algorithm's loss computation that perturbs the parameters by doing behavioral cloning from an expert (the Garleanu and Pedersen solution).
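The behavioral-cloning perturbation could be sketched as a single gradient step on the MSE between the (here, assumed linear) policy's actions and the expert's actions, applied after the main update; the linear policy, the step size `lr`, and the expert target are all illustrative assumptions, with the expert actions standing in for the Garleanu-Pedersen closed-form solution:

```python
import numpy as np

def bc_step(params, states, expert_actions, lr=1e-2):
    """One behavioral-cloning gradient step that nudges the parameters
    of a linear policy a = states @ params toward the expert's actions,
    by descending the MSE between the two."""
    pred = states @ params
    grad = 2.0 * states.T @ (pred - expert_actions) / len(states)
    return params - lr * grad

# toy check: the step moves a zero-initialized policy toward the expert
rng = np.random.default_rng(1)
S = rng.normal(size=(16, 3))
expert_W = np.array([1.0, -2.0, 0.5])      # stand-in for the expert solution
target = S @ expert_W
p0 = np.zeros(3)
p1 = bc_step(p0, S, target)
mse0 = np.mean((S @ p0 - target) ** 2)
mse1 = np.mean((S @ p1 - target) ** 2)
print(mse1 < mse0)  # True
```

Plugging this in after the loss computation, rather than into the loss itself, keeps the perturbation separate from the algorithm's own gradient step.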