Open Alessiobrini opened 3 years ago
Following up on this paper. At first glance, we should:
Possible issues:
I noticed that the log loss, as described in the paper, assumes that you use your actor again to compute the actions for the states in the batch. I can do the same by doing another forward pass over the Qnet.
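A minimal sketch of what that extra forward pass could look like, assuming a hypothetical `q_net` callable that maps a batch of states to per-action Q-values (the names and shapes here are illustrative, not the repo's actual API):

```python
import numpy as np

def greedy_actions(q_net, states):
    """Recompute the batch actions with an extra forward pass over
    the Q-network: take the argmax over the per-action Q-values."""
    q_values = q_net(states)            # shape: (batch, n_actions)
    return np.argmax(q_values, axis=1)  # greedy action per state

# toy Q-net: a fixed linear map over 3 actions, just for illustration
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
states = rng.normal(size=(8, 4))
actions = greedy_actions(lambda s: s @ W, states)
print(actions.shape)  # (8,)
```

These recomputed actions can then be fed into the log loss in place of the actions stored in the replay batch.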
There are two ways to inject this knowledge into the training process:
Then the choice of the loss determines what we want to do exactly:
Currently implemented: the MSE loss is added to the total loss with a rescaling factor. Now in the testing phase.
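As a sketch of the currently implemented variant, the total objective could combine the algorithm's own loss with a rescaled MSE imitation term toward the expert's actions (here `beta` stands in for the rescaling factor, which is an assumed name, not necessarily the one used in the code):

```python
import numpy as np

def combined_loss(rl_loss, agent_actions, expert_actions, beta=0.5):
    """Total loss = RL loss + beta * MSE(agent actions, expert actions).
    `beta` rescales the imitation term relative to the RL objective."""
    mse = np.mean((agent_actions - expert_actions) ** 2)
    return rl_loss + beta * mse

# toy example: RL loss of 1.0, small deviation from the expert
total = combined_loss(1.0, np.array([0.2, 0.4]), np.array([0.0, 0.0]))
print(total)  # 1.0 + 0.5 * 0.1 = 1.05
```

Swapping the MSE term for the paper's log loss would only change the imitation term, not the overall structure.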
Also added the implementation to the MisspecDQN case.
Still needs to be added to the PPO algorithm.
Insert a module after the algorithm's loss computation that perturbs the parameters by doing behavioral cloning from an expert (the Garleanu and Pedersen solution).
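The behavioral-cloning perturbation could be sketched as a single gradient step on the MSE between the (here, assumed linear) policy's actions and the expert's actions, applied after the main update; the linear policy, the step size `lr`, and the expert target are all illustrative assumptions, with the expert actions standing in for the Garleanu-Pedersen closed-form solution:

```python
import numpy as np

def bc_step(params, states, expert_actions, lr=1e-2):
    """One behavioral-cloning gradient step that nudges the parameters
    of a linear policy a = states @ params toward the expert's actions,
    by descending the MSE between the two."""
    pred = states @ params
    grad = 2.0 * states.T @ (pred - expert_actions) / len(states)
    return params - lr * grad

# toy check: the step moves a zero-initialized policy toward the expert
rng = np.random.default_rng(1)
S = rng.normal(size=(16, 3))
expert_W = np.array([1.0, -2.0, 0.5])      # stand-in for the expert solution
target = S @ expert_W
p0 = np.zeros(3)
p1 = bc_step(p0, S, target)
mse0 = np.mean((S @ p0 - target) ** 2)
mse1 = np.mean((S @ p1 - target) ** 2)
print(mse1 < mse0)  # True
```

Plugging this in after the loss computation, rather than into the loss itself, keeps the perturbation separate from the algorithm's own gradient step.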