Problems of MARL - PPO:

Status: Open. Opened by skydvn 1 year ago.

We have not yet implemented training of the Critic for PPO.
We should train the Actor and the Critic separately. More specifically:
- The Actor is trained with the advantage-based policy loss.
- The Critic is trained with the MSE loss between the Q value inferred by the Critic and the true (target) Q.
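A minimal NumPy sketch of the split, assuming the standard PPO formulation: the Actor minimizes the negated clipped surrogate objective (`clip_eps=0.2` is an assumed default), and the Critic minimizes an MSE to a target. Standard PPO usually trains the Critic as a state-value function V(s) regressed onto discounted returns; the hypothetical `compute_returns` helper produces such targets and stands in for the "true Q" above. All function names here are illustrative and are not taken from the repository.

```python
import numpy as np


def compute_returns(rewards, gamma=0.99):
    """Hypothetical helper: discounted returns used as Critic targets."""
    returns = np.zeros_like(rewards, dtype=float)
    running = 0.0
    # Walk backwards so each step accumulates the discounted future reward.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def actor_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss for the Actor (to be minimized)."""
    ratio = np.exp(new_log_probs - old_log_probs)  # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the elementwise minimum; negate it to obtain a loss.
    return -np.mean(np.minimum(unclipped, clipped))


def critic_loss(predicted_values, target_values):
    """MSE between the Critic's predictions and the targets."""
    return np.mean((predicted_values - target_values) ** 2)


# Usage with made-up numbers (illustration only).
rewards = np.array([1.0, 0.0, 1.0])
values = np.array([0.5, 0.4, 0.9])        # Critic predictions
old_log_probs = np.array([-0.7, -1.2, -0.3])
new_log_probs = np.array([-0.6, -1.3, -0.3])

returns = compute_returns(rewards, gamma=0.99)
advantages = returns - values  # treated as constants in the Actor update

l_actor = actor_loss(new_log_probs, old_log_probs, advantages)
l_critic = critic_loss(values, returns)
```

In an autodiff framework, each loss would be backpropagated only through its own network (possibly with separate optimizers), and the advantages would be detached from the Critic's graph so the Actor update does not push gradients into the Critic.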