Open Yuqing0127 opened 3 years ago
I beg your pardon?
The problem is that the optimal critic network weights are fixed, but the actual critic network weights do not converge to them. Moreover, different initial guesses for the critic weights lead to different final critic weights. This means that we cannot obtain the optimal control policy with this code or this method.
I agree with you.
I guess algorithms of this kind all share the same problem. Adding persistent noise is not enough to make the weights converge to the correct values, and the formal proof given in the paper neglects the practical techniques needed to ensure the right solution.
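To make the dependence on the initial guess concrete, here is a toy sketch. It is not the repository's code or the paper's algorithm, only a hypothetical linear-in-parameters critic fitted by a normalized gradient rule. The feature map, the "true" weights `W_TRUE`, and both sets of sampled states are assumptions made up for illustration. When the sampled states excite only one direction of the feature space, the fit error goes to zero but the final weights depend on where they started. With states that span the feature space, both starts reach the same weights.

```python
# Toy illustration (hypothetical setup, not the repository's algorithm):
# a critic V(x) = w . phi(x) fitted by a normalized gradient update.

W_TRUE = [1.0, 0.5, 2.0]  # assumed "optimal" critic weights for this toy


def features(x1, x2):
    """Quadratic feature vector phi(x) = [x1^2, x1*x2, x2^2]."""
    return [x1 * x1, x1 * x2, x2 * x2]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def train(w0, states, epochs=2000, lr=0.5):
    """Sweep over the states repeatedly, applying a normalized gradient step."""
    w = list(w0)
    for _ in range(epochs):
        for x1, x2 in states:
            phi = features(x1, x2)
            err = dot(W_TRUE, phi) - dot(w, phi)      # value prediction error
            scale = lr * err / (1.0 + dot(phi, phi))  # normalized step size
            w = [wi + scale * pi for wi, pi in zip(w, phi)]
    return w


# Poorly exciting data: every state lies on the line x2 = x1, so every
# feature vector is a multiple of [1, 1, 1] (a rank-1 regressor set).
poor_states = [(0.5, 0.5), (1.0, 1.0), (1.5, 1.5)]

# Richer data: the feature vectors span all three weight directions.
rich_states = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0)]

init_a = [0.0, 0.0, 0.0]
init_b = [1.0, 0.0, 0.0]

poor_a = train(init_a, poor_states)
poor_b = train(init_b, poor_states)
rich_a = train(init_a, rich_states)
rich_b = train(init_b, rich_states)

print("poor data, init a:", [round(v, 4) for v in poor_a])
print("poor data, init b:", [round(v, 4) for v in poor_b])
print("rich data, init a:", [round(v, 4) for v in rich_a])
print("rich data, init b:", [round(v, 4) for v in rich_b])
```

In the poorly excited case both runs reproduce the values on the sampled states, so a small training error alone does not show that the weights are the optimal ones. Whether the probing noise in the actual code gives enough excitation would have to be checked separately, for example by looking at the rank or conditioning of the collected regressors.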
I have the same problem as described in the first post.