detlefarend opened this issue 2 years ago
Hi colleagues, DP is done! Now you can try to solve it. Good luck and have fun :)
Hi colleagues, I have thought about yesterday's discussion. It's simply odd that no SB3 algorithm performs here. I propose a further howto 22 with the same env and algo, but in the setup of a Gym-wrapped DP env and a native SB3 algo without the SB3 wrapper. The same algo should perform the same way in the same environment. I'll add a task for it...
What have you tried here? Which algorithm? Which configuration?
See howto rl-021 in branch dp_env_solution. Steve and Laxmikant tried various sb3 algos but nothing really works.
Before doing the stuff below, I have a question: where is the torque applied? If it is on the first link, then ignore the comparison with Acrobot-v1. But you could try the equations from http://incompleteideas.net/book/11/node4.html and apply the torque to theta1 instead. For a better description see https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf, page 283.
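For reference, a minimal sketch of those equations with an option to move the torque from the second joint to the first one. The parameters are the Acrobot chapter's values used as placeholders, not our DP configuration:

```python
import numpy as np

# Placeholder parameters (Acrobot book values), NOT the DP env configuration.
m1 = m2 = 1.0        # link masses
l1 = 1.0             # length of link 1
lc1 = lc2 = 0.5      # distances to the centres of mass
I1 = I2 = 1.0        # link moments of inertia
g = 9.8

def accels(th1, th2, dth1, dth2, tau, torque_on_first_joint=True):
    """Angular accelerations of the two-link pendulum, following the equations
    at http://incompleteideas.net/book/11/node4.html. There the torque acts on
    the second joint; torque_on_first_joint=True moves it to joint 1 instead."""
    d1 = m1 * lc1**2 + m2 * (l1**2 + lc2**2 + 2 * l1 * lc2 * np.cos(th2)) + I1 + I2
    d2 = m2 * (lc2**2 + l1 * lc2 * np.cos(th2)) + I2
    phi2 = m2 * lc2 * g * np.cos(th1 + th2 - np.pi / 2)
    phi1 = (-m2 * l1 * lc2 * dth2**2 * np.sin(th2)
            - 2 * m2 * l1 * lc2 * dth2 * dth1 * np.sin(th2)
            + (m1 * lc1 + m2 * l1) * g * np.cos(th1 - np.pi / 2)
            + phi2)
    # Manipulator form M(q) * ddq = tau_vec - bias(q, dq), solved for both joints
    M = np.array([[d1, d2],
                  [d2, m2 * lc2**2 + I2]])
    bias = np.array([phi1,
                     m2 * l1 * lc2 * dth1**2 * np.sin(th2) + phi2])
    tau_vec = np.array([tau, 0.0]) if torque_on_first_joint else np.array([0.0, tau])
    return np.linalg.solve(M, tau_vec - bias)

# Open-loop plausibility check: apply a constant torque for T steps and inspect
# the resulting angles/omegas (the kind of check meant in the list below).
th1 = th2 = dth1 = dth2 = 0.0
dt, T, tau = 0.05, 100, 1.0
for _ in range(T):
    ddth1, ddth2 = accels(th1, th2, dth1, dth2, tau)
    dth1, dth2 = dth1 + ddth1 * dt, dth2 + ddth2 * dt   # simple Euler step
    th1, th2 = th1 + dth1 * dt, th2 + dth2 * dt
print(th1, th2, dth1, dth2)
```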
I would suggest starting like this. @steveyuwono @laxmikantbaheti Please check the following:
- [ ] Check the whole calculation again, especially with the torque included. For example, if you apply torque X for T time steps, does it result in the correct values (angle 1, angle 2, omega 1, omega 2)? Do this without any normalization, using the pure system equations. You can also compare with http://incompleteideas.net/book/11/node4.html
  - [ ] Remove the normalization
  - [ ] Check the system equations
  - [ ] Check the timestep calculation (as explained above, adjust it according to the max cycle limit)
  - [ ] Check the torque (torque range)
  - [ ] Try to train it (without normalization)
- [ ] If the above works, add the normalization back in.
- [ ] Compare this with the Acrobot-v1 Gym environment (basically this is the same environment as DP); see the sketch below. By default it has a discrete action space, but you can change it to a continuous one. Also change the link mass, link length, dt and max torque accordingly. I have tried the same configuration as DP (without normalization) on Acrobot-v1, and it works with any algorithm, even with a threshold of 0. From this, the problem could be in the environment itself.
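A rough sketch of the Acrobot-v1 comparison from the last point. The attribute names are those of Gym's classic-control `AcrobotEnv` (worth double-checking against the installed version); the concrete values here are placeholders that would have to be taken from our DP configuration:

```python
import gym

env = gym.make("Acrobot-v1")
acro = env.unwrapped  # gym.envs.classic_control.acrobot.AcrobotEnv

# Align the physical parameters with the DP env (placeholder values!)
acro.LINK_MASS_1   = 1.0
acro.LINK_MASS_2   = 1.0
acro.LINK_LENGTH_1 = 1.0
acro.LINK_LENGTH_2 = 1.0
acro.LINK_COM_POS_1 = 0.5
acro.LINK_COM_POS_2 = 0.5
acro.dt = 0.05                          # integration step
acro.AVAIL_TORQUE = [-1.0, 0.0, 1.0]    # discrete torques; scale to our max torque

# Note: the default action space is discrete. A continuous variant needs a small
# subclass that maps a Box action to the torque instead of indexing AVAIL_TORQUE.
```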
@rizkydiprasetya We reviewed the current implementation of the env and the training behaviour yesterday in the hackathon. We currently don't have doubts about the numerics: it was taken over from the Matplotlib example, and howto RL-020 shows plausible random behaviour. Your proposals definitely make sense, but for me, checking the entire numerics is rather the last step in the current phase. One result of our discussion was that the env is to be fine-tuned regarding latency, cycle limit and reward strategy. The colleagues are already taking care of it. Howtos RL-021/022 in combination are another good validation/demonstration of our wrappers for SB3 and Gym. They will help exclude error sources and correspond to our approach of test-driven development.

@steveyuwono @laxmikantbaheti As discussed, please try
- smaller latency
- higher cycle limit
- improved reward strategy
- further training runs (S7)

Please also set up howto RL-022 and run comparable trainings. If both howtos 21/22 show the same/similar/comparable behaviour, then we can exclude the SB3 wrapper and the training/scenario classes as error sources. If the algos still don't perform in both howtos, I propose reviewing the algo settings together with Rizky. But if all these steps don't lead to a solution, then a review of the entire numerics is necessary, as Rizky proposed.
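For the native-SB3 side of howto RL-022, roughly something like this. The env id is a placeholder for however the Gym-wrapped DP env is exposed; algorithm and hyperparameters are only examples:

```python
import gym
from stable_baselines3 import PPO

# Placeholder: however the Gym-wrapped DP env is instantiated in howto RL-022
env = gym.make("DoublePendulum-v0")   # hypothetical id

# Native SB3 algo, no SB3 wrapper involved
model = PPO("MlpPolicy", env, n_steps=2048, verbose=1)
model.learn(total_timesteps=200_000)

# Quick rollout to eyeball the learned behaviour
obs = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
```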
> It was taken over from the Matplotlib example
In the matplotlib example, the torque is not included in the equations. My question is: how can we validate this? If we put the torque in the wrong place, the system will behave differently. It is like trying to stop a car by applying positive acceleration instead of negative acceleration.
That is why Acrobot is the closest reference for this problem.
I would say we will try what Detlef mentioned here and also what we discussed yesterday. If we still get the same results, in which the agents do not learn anything at all, then it is worth breaking everything down and reviewing it piece by piece, as Rizky proposed.
Hi colleagues, I think we can assume that the DP simulation is correct. I propose that we review the recent training approaches and try to locate the problem source(s). Let's focus on the latency, the cycle limit, the buffer size of the policy algo and their correlation. We should also review the reward strategy. I'll schedule an appointment for it.
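On the correlation between latency, cycle limit and the policy's buffer size, a small back-of-the-envelope sketch (all numbers are placeholders, the real values come from the training setup):

```python
# Placeholder numbers; replace with the values from the training setup.
latency_s   = 0.05      # env latency per cycle
cycle_limit = 400       # max cycles per episode
n_steps     = 2048      # e.g. PPO rollout buffer size per env

episode_duration_s   = latency_s * cycle_limit
episodes_per_rollout = n_steps / cycle_limit
print(f"simulated time per episode : {episode_duration_s:.1f} s")
print(f"episodes per policy update : {episodes_per_rollout:.1f}")
# If episodes_per_rollout is well below 1, each policy update only ever sees a
# fragment of an episode, which could explain why nothing is learned.
```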
Hi colleagues, in reflection of our yesterday's meeting I had some thoughts about an overall reward strategy. In this regard, I described the env and a proposal for a strategy in the attached file. I think it makes sense to define at least three zones based on the angle of the inner pole:
- Swinging up the inner pole
- Swinging up the outer pole
- Balancing
Maybe as a basis for further discussion and improvements... double pendulum and its reward strategy.drawio.pdf
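To make the proposal concrete, a rough sketch of such a zone-based reward. The thresholds and weights are invented placeholders, only to illustrate the three zones, not the values from the attached file:

```python
import numpy as np

def zone_reward(theta_inner, theta_outer, omega_inner, omega_outer):
    """Illustrative three-zone reward based on the pole angles
    (angles in rad, 0 = upright). All thresholds/weights are placeholders."""
    a_in  = abs(np.arctan2(np.sin(theta_inner), np.cos(theta_inner)))  # wrap to [0, pi]
    a_out = abs(np.arctan2(np.sin(theta_outer), np.cos(theta_outer)))

    if a_in > np.pi / 2:
        # Zone 1: swing up the inner pole
        return 1.0 - a_in / np.pi
    elif a_out > np.pi / 4:
        # Zone 2: inner pole is up, swing up the outer pole
        return 1.0 + (1.0 - a_out / np.pi)
    else:
        # Zone 3: balancing - reward staying upright with low velocities
        return 3.0 - 0.1 * (omega_inner**2 + omega_outer**2)
```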
Hi, this definition sounds good and we can stick to it.
Wanted: an agent structure/policy algorithm to control the Double Pendulum environment.
Prerequisites:
- #47