diambra / arena

DIAMBRA Arena: a New Reinforcement Learning Platform for Research and Experimentation
https://docs.diambra.ai

The training process terminated unexpectedly #79

Closed amit-gshe closed 1 year ago

amit-gshe commented 1 year ago

Hello, when I used my NVIDIA P106-100 for training, the process exited abnormally after reaching nearly 1M steps. I am not sure whether the problem is my graphics card or something else. The relevant log is below, followed by my host setup and a rough sketch of the training setup. Command: diambra run -s 6 -d -n python3 kof.py

(1) 2-0-0 Kyo[91] 1 Andy[19]                                                                                                                                                                                                       
(4) 2-0-0 Kyo[108] -11 Robert[119]                                                                                                                                                                                                 
(0) 1-0-0 Kyo[83] 2 Andy[96]                                                                                                                                                                                                       
(3) 4-0-0 Kyo[3] -7 Shingo[9]                                                                                                                                                                                                      
(1) 2-0-0 Kyo[91] 15 Andy[4]                                                                                                                                                                                                       
(0) 1-0-0 Kyo[83] 2 Andy[94]                                                                                                                                                                                                       
------------------------------------------                                                                                                                                                                                         
| rollout/                |              |                                                                                                                                                                                         
|    ep_len_mean          | 1.81e+04     |                                                                                                                                                                                         
|    ep_rew_mean          | 2.62         |                                                                                                                                                                                         
| time/                   |              |                                                                                                                                                                                         
|    fps                  | 205          |                                                                                                                                                                                         
|    iterations           | 1241         |                                                                                                                                                                                         
|    time_elapsed         | 4639         |                                                                                                                                                                                         
|    total_timesteps      | 953088       |                                                                                                                                                                                         
| train/                  |              |                                                                                                                                                                                         
|    approx_kl            | 0.0014639758 |                                                                                                                                                                                         
|    clip_fraction        | 0.0312       |                                                                                                                                                                                         
|    clip_range           | 0.1          |                                                                                                                                                                                         
|    entropy_loss         | -1.83        |                                                                                                                                                                                         
|    explained_variance   | 0.0821       |                                                                                                                                                                                         
|    learning_rate        | 5e-06        |                                                                                                                                                                                         
|    loss                 | -0.0334      |                                                                                                                                                                                         
|    n_updates            | 15690        |                                                                                                                                                                                         
|    policy_gradient_loss | -0.00641     |                                                                                                                                                                                         
|    value_loss           | 0.0169       |                                                                                                                                                                                         
------------------------------------------                                                                                                                                                                                         
(2) 1-1-2 Chang[81] 12 Mai[16]                                                                                                                                                                                                     
🏟 (4b6e) (3)Round lost                                                                                                                                                                                                             
(3)Moving to next round                                                                                                                                                                                                            
🏟 (4b6e) Error: Counter exceeded limit (6001) waitForFightStart in game environment class                                                                                                                                          
------------------------------------------------------------------------------------------------------                                                                                                                             

        Contact support on DIAMBRA Discord Server: https://diambra.ai/discord                                                                                                                                                      
                                        AND / OR                                                                                                                                                                                   
 Open a ticket on our GitHub repository issue Tracker: https://github.com/diambra/arena/issues                                                                                                                                     

------------------------------------------------------------------------------------------------------                                                                                                                             
Closing console 

Diambra Engine Image: DIGEST:sha256:3ef428b6c827b3b36120019ca6507c4381487f8fc2581b07c6dc910e41c2846a

My host setup:

> neofetch --off                                                             
OS: Manjaro Linux x86_64 
Kernel: 6.3.5-2-MANJARO 
Uptime: 2 hours, 43 mins 
Packages: 1614 (pacman) 
Shell: zsh 5.9 
DE: Xfce 
CPU: Intel i5-10500 (12) @ 4.500GHz 
GPU: NVIDIA P106-100 
GPU: Intel CometLake-S GT2 [UHD Graphics 630] 
Memory: 10113MiB / 15661MiB
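
For reference, kof.py itself is not shown here; a minimal sketch of a comparable training script, assuming the stable-baselines3 PPO helper bundled with diambra-arena and the game id "kof98umh" (both assumptions inferred from the SB3-style log above, not the exact details of my script), looks roughly like this:

```python
# Minimal sketch only, not the exact kof.py used here. Assumes the
# stable-baselines3 helper shipped with diambra-arena; the exact settings
# objects it expects depend on the arena version.
from diambra.arena.stable_baselines3.make_sb3_env import make_sb3_env
from stable_baselines3 import PPO


def main():
    # With `diambra run -s 6 -d -n python3 kof.py`, six engine containers are
    # started and make_sb3_env wraps them into one vectorized environment.
    env, num_envs = make_sb3_env("kof98umh", env_settings={}, wrappers_settings={})
    print("Parallel environments:", num_envs)

    model = PPO("MultiInputPolicy", env, verbose=1)
    model.learn(total_timesteps=5_000_000)
    model.save("ppo_kof")

    env.close()


if __name__ == "__main__":
    main()
```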
alexpalms commented 1 year ago

@amit-gshe thanks for opening this issue. This is a problem with the engine/environment, not with your GPU. The engine implements some checks to make sure it does not get stuck in while loops, mainly to handle emulator instabilities. I see this happened with the latest engine image.
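
To illustrate the kind of check that fired here, below is a sketch of the bounded-wait pattern; this is only an illustration in Python, not the actual engine code, and the limit value is taken from the error message above.

```python
# Illustrative sketch of the bounded-wait pattern, not the real engine code.
# The engine caps how long it will poll for a condition (e.g. the fight
# starting) so that an emulator hiccup cannot leave it in an infinite loop.
WAIT_FOR_FIGHT_START_LIMIT = 6001  # limit reported in the error above


def wait_for_fight_start(fight_started):
    """Poll `fight_started()` until true, failing loudly instead of hanging."""
    counter = 0
    while not fight_started():
        counter += 1
        if counter >= WAIT_FOR_FIGHT_START_LIMIT:
            raise RuntimeError(
                f"Counter exceeded limit ({WAIT_FOR_FIGHT_START_LIMIT}) "
                "waitForFightStart in game environment class"
            )
```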

It would be very useful if you could share additional information so that we can try to replicate this. In particular:

  1. Are you able to identify the stage the agent was playing when this happened?
  2. Would you be able to share the RAM values from the step right before it happened?
  3. Did this happen only once, or can you replicate it consistently (e.g. every time you reach 1M training steps)?

Looking forward to your reply so we can narrow down the search for the bug!

alexpalms commented 1 year ago

@amit-gshe here is a first update: I have been able to spot a problem in KOF. I am not 100% sure this is what is causing your problem, but it is very probable. The good news is that I can replicate it at will, which means the bugfix should not be hard to find.

In addition, the fix will be fully backward compatible, so you will be able to keep everything you already have.

The other good news is that it allowed me to implement a robust verification procedure for all environments, which will prevent similar issues from happening in the future.

I will keep you posted.

amit-gshe commented 1 year ago

Hello @alexpalms, I tried it a few more times and, strangely, the problem disappeared.

I checked the TensorBoard logs and found that the fps dropped significantly. In the issue description above the fps was about 200, but now, starting the same number of concurrent containers, it is only about 80. Whether I train on GPU or CPU the fps is the same, but with the GPU the CPU usage is significantly lower, since the training load moves to the GPU.
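
(For reference, assuming the numbers above come from stable-baselines3, its fps field is total_timesteps / time_elapsed, so the logged 205 corresponds to 953088 / 4639 s ≈ 205 steps/s aggregated over the 6 parallel environments, i.e. roughly 34 steps/s per environment.)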

For now, even past 1M timesteps, the training does not exit unexpectedly.

For your questions above:

  1. Yes. I printed the RAM state as: (3) 4-0-0 Kyo[3] -7 Shingo[9], which means that in env 3, at stage 4, the player side has won 0 rounds and the opponent has won 0 rounds; it is the round of Kyo vs Shingo, Kyo's current health is 3, Shingo's is 9, and the agent got a -7 reward, meaning Kyo took 7 damage. From the issue description above we can see that the abnormal container is env 3 at stage 4: the agent lost the round of Kyo vs Shingo. There have been normal cases before where Kyo loses, Benimaru continues, and the run eventually moves to the next stage. (A sketch of this kind of per-step logging is shown after this list.)
  2. See answer 1. Sorry, I can only give the RAM states mentioned above, because I cannot replicate the issue now.
  3. Twice. I am not sure if it is related to high fps values; the fps had reached more than 200 when the process exited abnormally.
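
The lines above come from a simple per-step logger. A rough sketch of the idea follows, assuming a gymnasium-style wrapper and placeholder observation keys ("stage", "P1"/"P2" with "wins", "character", "health") that may differ from the actual arena observation space:

```python
# Hypothetical logging wrapper sketching how lines like
# "(3) 4-0-0 Kyo[3] -7 Shingo[9]" can be produced:
# (env id) stage-ownWins-oppWins OwnChar[health] reward OppChar[health].
# The observation keys used here are placeholders; adapt them to the
# observation space of the arena version in use.
import gymnasium as gym


class RamStateLogger(gym.Wrapper):
    def __init__(self, env, env_id):
        super().__init__(env)
        self.env_id = env_id

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        p1, p2 = obs["P1"], obs["P2"]
        print(
            "({}) {}-{}-{} {}[{}] {} {}[{}]".format(
                self.env_id, obs["stage"], p1["wins"], p2["wins"],
                p1["character"], p1["health"], reward,
                p2["character"], p2["health"],
            )
        )
        return obs, reward, terminated, truncated, info
```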
alexpalms commented 1 year ago

@amit-gshe Perfect, thanks a lot for the feedback. Stage 4 is exactly where the problem I mentioned happens, and I have already fixed it. I am about to push a new engine; I am just completing the final checks. I will also post an explanation for the fps difference (which is normal and expected) in the next message, to confirm the closure of the issue. The fact that the failure has not happened again is normal, since there is some randomness involved, but my fix will solve it.

Thanks a lot for your feedback.

Will update later here.

alexpalms commented 1 year ago

@amit-gshe I just pushed the new engine that integrates the fix needed for KOF in stage 4. I am confident this will prevent this failure from happening in the future. The engine was not properly handling the fact that there is a single opponent in that stage.

Regarding the frame rate difference you noticed: when the game transitions between rounds or stages, it runs faster than during the combat phase. The bug that caused your error kept the engine stuck in a transitioning condition, letting it iterate faster. The correct frame rate is the one you see under normal functioning, which is slower than the transitioning speed.

You will receive the new image automatically at the next execution of the diambra run ... command because, as usual, it checks for the latest engine image available and automatically pulls it from Docker Hub, unless you explicitly prevent it from doing so with the dedicated command line argument.

KOF is the most recently added game, so it has been tested less than the others. For this reason, all the issues you opened have been very useful for improving its robustness, thanks a lot!

While fixing this bug, I also managed to implement an additional automated test that will now run on all new games, which improves robustness a lot.

I am closing this issue, but do not hesitate to let us know in case you encounter other problems.