diambra / arena

DIAMBRA Arena: a New Reinforcement Learning Platform for Research and Experimentation
https://docs.diambra.ai

diambra run -s hangs with parallel environments. #64

Closed amit-gshe closed 1 year ago

amit-gshe commented 1 year ago

Log:

(diambra-arena-sb3) ➜  ai /home/amit/miniconda3/envs/diambra-arena-sb3/bin/diambra run -d -n -s=10 python kof_opt.py
🖥
🖥  Starting DIAMBRA environment:
🖥  starting diambra
🖥  Request
🖥  logged in
🖥  creating env container
🖥  mapping port
🖥  adding bind mount
🖥  adding bind mount
🖥  creating container
🖥  container running
🖥  (6730) started env container
🖥  waiting for grpc
Stored credentials found.                                                                                       
Authorization granted.
Server listening on 0.0.0.0:50051
🖥  closing streamer
🖥  closing
🖥  logs copying..
🖥  creating env container
🖥  mapping port
🖥  adding bind mount
🖥  adding bind mount
🖥  in go func
🖥  creating container
🖥  copying logs in LogLogs
🖥  container running
🖥  (667a) started env container
🖥  logs copying..
🖥  creating env container
🖥  mapping port
🖥  adding bind mount
🖥  adding bind mount
🖥  in go func
🖥  creating container
🖥  copying logs in LogLogs
🏟 (667a) Stored credentials found.                                                                               
🖥  container running
🖥  (ca93) started env container
🖥  logs copying..
🖥  creating env container
🖥  mapping port
🖥  adding bind mount
🖥  adding bind mount
🖥  in go func
🖥  creating container
🖥  copying logs in LogLogs
🏟 (ca93) Stored credentials found.                                                                               
🖥  container running
🖥  (f3d4) started env container
🖥  logs copying..
🖥  creating env container
🖥  mapping port
🖥  adding bind mount
🖥  adding bind mount
🖥  in go func
🖥  creating container
🖥  copying logs in LogLogs
🏟 (f3d4) Stored credentials found.                                                                               
🖥  container running
🖥  (4f8e) started env container
🖥  logs copying..
🖥  creating env container
🖥  mapping port
🖥  adding bind mount
🖥  adding bind mount
🖥  in go func
🖥  creating container
🖥  copying logs in LogLogs
🏟 (4f8e) Stored credentials found.                                                                               
🖥  container running
🖥  (3086) started env container
🖥  logs copying..
🖥  creating env container
🖥  mapping port
🖥  adding bind mount
🖥  adding bind mount
🖥  creating container
🖥  in go func
🖥  copying logs in LogLogs
🏟 (3086) Stored credentials found.                                                                               
🖥  container running
🖥  (4a91) started env container
🖥  logs copying..
🖥  creating env container
🖥  mapping port
🖥  adding bind mount
🖥  adding bind mount
🖥  creating container
🖥  in go func
🖥  copying logs in LogLogs
🏟 (ca93) Authorization granted.                                                                                  
🏟 (ca93) Server listening on 0.0.0.0:50051                                                                       
🏟 (4a91) Stored credentials found.                                                                               
🖥  container running
🖥  (ead5) started env container
🖥  logs copying..
🖥  creating env container
🖥  in go func
🖥  mapping port
🖥  adding bind mount
🖥  adding bind mount
🖥  creating container
🖥  copying logs in LogLogs
🏟 (ead5) Stored credentials found.                                                                               
🏟 (4f8e) Authorization granted.                                                                                  
🏟 (4f8e) Server listening on 0.0.0.0:50051                                                                       
🖥  container running
🖥  (0c81) started env container
🖥  logs copying..
🖥  creating env container
🖥  mapping port
🖥  adding bind mount
🖥  adding bind mount
🖥  creating container
🖥  in go func
🖥  copying logs in LogLogs
🏟 (0c81) Stored credentials found.                                                                               
🖥  container running
🖥  (d4af) started env container
🖥  logs copying..
🖥  DIAMBRA environment started
🖥  in go func
🖥  running command
🖥  copying logs in LogLogs
🏟 (d4af) Stored credentials found.                                                                               
🏟 (ead5) Authorization granted.                                                                                  
🏟 (ead5) Server listening on 0.0.0.0:50051                                                                       
🏟 (0c81) Authorization granted.                                                                                  
🏟 (0c81) Server listening on 0.0.0.0:50051                                                                       
INFO:diambra.arena.engine.interface:Trying to connect to DIAMBRA Engine server 127.0.0.1:32889 (timeout=60s)...  
INFO:diambra.arena.engine.interface:... done.
INFO:diambra.arena.engine.interface:Trying to connect to DIAMBRA Engine server 127.0.0.1:32884 (timeout=60s)...
INFO:diambra.arena.engine.interface:... done.
INFO:diambra.arena.engine.interface:Trying to connect to DIAMBRA Engine server 127.0.0.1:32886 (timeout=60s)...
INFO:diambra.arena.engine.interface:... done.
INFO:diambra.arena.engine.interface:Trying to connect to DIAMBRA Engine server 127.0.0.1:32890 (timeout=60s)...
INFO:diambra.arena.engine.interface:... done.
INFO:diambra.arena.engine.interface:Trying to connect to DIAMBRA Engine server 127.0.0.1:32882 (timeout=60s)...
INFO:diambra.arena.engine.interface:... done.
🏟 (ead5) Environment initialization ...                                                                          
🏟 (0c81) Environment initialization ...                                                                          
🏟 (ca93) Environment initialization ...                                                                          
🏟 (6730)                                                                                                         
-----------------------------------------------------------------------------------------------------------------------------------------------------------------                                                                 
      .:-:-**#*#####+***+=-:.                                                                                    
 ..-++####+#################=+=:.                                                                                
:+*#*###########################*-.                                                                              
   .-+#############################=.                                                                            
      .-*###########################+.                                                                           
        .=######++======++*#########*#=. ........ ...     .......     ...........     ........     ........ ............     ............         ...........                                                                     
          -*=------:---------=*########- .------..-----:. .------.   .-----------.    --------.   :-------- --------------:. --------------:.    :----------:                                                                     
          .:--------:---------:=#######* .------..-------..------.   :-----:-----:.   ---------. .--------- ------:..:-----: ------:..------:   .------:-----.                                                                    
        .:------::---::--------:+#######..------. .------..------.  .------.:-----.   ----------.---------- ------:..------. ------:  :-----.  .:-----:.------.                                                                   
        :-----::.:---:-----::---:######* .------. .------..------. .:-----: .------.  ------:-------------- ------::-----:.. ------:.-----:.   .------..------.                                                                   
       .--:..  . :::.:.---.-:---.#####*- .------. .------..------. .------.::------:  ------:.-----.:------ ------:  :-----: ------:.:------: .------:.:-------.                                                                  
        ..:..:::.:::.::....:.::--:####*. .------::------:..------..------:.:::------. ------: .---. :-----: -------::------: ------:  ------: :------..::------:                                                                  
       .-::::::---...:--. .-.:-::=###+.  .::::::::::::..  .::::::..::::::.   .::::::. ::::::.  .:.  ::::::: ::::::::::::::.  :::::::  ::::::: ::::::.    .::::::                                                                  
       .--:.   .:--...::. ..::+####*:                                                                            
       :--:..   .---:...:.   .:+*+:.                                                                             
       .:---:::----.. ....    .:.                                                                                
        .:------:..:...  :.                                                                                      
         .......   ..:.  ..                                                                                      

                                                                   DIAMBRA™ | Dueling AI Arena
                                                              https://diambra.ai - info@diambra.ai               

                                   Usage of this software is subject to our Terms of Use described at https://diambra.ai/terms                                                                                                    

                                                         DIAMBRA™ is a Trade Mark, © Copyright 2018-2023

-----------------------------------------------------------------------------------------------------------------------------------------------------------------                                                                 

Environment initialization ...                                                                                   
🏟 (4f8e) Environment initialization ...                                                                          
🏟 (ead5) SHA256 check ok. Correct rom file found.                                                                
🏟 (ca93) SHA256 check ok. Correct rom file found.                                                                
🏟 (0c81) SHA256 check ok. Correct rom file found.                                                                
🏟 (6730) SHA256 check ok. Correct rom file found.                                                                
🏟 (ead5) Fontconfig error: Cannot load default config file                                                       
🏟 (ead5) 1 Completed console init                                                                                
🏟 (4f8e) SHA256 check ok. Correct rom file found.                                                                
🏟 (0c81) Fontconfig error: Cannot load default config file                                                       
🏟 (ca93) Fontconfig error: Cannot load default config file                                                       
🏟 (ead5) Warning: -video none doesn't make much sense without -seconds_to_run                                    
🏟 (ead5) ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf                 
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default                                                   
🏟 (ca93) 1 Completed console init                                                                                
🏟 (0c81) 1 Completed console init                                                                                
🏟 (6730) Fontconfig error: Cannot load default config file                                                       
🏟 (0c81) Warning: -video none doesn't make much sense without -seconds_to_run                                    
🏟 (0c81) ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf                 
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default                                                   
🏟 (ca93) Warning: -video none doesn't make much sense without -seconds_to_run                                    
🏟 (ca93) ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf                 
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default                                                   
🏟 (6730) 1 Completed console init                                                                                
🏟 (6730) Warning: -video none doesn't make much sense without -seconds_to_run                                    
🏟 (6730) ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf                 
🏟 (6730) ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default                                          
🏟 (4f8e) Fontconfig error: Cannot load default config file                                                       
🏟 (4f8e) 1 Completed console init                                                                                
🏟 (4f8e) Warning: -video none doesn't make much sense without -seconds_to_run                                    
🏟 (4f8e) ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf                 
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default                                                   
🏟 (ead5) Unable to create history.db                                                                             
🏟 (ead5) Unable to create history.db                                                                             
🏟 (ead5) Unable to create history.db                                                                             
🏟 (0c81) Unable to create history.db                                                                             
🏟 (0c81) Unable to create history.db                                                                             
🏟 (0c81) Unable to create history.db                                                                             
🏟 (6730) Unable to create history.db                                                                             
🏟 (6730) Unable to create history.db                                                                             
🏟 (6730) Unable to create history.db                                                                             
🏟 (4f8e) Unable to create history.db                                                                             
🏟 (4f8e) Unable to create history.db                                                                             
🏟 (4f8e) Unable to create history.db                                                                             
🏟 (ca93) Unable to create history.db                                                                             
🏟 (ca93) Unable to create history.db                                                                             
🏟 (ca93) Unable to create history.db                                                                             
🏟 (0c81) Num. of Channels = 4                                                                                    
Screen Dim (W x H) = 320 240                                                                                     
🏟 (0c81) Closing the stream gobbler                                                                              
🏟 (ead5) Num. of Channels = 4                                                                                    
Screen Dim (W x H) = 320 240                                                                                     
🏟 (ead5) Closing the stream gobbler                                                                              
🏟 (4f8e) Num. of Channels = 4                                                                                    
Screen Dim (W x H) = 320 240                                                                                     
🏟 (4f8e) Closing the stream gobbler                                                                              
🏟 (ca93) Num. of Channels = 4                                                                                    
Screen Dim (W x H) = 320 240                                                                                     
Closing the stream gobbler                                                                                       
🏟 (0c81) Warning: Cannot convert ENGINE_RECORDER_COMPRESSION env variable to integer, value: , using default value                                                                                                                
(Recorder) Frame encoding enabled.                                                                               
(Recorder) Compression quality: 95                                                                               
(8)Buttons configuration:                                                                                        
(8)  SK = But4                                                                                                   
(8)  SP = But3                                                                                                   
(8)  WK = But2                                                                                                   
(8)  WP = But1                                                                                                   
(8)Game Continue Val = 0                                                                                         
(8)Show final = 0                                                                                                
🏟 (0c81) (8)Characters = [ [Kyo, Andy, Joe], [Kyo, Andy, Joe] ]                                                  
(8)1P Environment                                                                                                
(8)Player side = P1                                                                                              
(8)Number of outfits = [1, 1]                                                                                    
🏟 (0c81) done.                                                                                                   
Native frame shape = [240 X 320 X 4]                                                                             
User defined frame_shape = [128 X 128 X 1]                                                                       
Resize flag = 1                                                                                                  
Grayscale flag = 1                                                                                               
🏟 (ead5) Warning: Cannot convert ENGINE_RECORDER_COMPRESSION env variable to integer, value: , using default value                                                                                                                
(Recorder) Frame encoding enabled.                                                                               
(Recorder) Compression quality: 95                                                                               
🏟 (ead5) (7)Buttons configuration:                                                                               
(7)  SK = But4                                                                                                   
(7)  SP = But3                                                                                                   
🏟 (ead5) (7)  WK = But2                                                                                          
(7)  WP = But1                                                                                                   
🏟 (ead5) (7)Game Continue Val = 0                                                                                
(7)Show final = 0                                                                                                
🏟 (ead5) (7)Characters = [ [Kyo, Andy, Joe], [Kyo, Andy, Joe] ]                                                  
(7)1P Environment                                                                                                
(7)Player side = P1                                                                                              
(7)Number of outfits = [1, 1]                                                                                    
🏟 (ead5) done.                                                                                                   
Native frame shape = [240 X 320 X 4]                                                                             
User defined frame_shape = [128 X 128 X 1]                                                                       
Resize flag = 1                                                                                                  
Grayscale flag = 1                                                                                               
🏟 (4f8e) Warning: Cannot convert ENGINE_RECORDER_COMPRESSION env variable to integer, value: , using default value                                                                                                                
(Recorder) Frame encoding enabled.                                                                               
(Recorder) Compression quality: 95                                                                               
(4)Buttons configuration:                                                                                        
(4)  SK = But4                                                                                                   
(4)  SP = But3                                                                                                   
🏟 (4f8e) (4)  WK = But2                                                                                          
(4)  WP = But1                                                                                                   
(4)Game Continue Val = 0                                                                                         
(4)Show final = 0                                                                                                
🏟 (4f8e) (4)Characters = [ [Kyo, Andy, Joe], [Kyo, Andy, Joe] ]                                                  
(4)1P Environment                                                                                                
(4)Player side = P1                                                                                              
(4)Number of outfits = [1, 1]                                                                                    
done.                                                                                                            
Native frame shape = [240 X 320 X 4]                                                                             
User defined frame_shape = [128 X 128 X 1]                                                                       
Resize flag = 1                                                                                                  
Grayscale flag = 1                                                                                               
🏟 (ca93) Warning: Cannot convert ENGINE_RECORDER_COMPRESSION env variable to integer, value: , using default value                                                                                                                
(Recorder) Frame encoding enabled.                                                                               
(Recorder) Compression quality: 95                                                                               
(2)Buttons configuration:                                                                                        
(2)  SK = But4                                                                                                   
(2)  SP = But3                                                                                                   
(2)  WK = But2                                                                                                   
(2)  WP = But1                                                                                                   
(2)Game Continue Val = 0                                                                                         
(2)Show final = 0                                                                                                
🏟 (ca93) (2)Characters = [ [Kyo, Andy, Joe], [Kyo, Andy, Joe] ]                                                  
(2)1P Environment                                                                                                
(2)Player side = P1                                                                                              
(2)Number of outfits = [1, 1]                                                                                    
🏟 (ca93) done.                                                                                                   
Native frame shape = [240 X 320 X 4]                                                                             
User defined frame_shape = [128 X 128 X 1]                                                                       
Resize flag = 1                                                                                                  
Grayscale flag = 1                                                                                               
INFO:diambra.arena.arena_gym:EnvironmentSettings1P(game_id='kof98umh', step_ratio=6, disable_keyboard=True, disable_joystick=True, rank=8, seed=8, env_address='127.0.0.1:32890', grpc_timeout=60, player='P1', continue_game=0.0, show_final=False, difficulty=1, frame_shape=(128, 128, 1), tower=3, hardcore=False, characters=('Kyo', 'Andy', 'Joe'), char_outfits=1, action_space='discrete', attack_but_combination=False, super_art=0, fighting_style=1, ultimate_style=(0, 0, 0))
INFO:diambra.arena.arena_gym:EnvironmentSettings1P(game_id='kof98umh', step_ratio=6, disable_keyboard=True, disable_joystick=True, rank=7, seed=7, env_address='127.0.0.1:32889', grpc_timeout=60, player='P1', continue_game=0.0, show_final=False, difficulty=1, frame_shape=(128, 128, 1), tower=3, hardcore=False, characters=('Kyo', 'Andy', 'Joe'), char_outfits=1, action_space='discrete', attack_but_combination=False, super_art=0, fighting_style=1, ultimate_style=(0, 0, 0))
INFO:diambra.arena.arena_gym:EnvironmentSettings1P(game_id='kof98umh', step_ratio=6, disable_keyboard=True, disable_joystick=True, rank=4, seed=4, env_address='127.0.0.1:32886', grpc_timeout=60, player='P1', continue_game=0.0, show_final=False, difficulty=1, frame_shape=(128, 128, 1), tower=3, hardcore=False, characters=('Kyo', 'Andy', 'Joe'), char_outfits=1, action_space='discrete', attack_but_combination=False, super_art=0, fighting_style=1, ultimate_style=(0, 0, 0))
INFO:diambra.arena.arena_gym:EnvironmentSettings1P(game_id='kof98umh', step_ratio=6, disable_keyboard=True, disable_joystick=True, rank=2, seed=2, env_address='127.0.0.1:32884', grpc_timeout=60, player='P1', continue_game=0.0, show_final=False, difficulty=1, frame_shape=(128, 128, 1), tower=3, hardcore=False, characters=('Kyo', 'Andy', 'Joe'), char_outfits=1, action_space='discrete', attack_but_combination=False, super_art=0, fighting_style=1, ultimate_style=(0, 0, 0))
Process ForkServerProcess-10:
Traceback (most recent call last):
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 23, in __init__
    self.client = Client(env_address, grpc_timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/engine/__init__.py", line 11, in __init__
    grpc.channel_ready_future(self.channel).result(timeout=timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_utilities.py", line 139, in result
    self._block(timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_utilities.py", line 85, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 25, in _worker
    env = env_fn_wrapper.var()
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/stable_baselines3/make_sb3_env.py", line 41, in _init
    env = diambra.arena.make(game_id, env_settings, wrappers_settings,
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/make_env.py", line 54, in make
    env = DiambraGym1P(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 334, in __init__
    super().__init__(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 199, in __init__
    super().__init__(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 29, in __init__
    self.arena_engine = DiambraEngine(env_settings.env_address, env_settings.grpc_timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 26, in __init__
    raise Exception(CONNECTION_FAILED_ERROR_TEXT) from e
Exception: DIAMBRA Arena failed to connect to the Engine Server.
 - If you are running a Python script, are you running with DIAMBRA CLI: `diambra run python script.py`?
 - If you are running a Python Notebook, have you started Jupyter Notebook with DIAMBRA CLI: `diambra run jupyter notebook`?

See the docs (https://docs.diambra.ai) for additional details, or join DIAMBRA Discord Server (https://discord.gg/tFDS2UN5sv) for support.
Process ForkServerProcess-4:
Traceback (most recent call last):
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 23, in __init__
    self.client = Client(env_address, grpc_timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/engine/__init__.py", line 11, in __init__
    grpc.channel_ready_future(self.channel).result(timeout=timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_utilities.py", line 139, in result
    self._block(timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_utilities.py", line 85, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 25, in _worker
    env = env_fn_wrapper.var()
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/stable_baselines3/make_sb3_env.py", line 41, in _init
    env = diambra.arena.make(game_id, env_settings, wrappers_settings,
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/make_env.py", line 54, in make
    env = DiambraGym1P(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 334, in __init__
    super().__init__(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 199, in __init__
    super().__init__(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 29, in __init__
    self.arena_engine = DiambraEngine(env_settings.env_address, env_settings.grpc_timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 26, in __init__
    raise Exception(CONNECTION_FAILED_ERROR_TEXT) from e
Exception: DIAMBRA Arena failed to connect to the Engine Server.
 - If you are running a Python script, are you running with DIAMBRA CLI: `diambra run python script.py`?
 - If you are running a Python Notebook, have you started Jupyter Notebook with DIAMBRA CLI: `diambra run jupyter notebook`?

See the docs (https://docs.diambra.ai) for additional details, or join DIAMBRA Discord Server (https://discord.gg/tFDS2UN5sv) for support.
Process ForkServerProcess-2:
Traceback (most recent call last):
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 23, in __init__
    self.client = Client(env_address, grpc_timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/engine/__init__.py", line 11, in __init__
    grpc.channel_ready_future(self.channel).result(timeout=timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_utilities.py", line 139, in result
    self._block(timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_utilities.py", line 85, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 25, in _worker
    env = env_fn_wrapper.var()
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/stable_baselines3/make_sb3_env.py", line 41, in _init
    env = diambra.arena.make(game_id, env_settings, wrappers_settings,
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/make_env.py", line 54, in make
    env = DiambraGym1P(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 334, in __init__
    super().__init__(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 199, in __init__
    super().__init__(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 29, in __init__
    self.arena_engine = DiambraEngine(env_settings.env_address, env_settings.grpc_timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 26, in __init__
    raise Exception(CONNECTION_FAILED_ERROR_TEXT) from e
Exception: DIAMBRA Arena failed to connect to the Engine Server.
 - If you are running a Python script, are you running with DIAMBRA CLI: `diambra run python script.py`?
 - If you are running a Python Notebook, have you started Jupyter Notebook with DIAMBRA CLI: `diambra run jupyter notebook`?

See the docs (https://docs.diambra.ai) for additional details, or join DIAMBRA Discord Server (https://discord.gg/tFDS2UN5sv) for support.
Process ForkServerProcess-6:
Traceback (most recent call last):
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 23, in __init__
    self.client = Client(env_address, grpc_timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/engine/__init__.py", line 11, in __init__
    grpc.channel_ready_future(self.channel).result(timeout=timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_utilities.py", line 139, in result
    self._block(timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_utilities.py", line 85, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 25, in _worker
    env = env_fn_wrapper.var()
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/stable_baselines3/make_sb3_env.py", line 41, in _init
    env = diambra.arena.make(game_id, env_settings, wrappers_settings,
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/make_env.py", line 54, in make
    env = DiambraGym1P(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 334, in __init__
    super().__init__(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 199, in __init__
    super().__init__(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 29, in __init__
    self.arena_engine = DiambraEngine(env_settings.env_address, env_settings.grpc_timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 26, in __init__
    raise Exception(CONNECTION_FAILED_ERROR_TEXT) from e
Exception: DIAMBRA Arena failed to connect to the Engine Server.
 - If you are running a Python script, are you running with DIAMBRA CLI: `diambra run python script.py`?
 - If you are running a Python Notebook, have you started Jupyter Notebook with DIAMBRA CLI: `diambra run jupyter notebook`?

See the docs (https://docs.diambra.ai) for additional details, or join DIAMBRA Discord Server (https://discord.gg/tFDS2UN5sv) for support.
Process ForkServerProcess-7:
Traceback (most recent call last):
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 23, in __init__
    self.client = Client(env_address, grpc_timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/engine/__init__.py", line 11, in __init__
    grpc.channel_ready_future(self.channel).result(timeout=timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_utilities.py", line 139, in result
    self._block(timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_utilities.py", line 85, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 25, in _worker
    env = env_fn_wrapper.var()
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/stable_baselines3/make_sb3_env.py", line 41, in _init
    env = diambra.arena.make(game_id, env_settings, wrappers_settings,
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/make_env.py", line 54, in make
    env = DiambraGym1P(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 334, in __init__
    super().__init__(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 199, in __init__
    super().__init__(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 29, in __init__
    self.arena_engine = DiambraEngine(env_settings.env_address, env_settings.grpc_timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 26, in __init__
    raise Exception(CONNECTION_FAILED_ERROR_TEXT) from e
Exception: DIAMBRA Arena failed to connect to the Engine Server.
 - If you are running a Python script, are you running with DIAMBRA CLI: `diambra run python script.py`?
 - If you are running a Python Notebook, have you started Jupyter Notebook with DIAMBRA CLI: `diambra run jupyter notebook`?

See the docs (https://docs.diambra.ai) for additional details, or join DIAMBRA Discord Server (https://discord.gg/tFDS2UN5sv) for support.
🏟 (f3d4) Authorization granted.                                                                                  
🏟 (3086) Authorization granted.                                                                                  
🏟 (f3d4) Server listening on 0.0.0.0:50051                                                                       
🏟 (3086) Server listening on 0.0.0.0:50051                                                                       
🏟 (667a) Authorization granted.                                                                                  
🏟 (667a) Server listening on 0.0.0.0:50051                                                                       
🏟 (d4af) Authorization granted.                                                                                  
🏟 (d4af) Server listening on 0.0.0.0:50051                                                                       
🏟 (4a91) Authorization granted.                                                                                  
🏟 (4a91) Server listening on 0.0.0.0:50051                                                                       
🏟 (6730) 1 FAILED TO REGISTER MAME resources                                                                     
terminate called without an active exception                                                                     
Process ForkServerProcess-1:                                                                                     
Traceback (most recent call last):
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 147, in _env_init
    response = self.client.EnvInit(env_settings_pb)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Socket closed"
        debug_error_string = "{"created":"@1682495320.110126843","description":"Error received from peer ipv4:127.0.0.1:32882","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Socket closed","grpc_status":14}"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 150, in _env_init
    response = self.client.GetError(model.Empty())
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1682495320.111108625","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1682495320.111107613","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 25, in _worker
    env = env_fn_wrapper.var()
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/stable_baselines3/make_sb3_env.py", line 41, in _init
    env = diambra.arena.make(game_id, env_settings, wrappers_settings,
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/make_env.py", line 54, in make
    env = DiambraGym1P(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 334, in __init__
    super().__init__(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 199, in __init__
    super().__init__(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 32, in __init__
    env_info_dict = self.arena_engine.env_init(self.env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 161, in env_init
    response = self._env_init(env_settings_pb)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 154, in _env_init
    raise Exception(CONNECTION_FAILED_ERROR_TEXT)
Exception: DIAMBRA Arena failed to connect to the Engine Server.
 - If you are running a Python script, are you running with DIAMBRA CLI: `diambra run python script.py`?
 - If you are running a Python Notebook, have you started Jupyter Notebook with DIAMBRA CLI: `diambra run jupyter notebook`?

See the docs (https://docs.diambra.ai) for additional details, or join DIAMBRA Discord Server (https://discord.gg/tFDS2UN5sv) for support.
Traceback (most recent call last):
  File "kof_opt.py", line 47, in <module>
    env, num_envs = make_sb3_env("kof98umh", settings, wrappers_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/stable_baselines3/make_sb3_env.py", line 60, in make_sb3_env
    env = SubprocVecEnv([_make_sb3_env(i + start_index) for i in range(num_envs)],
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 112, in __init__
    observation_space, action_space = self.remotes[0].recv()
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
🖥  done copying logs in LogLogs: <nil>
🖥  end of go func
🖥  (6730) stopping container
🖥  couldn't stop container: Error response from daemon: No such container: 6730d10ece70880e2cc113480f80c3409bf5da3335b1e41d8d6f617a22a7dcad
🖥  (667a) stopping container
🖥  done copying logs in LogLogs: <nil>
🖥  end of go func
🖥  (ca93) stopping container
🖥  done copying logs in LogLogs: <nil>
🖥  end of go func
🖥  (f3d4) stopping container
🖥  done copying logs in LogLogs: <nil>
🖥  end of go func
🖥  (4f8e) stopping container
🖥  done copying logs in LogLogs: <nil>
🖥  end of go func
🖥  (3086) stopping container
🖥  done copying logs in LogLogs: <nil>
🖥  end of go func
🖥  (4a91) stopping container
🖥  done copying logs in LogLogs: <nil>
🖥  end of go func
🖥  (ead5) stopping container
🖥  done copying logs in LogLogs: <nil>
🖥  end of go func
🖥  (0c81) stopping container
🖥  done copying logs in LogLogs: <nil>
🖥  end of go func
🖥  (d4af) stopping container
🖥  done copying logs in LogLogs: <nil>
🖥  end of go func
🖥  Couldn't cleanup DIAMBRA Env: Error response from daemon: No such container: 6730d10ece70880e2cc113480f80c3409bf5da3335b1e41d8d6f617a22a7dcad
🖥  Couldn't run: exit status 1

I inspected the output of the top command, and it seems that 2 container processes may be deadlocked and so can't respond to the client. [image]

alexpalms commented 1 year ago

@amit-gshe can you please add a description of your environment: 1) operating system, 2) the shell you are using, 3) Python version, 4) the output of pip list, 5) whether you have tried a smaller number of environments, e.g. 4?

amit-gshe commented 1 year ago
  1. Operating system
    OS: Deepin 20.9 x86_64 
    Kernel: 5.18.17-amd64-desktop-community-hwe 
    Uptime: 7 hours, 44 mins 
    Packages: 2422 (dpkg) 
    Shell: zsh 5.7.1 
    Terminal: /dev/pts/4 
    CPU: Intel i5-9400 (6) @ 4.100GHz 
    GPU: Intel UHD Graphics 630 
    Memory: 8103MiB / 31974MiB 
  2. shell env you are using: zsh
  3. Python version: Python from an activated miniconda environment,
    (diambra-arena-sb3) ➜  ~ python                        
    Python 3.8.16 (default, Mar  2 2023, 03:21:46) 
  4. Print the pip list result
    (diambra-arena-sb3) ➜  ~ pip list                                 
    Package                      Version
    ---------------------------- ----------
    absl-py                      1.4.0
    aiosignal                    1.3.1
    ale-py                       0.7.4
    astunparse                   1.6.3
    attrs                        23.1.0
    AutoROM                      0.6.1
    AutoROM.accept-rom-license   0.6.1
    cachetools                   5.3.0
    certifi                      2022.12.7
    charset-normalizer           3.1.0
    click                        8.0.4
    cloudpickle                  2.2.1
    cmake                        3.26.3
    contourpy                    1.0.7
    cycler                       0.11.0
    dacite                       1.8.0
    diambra                      0.0.14
    diambra-arena                2.1.0rc5
    diambra-engine               2.1.0rc11
    distlib                      0.3.6
    distro                       1.8.0
    dm-tree                      0.1.8
    filelock                     3.12.0
    flatbuffers                  23.3.3
    fonttools                    4.39.3
    frozenlist                   1.3.3
    gast                         0.4.0
    google-auth                  2.17.3
    google-auth-oauthlib         0.4.6
    google-pasta                 0.2.0
    grpcio                       1.43.0
    gym                          0.21.0
    h5py                         3.8.0
    idna                         3.4
    imageio                      2.28.0
    importlib-metadata           4.13.0
    importlib-resources          5.12.0
    inputs                       0.5
    Jinja2                       3.1.2
    jsonschema                   4.17.3
    keras                        2.10.0
    Keras-Preprocessing          1.1.2
    kiwisolver                   1.4.4
    lazy_loader                  0.2
    libclang                     16.0.0
    lit                          16.0.1
    lz4                          4.3.2
    Markdown                     3.4.3
    markdown-it-py               2.2.0
    MarkupSafe                   2.1.2
    matplotlib                   3.7.1
    mdurl                        0.1.2
    mpmath                       1.3.0
    msgpack                      1.0.5
    networkx                     3.1
    numpy                        1.24.3
    nvidia-cublas-cu11           11.10.3.66
    nvidia-cuda-cupti-cu11       11.7.101
    nvidia-cuda-nvrtc-cu11       11.7.99
    nvidia-cuda-runtime-cu11     11.7.99
    nvidia-cudnn-cu11            8.5.0.96
    nvidia-cufft-cu11            10.9.0.58
    nvidia-curand-cu11           10.2.10.91
    nvidia-cusolver-cu11         11.4.0.1
    nvidia-cusparse-cu11         11.7.4.91
    nvidia-nccl-cu11             2.14.3
    nvidia-nvtx-cu11             11.7.91
    oauthlib                     3.2.2
    opencv-python                4.7.0.72
    opt-einsum                   3.3.0
    packaging                    23.1
    pandas                       2.0.0
    Pillow                       9.5.0
    pip                          23.0.1
    pkgutil_resolve_name         1.3.10
    platformdirs                 3.2.0
    protobuf                     3.19.6
    psutil                       5.9.5
    pyasn1                       0.5.0
    pyasn1-modules               0.3.0
    Pygments                     2.15.1
    pyparsing                    3.0.9
    pyrsistent                   0.19.3
    python-dateutil              2.8.2
    pytz                         2023.3
    PyWavelets                   1.4.1
    PyYAML                       6.0
    ray                          2.0.0
    requests                     2.28.2
    requests-oauthlib            1.3.1
    rich                         13.3.4
    rsa                          4.9
    scikit-image                 0.20.0
    scipy                        1.9.1
    screeninfo                   0.8.1
    setuptools                   66.0.0
    six                          1.16.0
    stable-baselines3            1.8.0
    sympy                        1.11.1
    tabulate                     0.9.0
    tensorboard                  2.10.1
    tensorboard-data-server      0.6.1
    tensorboard-plugin-wit       1.8.1
    tensorboardX                 2.6
    tensorflow                   2.10.0
    tensorflow-estimator         2.10.0
    tensorflow-io-gcs-filesystem 0.32.0
    termcolor                    2.3.0
    tifffile                     2023.4.12
    tk                           0.1.0
    torch                        1.12.1
    tqdm                         4.65.0
    triton                       2.0.0
    typing_extensions            4.5.0
    tzdata                       2023.3
    urllib3                      1.26.15
    virtualenv                   20.22.0
    Werkzeug                     2.2.3
    wheel                        0.38.4
    wrapt                        1.15.0
    zipp                         3.15.0
  5. Have you tried with a smaller number of environments? Like 4? Yes, I tried 8, 4, 3, and 2 envs, but only the run with 2 envs works.
amit-gshe commented 1 year ago

The Docker version I'm using:

(diambra-arena-sb3) ➜  ~ docker version                          
Client: Docker Engine - Community
 Version:           23.0.4
 API version:       1.42
 Go version:        go1.19.8
 Git commit:        f480fb1
 Built:             Fri Apr 14 10:32:16 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          23.0.4
  API version:      1.42 (minimum version 1.12)
  Go version:       go1.19.8
  Git commit:       cbce331
  Built:            Fri Apr 14 10:32:16 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.20
  GitCommit:        2806fc1057397dbaeefbea0e4e17bddfbd388f38
 runc:
  Version:          1.1.5
  GitCommit:        v1.1.5-0-gf19387a
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
alexpalms commented 1 year ago

From a quick check using your training script, everything seems to work as expected. Here it is with 8 parallel envs: [image]

amit-gshe commented 1 year ago

Here is the full log of 3 envs:

(diambra-arena-sb3) ➜  ai diambra run -d -n -s=3 python kof_opt.py
🖥
🖥  Starting DIAMBRA environment:
🖥  starting diambra
🖥  Request
🖥  logged in
🖥  creating env container
🖥  mapping port
🖥  adding bind mount
🖥  adding bind mount
🖥  creating container
🖥  container running
🖥  (75f9) started env container
🖥  waiting for grpc
Stored credentials found.
Authorization granted.
Server listening on 0.0.0.0:50051
🖥  closing streamer
🖥  closing
🖥  logs copying..
🖥  creating env container
🖥  mapping port
🖥  adding bind mount
🖥  adding bind mount
🖥  in go func
🖥  creating container
🖥  copying logs in LogLogs
🖥  container running
🖥  (161c) started env container
🖥  logs copying..
🖥  creating env container
🖥  mapping port
🖥  adding bind mount
🖥  adding bind mount
🖥  in go func
🖥  creating container
🖥  copying logs in LogLogs
🏟 (161c) Stored credentials found.
🖥  container running
🖥  (e76f) started env container
🖥  logs copying..
🖥  DIAMBRA environment started
🖥  running command
🖥  in go func
🖥  copying logs in LogLogs
🏟 (e76f) Stored credentials found.
INFO:diambra.arena.engine.interface:Trying to connect to DIAMBRA Engine server 127.0.0.1:32771 (timeout=60s)...
INFO:diambra.arena.engine.interface:... done.
🏟 (75f9)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------                                                                 
      .:-:-**#*#####+***+=-:.                                                                                    
 ..-++####+#################=+=:.                                                                                
:+*#*###########################*-.                                                                              
   .-+#############################=.                                                                            
      .-*###########################+.                                                                           
🏟 (75f9)         .=######++======++*#########*#=. ........ ...     .......     ...........     ........     ........ ............     ............         ...........
          -*=------:---------=*########- .------..-----:. .------.   .-----------.    --------.   :-------- --------------:. --------------:.    :----------:                                                                     
          .:--------:---------:=#######* .------..-------..------.   :-----:-----:.   ---------. .--------- ------:..:-----: ------:..------:   .------:-----.                                                                    
        .:------::---::--------:+#######..------. .------..------.  .------.:-----.   ----------.---------- ------:..------. ------:  :-----.  .:-----:.------.                                                                   
        :-----::.:---:-----::---:######* .------. .------..------. .:-----: .------.  ------:-------------- ------::-----:.. ------:.-----:.   .------..------.                                                                   
       .--:..  . :::.:.---.-:---.#####*- .------. .------..------. .------.::------:  ------:.-----.:------ ------:  :-----: ------:.:------: .------:.:-------.                                                                  
        ..:..:::.:::.::....:.::--:####*. .------::------:..------..------:.:::------. ------: .---. :-----: -------::------: ------:  ------: :------..::------:                                                                  
       .-::::::---...:--. .-.:-::=###+.  .::::::::::::..  .::::::..::::::.   .::::::. ::::::.  .:.  ::::::: ::::::::::::::.  :::::::  ::::::: ::::::.    .::::::                                                                  
       .--:.   .:--...::. ..::+####*:                                                                            
       :--:..   .---:...:.   .:+*+:.                                                                             
       .:---:::----.. ....    .:.                                                                                
        .:------:..:...  :.                                                                                      
         .......   ..:.  ..                                                                                      

                                                                   DIAMBRA™ | Dueling AI Arena
                                                              https://diambra.ai - info@diambra.ai               

                                   Usage of this software is subject to our Terms of Use described at https://diambra.ai/terms                                                                                                    

                                                         DIAMBRA™ is a Trade Mark, © Copyright 2018-2023

-----------------------------------------------------------------------------------------------------------------------------------------------------------------                                                                 

🏟 (75f9)
Environment initialization ...
🏟 (75f9) SHA256 check ok. Correct rom file found.
🏟 (75f9) Fontconfig error: Cannot load default config file
🏟 (75f9) 1 Completed console init
🏟 (75f9) Warning: -video none doesn't make much sense without -seconds_to_run
🏟 (75f9) ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default
🏟 (75f9) Unable to create history.db
🏟 (75f9) Unable to create history.db
🏟 (75f9) Unable to create history.db
🏟 (75f9) Num. of Channels = 4
Screen Dim (W x H) = 320 240
🏟 (75f9) Closing the stream gobbler
🏟 (75f9) Warning: Cannot convert ENGINE_RECORDER_COMPRESSION env variable to integer, value: , using default value
(Recorder) Frame encoding enabled.
(Recorder) Compression quality: 95
(0)Buttons configuration:
(0)  SK = But4
(0)  SP = But3
(0)  WK = But2
(0)  WP = But1
(0)Game Continue Val = 0
(0)Show final = 0
🏟 (75f9) (0)Characters = [ [Kyo, Andy, Joe], [Kyo, Andy, Joe] ]
(0)1P Environment
(0)Player side = P1
(0)Number of outfits = [1, 1]
🏟 (75f9) done.
Native frame shape = [240 X 320 X 4]                                                                             
User defined frame_shape = [128 X 128 X 1]                                                                       
Resize flag = 1                                                                                                  
Grayscale flag = 1                                                                                               
INFO:diambra.arena.arena_gym:EnvironmentSettings1P(game_id='kof98umh', step_ratio=6, disable_keyboard=True, disable_joystick=True, rank=0, seed=0, env_address='127.0.0.1:32771', grpc_timeout=60, player='P1', continue_game=0.0, show_final=False, difficulty=1, frame_shape=(128, 128, 1), tower=3, hardcore=False, characters=('Kyo', 'Andy', 'Joe'), char_outfits=1, action_space='discrete', attack_but_combination=False, super_art=0, fighting_style=1, ultimate_style=(0, 0, 0))
Training a new model
Using cpu device
Wrapping the env in a VecTransposeImage.
/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/ppo/ppo.py:148: UserWarning: You have specified a mini-batch size of 128, but because the `RolloutBuffer` is of size `n_steps * n_envs = 192`, after every 1 untruncated mini-batches, there will be a truncated mini-batch of size 64
We recommend using a `batch_size` that is a factor of `n_steps * n_envs`.
Info: (n_steps=64 and n_envs=3)
  warnings.warn(
/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/common/policies.py:457: UserWarning: As shared layers in the mlp_extractor are removed since SB3 v1.8.0, you should now pass directly a dictionary and not a list (net_arch=dict(pi=..., vf=...) instead of net_arch=[dict(pi=..., vf=...)])
  warnings.warn(
Begin training the agent
🏟 (75f9) (0)Setting difficulty = 1
🏟 (75f9) (0)Player starting side = P1
(0)Restarting system
🏟 (75f9) (0)Starting game
(0)Selecting player and arts
(0)Player 1
(0)P1 = [Kyo, Andy, Joe]
(0)Fighting style: 1
🏟 (75f9) (0)Waiting for fight to start
Process ForkServerProcess-2:                                                                                     
Traceback (most recent call last):
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 23, in __init__
    self.client = Client(env_address, grpc_timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/engine/__init__.py", line 11, in __init__
    grpc.channel_ready_future(self.channel).result(timeout=timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_utilities.py", line 139, in result
    self._block(timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_utilities.py", line 85, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 25, in _worker
    env = env_fn_wrapper.var()
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/stable_baselines3/make_sb3_env.py", line 41, in _init
    env = diambra.arena.make(game_id, env_settings, wrappers_settings,
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/make_env.py", line 54, in make
    env = DiambraGym1P(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 334, in __init__
    super().__init__(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 199, in __init__
    super().__init__(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 29, in __init__
    self.arena_engine = DiambraEngine(env_settings.env_address, env_settings.grpc_timeout)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 26, in __init__
    raise Exception(CONNECTION_FAILED_ERROR_TEXT) from e
Exception: DIAMBRA Arena failed to connect to the Engine Server.
 - If you are running a Python script, are you running with DIAMBRA CLI: `diambra run python script.py`?
 - If you are running a Python Notebook, have you started Jupyter Notebook with DIAMBRA CLI: `diambra run jupyter notebook`?

See the docs (https://docs.diambra.ai) for additional details, or join DIAMBRA Discord Server (https://discord.gg/tFDS2UN5sv) for support.
Traceback (most recent call last):
  File "kof_opt.py", line 75, in <module>
    agent = PPO.load(model_path, env)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/common/base_class.py", line 663, in load
    data, params, pytorch_variables = load_from_zip_file(
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/common/save_util.py", line 390, in load_from_zip_file
    load_path = open_path(load_path, "r", verbose=verbose, suffix="zip")
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/functools.py", line 875, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/common/save_util.py", line 234, in open_path_str
    return open_path(pathlib.Path(path), mode, verbose, suffix)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/functools.py", line 875, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/common/save_util.py", line 286, in open_path_pathlib
    return open_path(path, mode, verbose, suffix)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/functools.py", line 875, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/common/save_util.py", line 266, in open_path_pathlib
    raise error
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/common/save_util.py", line 258, in open_path_pathlib
    path = path.open("rb")
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/pathlib.py", line 1222, in open
    return io.open(self, mode, buffering, encoding, errors, newline,
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/pathlib.py", line 1078, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'models_kof_opt/3.zip'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "kof_opt.py", line 90, in <module>
    agent.learn(total_timesteps=2000000, callback=[auto_save_callback, action_callback], progress_bar=True)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/ppo/ppo.py", line 308, in learn
    return super().learn(
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 239, in learn
    total_timesteps, callback = self._setup_learn(
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/common/base_class.py", line 412, in _setup_learn
    self._last_obs = self.env.reset()  # pytype: disable=annotation-type-mismatch
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/common/vec_env/vec_transpose.py", line 110, in reset
    return self.transpose_observations(self.venv.reset())
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 136, in reset
    obs = [remote.recv() for remote in self.remotes]
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 136, in <listcomp>
    obs = [remote.recv() for remote in self.remotes]
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
🖥  (75f9) stopping container
🖥  done copying logs in LogLogs: <nil>
🖥  end of go func
🖥  (161c) stopping container
🖥  done copying logs in LogLogs: <nil>
🖥  end of go func
🖥  (e76f) stopping container
🖥  done copying logs in LogLogs: <nil>
🖥  end of go func
🖥  Couldn't run: exit status 1
(diambra-arena-sb3) ➜  ai 
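
Side note, unrelated to the hang: the Stable-Baselines3 warning in the log above comes from simple arithmetic. With `n_steps=64` and `n_envs=3` the rollout buffer holds 192 samples, so a `batch_size` of 128 leaves a truncated mini-batch of 64. The sketch below only mirrors that arithmetic; the helper names `minibatch_sizes` and `valid_batch_sizes` are made up for illustration and are not part of SB3.

```python
from typing import List


def minibatch_sizes(n_steps: int, n_envs: int, batch_size: int) -> List[int]:
    """Sizes of the mini-batches obtained by slicing one rollout buffer."""
    buffer_size = n_steps * n_envs
    full, remainder = divmod(buffer_size, batch_size)
    # `full` untruncated batches, then one truncated batch if anything is left.
    return [batch_size] * full + ([remainder] if remainder else [])


def valid_batch_sizes(n_steps: int, n_envs: int) -> List[int]:
    """Batch sizes that divide the rollout buffer evenly (no truncated batch)."""
    buffer_size = n_steps * n_envs
    return [b for b in range(1, buffer_size + 1) if buffer_size % b == 0]


# Values taken from the warning in the log: n_steps=64, n_envs=3, batch_size=128.
print(minibatch_sizes(64, 3, 128))
print(valid_batch_sizes(64, 3))
```

Picking any value from the second list (for example 64 or 96) as `batch_size` should silence the warning for this configuration.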
amit-gshe commented 1 year ago

Hello, today I found that this issue seems to be caused by the wrong number of mame.real processes being created by the agent. When I create 4 envs, 4 Docker agent processes are set up successfully, but only 3 mame.real processes are created. [image]

discordianfish commented 1 year ago

@amit-gshe Can you check the logs of the Docker containers with `docker logs <container-name/id>`? You can also use `diambra run --env.autoremove=false` to prevent the diambra CLI from deleting the Docker containers after the main command has finished or failed. Remember to delete the containers after checking the logs.

discordianfish commented 1 year ago

I suspect one of the mame processes crashed while engine didn't (probably another case where we need to improve error handling in engine)

alexpalms commented 1 year ago

> I suspect one of the mame processes crashed while the engine didn't (probably another case where we need to improve error handling in the engine).

I just tested it, and it seems that if the execution of mame fails at startup, the engine crashes too (it throws a runtime_error exception). It is true, though, that if the emulator crashes after startup (after it has been initialized), the engine is not able to detect the failure.
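
To illustrate the gap described above (a child process that dies after a successful start), here is a minimal Python sketch of a watchdog loop. It is not the engine's code, which is not shown in this thread; `wait_or_detect_crash` and `demo` are hypothetical names, and the child process is a stand-in for the emulator.

```python
import subprocess
import sys
import time


def wait_or_detect_crash(proc, is_done, poll_interval=0.05, timeout=5.0):
    """Poll a child process while waiting for work to finish.

    Returns ("done", None) if is_done() becomes true, ("crashed", code) if
    the child exits first, or ("timeout", None) if neither happens in time.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        code = proc.poll()  # None while the child is still running
        if code is not None:
            return "crashed", code
        if is_done():
            return "done", None
        time.sleep(poll_interval)
    return "timeout", None


def demo():
    # Hypothetical stand-in for the emulator: a child that starts fine and
    # then exits with code 3 shortly afterwards.
    child = subprocess.Popen(
        [sys.executable, "-c", "import sys, time; time.sleep(0.1); sys.exit(3)"]
    )
    return wait_or_detect_crash(child, is_done=lambda: False)
```

The point of the sketch is only that liveness is checked on every iteration of the wait loop rather than once at startup, so a late crash surfaces as an error instead of an indefinite wait.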

amit-gshe commented 1 year ago

I think I have found the reason: the engine validates the credentials at startup, but because of some network issue this step hangs for a long time. These engine containers log "Stored credentials found." and then exit when the grpc call times out. I tried setting grpc_timeout to 120 seconds, but it still does not work.

Maybe a proxy setting is needed when setting up engine containers for those who have bad network conditions. Or can we change when the credentials are validated, instead of verifying them every time a container starts? Or can we simply ignore the failing containers and continue the training process? [image]
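
As a rough illustration of the last idea (skipping environments whose engine is not reachable), the sketch below splits a list of "host:port" addresses by whether a TCP connection can be opened within a timeout. This is an assumption-laden workaround on the client side, not a DIAMBRA feature: `split_reachable` is a hypothetical helper, and an open TCP port does not prove the engine has finished initializing (a Docker port mapping may accept connections even when the process inside the container is not ready).

```python
import socket


def split_reachable(addresses, timeout=2.0):
    """Split "host:port" strings into (reachable, unreachable) lists.

    Only checks that a TCP connection can be opened; it says nothing about
    whether the service behind the port is healthy.
    """
    reachable, unreachable = [], []
    for address in addresses:
        host, port = address.rsplit(":", 1)
        try:
            with socket.create_connection((host, int(port)), timeout=timeout):
                reachable.append(address)
        except OSError:  # refused, timed out, unreachable, ...
            unreachable.append(address)
    return reachable, unreachable
```

A training script could call this on its list of env addresses before creating the vectorized environment and build only as many workers as there are reachable addresses.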

alexpalms commented 1 year ago

> I think I have found the reason: the engine validates the credentials at startup, but because of some network issue this step hangs for a long time. These engine containers log "Stored credentials found." and then exit when the grpc call times out. I tried setting grpc_timeout to 120 seconds, but it still does not work.

> Maybe a proxy setting is needed when setting up engine containers for those who have bad network conditions. Or can we change when the credentials are validated, instead of verifying them every time a container starts? Or can we simply ignore the failing containers and continue the training process? [image]

@discordianfish @amit-gshe do you know a way to simulate a slow internet connection to reproduce this problem? I am not completely sure this is the problem, though: if the authentication step fails (for whatever reason, including a timeout in the API request), the docker container should fail and return an error message, shouldn't it, @discordianfish?

discordianfish commented 1 year ago

If I understand @amit-gshe correctly, the theory is that after 'stored credentials found' the engine tries to validate them but never times out. @alexpalms Should it time out? Can we make it retry (a few times, ideally with exponential backoff)? You can probably reproduce something similar by adding the api domain to your /etc/hosts with some unreachable IP.
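
The retry-with-exponential-backoff idea mentioned above can be sketched as follows. This is a generic Python illustration, not the engine's implementation; `retry_with_backoff` and its parameters are hypothetical, and `operation` stands for whatever call performs the credential validation.

```python
import time


def retry_with_backoff(operation, attempts=4, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    Waits base_delay, 2*base_delay, 4*base_delay, ... (capped at max_delay)
    between failed attempts and re-raises the last error.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(min(base_delay * (2 ** attempt), max_delay))
```

For this to help, each individual attempt also needs its own timeout; otherwise a single hung request blocks forever and the retry logic is never reached.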

alexpalms commented 1 year ago

> If I understand @amit-gshe correctly, the theory is that after 'stored credentials found' the engine tries to validate them but never times out. @alexpalms Should it time out? Can we make it retry (a few times, ideally with exponential backoff)? You can probably reproduce something similar by adding the api domain to your /etc/hosts with some unreachable IP.

@discordianfish I understood the same thing, and I think it should time out, but I am not sure. Thanks for the suggestion on how to replicate it, I will try it locally. In the meantime, @amit-gshe, I just pushed a new engine that handles emulator crashes better. Can you retry with it? It should be pulled automatically by docker (tag: v2.1.0-rc14) when you run scripts using our command line interface (e.g. diambra run python script.py).

amit-gshe commented 1 year ago

@alexpalms I tried the latest engine and the validation issue has gone away. Now if I start 8 envs, all 8 mame.real processes are created successfully. But some mame processes hang after logging "Unable to create history.db" and never print "Num. of Channels = 4". Those hung processes constantly use 100% of the CPU. E.g., here are 2 engine containers each taking up a full CPU: [image] Here are the logs of these hanging containers: [image] Here is the top command showing the 2 deadlocked containers: [image] Here is a normal container's log: [image]

After a very long time, those abnormal engine containers fail with the log: 1 FAILED TO REGISTER MAME resources
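
The symptom described above (a container that logs "Unable to create history.db" but never reaches "Num. of Channels") can be spotted mechanically in the combined CLI output. The sketch below is a hypothetical helper, not part of the DIAMBRA tooling; it assumes each prefixed log line carries a 4-hex-digit container id in parentheses, as in the log that follows.

```python
import re

# Matches the 4-hex-digit container id printed before a message,
# e.g. "(a665) Unable to create history.db".
PREFIX = re.compile(r"\(([0-9a-f]{4})\)\s+(.*)")


def find_stalled(log_lines,
                 started="Unable to create history.db",
                 progressed="Num. of Channels"):
    """Return ids that logged `started` but never logged `progressed`."""
    seen_started, seen_progressed = set(), set()
    for line in log_lines:
        match = PREFIX.search(line)
        if not match:
            continue  # continuation line without an id prefix
        container_id, message = match.groups()
        if started in message:
            seen_started.add(container_id)
        if progressed in message:
            seen_progressed.add(container_id)
    return sorted(seen_started - seen_progressed)
```

Run over the log below, this check would be expected to flag the two containers that later print "FAILED TO REGISTER MAME resources".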

Environment initialization ...                                                                                                 
🏟 (6fce) Environment initialization ...                                                                                        
🏟 (3b8c) Environment initialization ...                                                                                        
🏟 (c9e9) SHA256 check ok. Correct rom file found.                                                                              
🏟 (438f) SHA256 check ok. Correct rom file found.                                                                              
🏟 (5e7e) SHA256 check ok. Correct rom file found.                                                                              
🏟 (c9e9) Fontconfig error: Cannot load default config file                                                                     
🏟 (a665) SHA256 check ok. Correct rom file found.                                                                              
🏟 (c9e9) 1 Completed console init                                                                                              
🏟 (9771) SHA256 check ok. Correct rom file found.                                                                              
🏟 (6fce) SHA256 check ok. Correct rom file found.                                                                              
🏟 (70c4) Environment initialization ...                                                                                        
🏟 (438f) Fontconfig error: Cannot load default config file                                                                     
🏟 (3b8c) SHA256 check ok. Correct rom file found.                                                                              
🏟 (438f) 1 Completed console init                                                                                              
🏟 (a665) Fontconfig error: Cannot load default config file                                                                     
🏟 (9771) Fontconfig error: Cannot load default config file                                                                     
🏟 (5e7e) Fontconfig error: Cannot load default config file                                                                     
🏟 (a665) 1 Completed console init                                                                                              
🏟 (438f) Warning: -video none doesn't make much sense without -seconds_to_run                                                  
🏟 (438f) ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf                               
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default                                                                 
🏟 (9771) 1 Completed console init                                                                                              
🏟 (5e7e) 1 Completed console init                                                                                              
🏟 (c9e9) Warning: -video none doesn't make much sense without -seconds_to_run                                                  
🏟 (c9e9) ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf                               
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default                                                                 
🏟 (a665) Warning: -video none doesn't make much sense without -seconds_to_run                                                  
🏟 (a665) ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf                               
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default                                                                 
🏟 (5e7e) Warning: -video none doesn't make much sense without -seconds_to_run                                                  
🏟 (5e7e) ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf                               
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default                                                                 
🏟 (6fce) Fontconfig error: Cannot load default config file                                                                     
🏟 (6fce) 1 Completed console init                                                                                              
🏟 (3b8c) Fontconfig error: Cannot load default config file                                                                     
🏟 (9771) Warning: -video none doesn't make much sense without -seconds_to_run                                                  
🏟 (6fce) Warning: -video none doesn't make much sense without -seconds_to_run                                                  
🏟 (6fce) ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf                               
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default                                                                 
🏟 (9771) ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf                               
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default                                                                 
🏟 (3b8c) 1 Completed console init                                                                                              
🏟 (3b8c) Warning: -video none doesn't make much sense without -seconds_to_run                                                  
🏟 (3b8c) ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf                               
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default                                                                 
🏟 (70c4) SHA256 check ok. Correct rom file found.                                                                              
🏟 (70c4) Fontconfig error: Cannot load default config file                                                                     
🏟 (70c4) 1 Completed console init                                                                                              
🏟 (70c4) Warning: -video none doesn't make much sense without -seconds_to_run                                                  
🏟 (70c4) ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf                               
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default                                                                 
🏟 (a665) Unable to create history.db                                                                                           
🏟 (a665) Unable to create history.db                                                                                           
🏟 (a665) Unable to create history.db                                                                                           
🏟 (438f) Unable to create history.db                                                                                           
🏟 (438f) Unable to create history.db                                                                                           
🏟 (438f) Unable to create history.db                                                                                           
🏟 (5e7e) Unable to create history.db                                                                                           
🏟 (5e7e) Unable to create history.db                                                                                           
🏟 (5e7e) Unable to create history.db                                                                                           
🏟 (9771) Unable to create history.db                                                                                           
🏟 (9771) Unable to create history.db                                                                                           
🏟 (9771) Unable to create history.db                                                                                           
🏟 (6fce) Unable to create history.db                                                                                           
🏟 (6fce) Unable to create history.db                                                                                           
🏟 (6fce) Unable to create history.db                                                                                           
🏟 (3b8c) Unable to create history.db                                                                                           
🏟 (3b8c) Unable to create history.db                                                                                           
🏟 (3b8c) Unable to create history.db                                                                                           
🏟 (c9e9) Unable to create history.db                                                                                           
🏟 (c9e9) Unable to create history.db                                                                                           
🏟 (c9e9) Unable to create history.db                                                                                           
🏟 (70c4) Unable to create history.db                                                                                           
🏟 (70c4) Unable to create history.db                                                                                           
🏟 (70c4) Unable to create history.db                                                                                           
🏟 (3b8c) Num. of Channels = 4                                                                                                  
Screen Dim (W x H) = 320 240                                                                                                   
🏟 (9771) Num. of Channels = 4                                                                                                  
Screen Dim (W x H) = 320 240                                                                                                   
🏟 (6fce) Num. of Channels = 4                                                                                                  
Screen Dim (W x H) = 320 240                                                                                                   
🏟 (a665) Num. of Channels = 4                                                                                                  
Screen Dim (W x H) = 320 240                                                                                                   
🏟 (3b8c) Closing the stream gobbler                                                                                            
🏟 (c9e9) Num. of Channels = 4                                                                                                  
Screen Dim (W x H) = 320 240                                                                                                   
🏟 (9771) Closing the stream gobbler                                                                                            
🏟 (6fce) Closing the stream gobbler                                                                                            
🏟 (a665) Closing the stream gobbler                                                                                            
🏟 (c9e9) Closing the stream gobbler                                                                                            
🏟 (70c4) Num. of Channels = 4                                                                                                  
Screen Dim (W x H) = 320 240                                                                                                   
🏟 (70c4) Closing the stream gobbler                                                                                            
🏟 (3b8c) Warning: Cannot convert ENGINE_RECORDER_COMPRESSION env variable to integer, value: , using default value             
(Recorder) Frame encoding enabled.                                                                                             
(Recorder) Compression quality: 95                                                                                             
🏟 (6fce) Warning: Cannot convert ENGINE_RECORDER_COMPRESSION env variable to integer, value: , using default value             
(Recorder) Frame encoding enabled.                                                                                             
(Recorder) Compression quality: 95                                                                                             
(6)Buttons configuration:                                                                                                      
(6)  SK = But4                                                                                                                 
(6)  SP = But3                                                                                                                 
(6)  WK = But2                                                                                                                 
(6)  WP = But1                                                                                                                 
(6)Game Continue Val = 0                                                                                                       
(6)Show final = 0                                                                                                              
🏟 (a665) Warning: Cannot convert ENGINE_RECORDER_COMPRESSION env variable to integer, value: , using default value             
(Recorder) Frame encoding enabled.                                                                                             
(Recorder) Compression quality: 95                                                                                             
🏟 (6fce) (6)Characters = [ [Kyo, Andy, Joe], [Kyo, Andy, Joe] ]                                                                
(6)1P Environment                                                                                                              
(6)Player side = P1                                                                                                            
🏟 (6fce) (6)Number of outfits = [1, 1]                                                                                         
🏟 (a665) (1)Buttons configuration:                                                                                             
(1)  SK = But4                                                                                                                 
(1)  SP = But3                                                                                                                 
🏟 (a665) (1)  WK = But2                                                                                                        
(1)  WP = But1                                                                                                                 
🏟 (a665) (1)Game Continue Val = 0                                                                                              
(1)Show final = 0                                                                                                              
🏟 (a665) (1)Characters = [ [Kyo, Andy, Joe], [Kyo, Andy, Joe] ]                                                                
🏟 (3b8c) (3)Buttons configuration:                                                                                             
(3)  SK = But4                                                                                                                 
(3)  SP = But3                                                                                                                 
🏟 (3b8c) (3)  WK = But2                                                                                                        
(3)  WP = But1                                                                                                                 
(3)Game Continue Val = 0                                                                                                       
(3)Show final = 0                                                                                                              
🏟 (a665) (1)1P Environment                                                                                                     
(1)Player side = P1                                                                                                            
(1)Number of outfits = [1, 1]                                                                                                  
🏟 (6fce) done.                                                                                                                 
Native frame shape = [240 X 320 X 4]                                                                                           
User defined frame_shape = [128 X 128 X 1]                                                                                     
Resize flag = 1                                                                                                                
🏟 (6fce) Grayscale flag = 1                                                                                                    
🏟 (a665) done.                                                                                                                 
Native frame shape = [240 X 320 X 4]                                                                                           
User defined frame_shape = [128 X 128 X 1]                                                                                     
Resize flag = 1                                                                                                                
Grayscale flag = 1                                                                                                             
🏟 (3b8c) (3)Characters = [ [Kyo, Andy, Joe], [Kyo, Andy, Joe] ]                                                                
(3)1P Environment                                                                                                              
(3)Player side = P1                                                                                                            
(3)Number of outfits = [1, 1]                                                                                                  
🏟 (3b8c) done.                                                                                                                 
Native frame shape = [240 X 320 X 4]                                                                                           
User defined frame_shape = [128 X 128 X 1]                                                                                     
Resize flag = 1                                                                                                                
Grayscale flag = 1                                                                                                             
INFO:diambra.arena.arena_gym:EnvironmentSettings1P(game_id='kof98umh', step_ratio=6, disable_keyboard=True, disable_joystick=True, rank=6, seed=104, env_address='127.0.0.1:32823', grpc_timeout=60, player='P1', continue_game=0.0, show_final=False, difficulty=1, frame_shape=(128, 128, 1), tower=3, hardcore=True, characters=('Kyo', 'Andy', 'Joe'), char_outfits=1, action_space='multi_discrete', attack_but_combination=True, super_art=0, fighting_style=1, ultimate_style=(0, 0, 0))
INFO:diambra.arena.arena_gym:EnvironmentSettings1P(game_id='kof98umh', step_ratio=6, disable_keyboard=True, disable_joystick=True, rank=1, seed=99, env_address='127.0.0.1:32818', grpc_timeout=60, player='P1', continue_game=0.0, show_final=False, difficulty=1, frame_shape=(128, 128, 1), tower=3, hardcore=True, characters=('Kyo', 'Andy', 'Joe'), char_outfits=1, action_space='multi_discrete', attack_but_combination=True, super_art=0, fighting_style=1, ultimate_style=(0, 0, 0))
INFO:diambra.arena.arena_gym:EnvironmentSettings1P(game_id='kof98umh', step_ratio=6, disable_keyboard=True, disable_joystick=True, rank=3, seed=101, env_address='127.0.0.1:32820', grpc_timeout=60, player='P1', continue_game=0.0, show_final=False, difficulty=1, frame_shape=(128, 128, 1), tower=3, hardcore=True, characters=('Kyo', 'Andy', 'Joe'), char_outfits=1, action_space='multi_discrete', attack_but_combination=True, super_art=0, fighting_style=1, ultimate_style=(0, 0, 0))
🏟 (70c4) Warning: Cannot convert ENGINE_RECORDER_COMPRESSION env variable to integer, value: , using default value             
(Recorder) Frame encoding enabled.                                                                                             
(Recorder) Compression quality: 95                                                                                             
(7)Buttons configuration:                                                                                                      
(7)  SK = But4                                                                                                                 
(7)  SP = But3                                                                                                                 
(7)  WK = But2                                                                                                                 
(7)  WP = But1                                                                                                                 
(7)Game Continue Val = 0                                                                                                       
(7)Show final = 0                                                                                                              
(7)Characters = [ [Kyo, Andy, Joe], [Kyo, Andy, Joe] ]                                                                         
(7)1P Environment                                                                                                              
(7)Player side = P1                                                                                                            
(7)Number of outfits = [1, 1]                                                                                                  
🏟 (70c4) done.                                                                                                                 
Native frame shape = [240 X 320 X 4]                                                                                           
User defined frame_shape = [128 X 128 X 1]                                                                                     
🏟 (70c4) Resize flag = 1                                                                                                       
Grayscale flag = 1                                                                                                             
INFO:diambra.arena.arena_gym:EnvironmentSettings1P(game_id='kof98umh', step_ratio=6, disable_keyboard=True, disable_joystick=True, rank=7, seed=105, env_address='127.0.0.1:32824', grpc_timeout=60, player='P1', continue_game=0.0, show_final=False, difficulty=1, frame_shape=(128, 128, 1), tower=3, hardcore=True, characters=('Kyo', 'Andy', 'Joe'), char_outfits=1, action_space='multi_discrete', attack_but_combination=True, super_art=0, fighting_style=1, ultimate_style=(0, 0, 0))
🏟 (5e7e) 1 FAILED TO REGISTER MAME resources                                                                                   
🏟 (5e7e) terminate called without an active exception                                                                          
🏟 (438f) 1 FAILED TO REGISTER MAME resources                                                                                   
terminate called without an active exception                                                                                   
Process ForkServerProcess-3:                                                                                                   
Process ForkServerProcess-6:
Traceback (most recent call last):
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 148, in _env_init
    response = self.client.EnvInit(env_settings_pb)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Socket closed"
        debug_error_string = "{"created":"@1684122170.210990382","description":"Error received from peer ipv4:127.0.0.1:32819","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Socket closed","grpc_status":14}"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 151, in _env_init
    response = self.client.GetError(model.Empty())
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1684122170.211816025","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1684122170.211815218","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 25, in _worker
    env = env_fn_wrapper.var()
  File "/home/amit/demo/ai/env.py", line 151, in _init
    env = make(game_id, env_settings, wrappers_settings,
  File "/home/amit/demo/ai/env.py", line 93, in make
    env = CustomizedDiambraGymHardcore1P(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 199, in __init__
    super().__init__(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 32, in __init__
    env_info_dict = self.arena_engine.env_init(self.env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 162, in env_init
    response = self._env_init(env_settings_pb)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 155, in _env_init
    raise Exception(CONNECTION_FAILED_ERROR_TEXT)
Exception: DIAMBRA Arena failed to connect to the Engine Server.
 - If you are running a Python script, are you running with DIAMBRA CLI: `diambra run python script.py`?
 - If you are running a Python Notebook, have you started Jupyter Notebook with DIAMBRA CLI: `diambra run jupyter notebook`?

See the docs (https://docs.diambra.ai) for additional details, or join DIAMBRA Discord Server (https://discord.gg/tFDS2UN5sv) for support.
Traceback (most recent call last):
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 148, in _env_init
    response = self.client.EnvInit(env_settings_pb)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Socket closed"
        debug_error_string = "{"created":"@1684122170.211306497","description":"Error received from peer ipv4:127.0.0.1:32822","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Socket closed","grpc_status":14}"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 151, in _env_init
    response = self.client.GetError(model.Empty())
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1684122170.211996807","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1684122170.211996124","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 25, in _worker
    env = env_fn_wrapper.var()
  File "/home/amit/demo/ai/env.py", line 151, in _init
    env = make(game_id, env_settings, wrappers_settings,
  File "/home/amit/demo/ai/env.py", line 93, in make
    env = CustomizedDiambraGymHardcore1P(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 199, in __init__
    super().__init__(env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/arena_gym.py", line 32, in __init__
    env_info_dict = self.arena_engine.env_init(self.env_settings)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 162, in env_init
    response = self._env_init(env_settings_pb)
  File "/home/amit/miniconda3/envs/diambra-arena-sb3/lib/python3.8/site-packages/diambra/arena/engine/interface.py", line 155, in _env_init
    raise Exception(CONNECTION_FAILED_ERROR_TEXT)
Exception: DIAMBRA Arena failed to connect to the Engine Server.
 - If you are running a Python script, are you running with DIAMBRA CLI: `diambra run python script.py`?
 - If you are running a Python Notebook, have you started Jupyter Notebook with DIAMBRA CLI: `diambra run jupyter notebook`?

See the docs (https://docs.diambra.ai) for additional details, or join DIAMBRA Discord Server (https://discord.gg/tFDS2UN5sv) for support.
🖥  done copying logs in LogLogs: <nil>
🖥  end of go func
🖥  done copying logs in LogLogs: <nil>
🖥  end of go func
alexpalms commented 1 year ago

@amit-gshe thank you for the feedback, it is interesting (and strange). As a final test step, could you please retry now? We just released a new engine version that, together with the previous modification, better handles the license authentication timeout on slow internet connections. Docker should pull it automatically (tag: v2.1.0-rc15) when you run scripts through our command line interface (e.g. diambra run python script.py).

amit-gshe commented 1 year ago

@alexpalms I just tried the latest engine image, and I can confirm that the authentication problem is resolved. However, one or more engine containers still randomly get pinned at 100% CPU usage. When I start 8 environments, almost every time one or two engine containers sit at 100% CPU and then get stuck, which makes it impossible to start training. Occasionally, when the CPU usage of all containers is normal, I can enter the game and start training.
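
For anyone trying to spot the stuck containers quickly, here is a minimal sketch that flags containers pinned near 100% CPU. It assumes the Docker CLI is on the PATH and reads `docker stats --no-stream --format "{{.Name}} {{.CPUPerc}}"`. The function names and the 95% threshold are my own choices for illustration, not part of DIAMBRA.

```python
import subprocess


def parse_cpu_stats(stats_text):
    """Parse lines of the form '<name> <cpu>%' into {name: cpu_percent}."""
    usage = {}
    for line in stats_text.splitlines():
        parts = line.split()
        # Skip blank or malformed lines (e.g. a container reporting '--').
        if len(parts) != 2 or not parts[1].endswith("%"):
            continue
        try:
            usage[parts[0]] = float(parts[1].rstrip("%"))
        except ValueError:
            continue
    return usage


def busy_containers(stats_text, threshold=95.0):
    """Return sorted names of containers at or above the CPU threshold.

    The threshold is an assumed cutoff; docker stats can report more
    than 100% on multi-core hosts, so adjust it for your machine.
    """
    stats = parse_cpu_stats(stats_text)
    return sorted(name for name, cpu in stats.items() if cpu >= threshold)


def read_docker_stats():
    """Take one snapshot of per-container CPU usage from the Docker CLI."""
    result = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.Name}} {{.CPUPerc}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Usage, with the engine containers already running:
#     print(busy_containers(read_docker_stats()))
```

A single snapshot can be misleading during startup, so it is probably worth sampling a few times before concluding that a container is stuck.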

alexpalms commented 1 year ago

@amit-gshe thanks a lot for the feedback. Good that you could confirm the authentication is now more robust. Regarding the CPU problem that is preventing training from starting, I would like to ask you for an additional test: could you try a few other games, in particular

and see if the problem is still there for them as well?

amit-gshe commented 1 year ago

@alexpalms I tried several of the other games above (doapp, sfiii3n, tektagt); all have the same problem.

alexpalms commented 1 year ago

@amit-gshe ok, thanks for this additional test. We will review the whole thread and your inputs to gain some insight into what might be happening and how to reproduce it. We will keep you posted. If you come across additional clues or elements, do not hesitate to post them here!

alexpalms commented 1 year ago

Hey @amit-gshe, I worked a bit on the engine to improve startup speed and robustness. I pushed a new custom engine to my personal Docker Hub so that you can test it. The engine docker image is called alexpalmas/engine:robust and you can pass it directly to our command line interface as follows:

diambra run -d -s=10 --env.image alexpalmas/engine:robust python kof_opt.py

Note that you have to remove the -n option from your original command, as it could prevent the image from being pulled.

It would be great if you could use it to test your system with a few of the games: kof98umh, but also others such as doapp, sfiii3n, and tektagt.

Looking forward to hearing your feedback!

amit-gshe commented 1 year ago

@alexpalms Thank you for your work. I just tried the image you provided, and the problem of some containers being stuck at 100% CPU is still not resolved. The container log is the same as the one provided above.

alexpalms commented 1 year ago

@amit-gshe can you post the full log containing the final error of the containers? I would like to see the error

alexpalms commented 1 year ago

@amit-gshe I just pushed a new version of the same engine docker image alexpalmas/engine:robust with some additional logging info. Could you please re-run it and share here the full log of the hanging containers? If feasible, please wait until the container fails.
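
Since the CLI interleaves the output of all engine containers, a small helper can make the hanging ones easier to spot. The sketch below assumes engine lines look like `🏟 (450e) Authorization granted.`, as in the logs in this thread, and reports the last message each container printed. The function name is hypothetical, and unprefixed lines (banner, ALSA continuation lines) are simply ignored.

```python
import re
from collections import OrderedDict

# Engine lines start with the stadium emoji, then a short container ID in
# parentheses, then the message. This prefix format is inferred from the
# logs posted in this thread and may differ in other CLI versions.
ENGINE_LINE = re.compile(r"^\U0001F3DF\uFE0F?\s*\(([0-9a-f]+)\)\s*(.*)$")


def last_message_per_container(log_text):
    """Map each container short ID to the last non-empty message it logged."""
    last = OrderedDict()
    for raw in log_text.splitlines():
        match = ENGINE_LINE.match(raw.strip())
        if not match:
            continue  # CLI lines and unprefixed continuation lines
        container_id, message = match.group(1), match.group(2).strip()
        if message:
            last[container_id] = message
    return last


# Usage, with the CLI output saved to a file (the filename is an example):
#     with open("diambra_run.log", encoding="utf-8") as fh:
#         for cid, msg in last_message_per_container(fh.read()).items():
#             print(cid, "->", msg)
```

A container whose last message lags behind the others (for example, it never reaches "Registering screen ... done.") is a likely candidate for the hang.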

amit-gshe commented 1 year ago

@alexpalms Sorry for taking so long to reply to you. I tried your latest image, and now the hung container never seems to exit with an error. Below are the output of docker stats (image) and the full script log:

(diambra-arena-sb3) ➜  ai diambra run -d -s=8 --env.image alexpalmas/engine:robust python3 kof_cnn.py
🖥
🖥  Starting DIAMBRA environment:
🖥  starting diambra
🖥  Request
🖥  logged in
🖥  creating env container
robust: Pulling from alexpalmas/engine
Digest: sha256:99d49321313d5ecb73e163b2da94aeec516396bb25778b8d03fa898abde2a349
Status: Image is up to date for alexpalmas/engine:robust
🖥  mapping port
🖥  adding bind mount
🖥  adding bind mount
🖥  creating container
🖥  container running
🖥  (84c4) started env container
🖥  waiting for grpc
Stored credentials found.
Authorization granted.
Server listening on 0.0.0.0:50051
🖥  closing streamer
🖥  closing
🖥  logs copying..
🖥  creating env container
🖥  mapping port
🖥  adding bind mount
🖥  adding bind mount
🖥  in go func
🖥  creating container
🖥  copying logs in LogLogs
🖥  container running
🖥  (450e) started env container
🖥  logs copying..
🖥  creating env container
🖥  mapping port
🖥  adding bind mount
🖥  adding bind mount
🖥  in go func
🖥  creating container
🖥  copying logs in LogLogs
🏟 (450e) Stored credentials found.                                                                                             
🖥  container running
🖥  (8eda) started env container
🖥  logs copying..
🖥  creating env container
🖥  mapping port
🖥  adding bind mount
🖥  adding bind mount
🖥  creating container
🖥  in go func
🖥  copying logs in LogLogs
🏟 (8eda) Stored credentials found.                                                                                             
🖥  container running
🖥  (b6b3) started env container
🖥  logs copying..
🖥  creating env container
🖥  mapping port
🖥  adding bind mount
🖥  adding bind mount
🖥  in go func
🖥  creating container
🖥  copying logs in LogLogs
🏟 (b6b3) Stored credentials found.                                                                                             
🖥  container running
🖥  (c7f0) started env container
🖥  logs copying..
🖥  creating env container
🖥  mapping port
🖥  in go func
🖥  adding bind mount
🖥  adding bind mount
🖥  creating container
🖥  copying logs in LogLogs
🏟 (c7f0) Stored credentials found.                                                                                             
🖥  container running
🖥  (e3fc) started env container
🖥  logs copying..
🖥  creating env container
🖥  mapping port
🖥  adding bind mount
🖥  adding bind mount
🖥  creating container
🖥  in go func
🖥  copying logs in LogLogs
🏟 (e3fc) Stored credentials found.                                                                                             
🖥  container running
🖥  (4360) started env container
🖥  logs copying..
🖥  creating env container
🖥  mapping port
🖥  adding bind mount
🖥  adding bind mount
🖥  in go func
🖥  creating container
🖥  copying logs in LogLogs
🏟 (4360) Stored credentials found.                                                                                             
🏟 (b6b3) Authorization granted.                                                                                                
🏟 (b6b3) Server listening on 0.0.0.0:50051                                                                                     
🖥  container running
🖥  (c848) started env container
🖥  logs copying..
🖥  DIAMBRA environment started
🖥  in go func
🖥  running command
🖥  copying logs in LogLogs
🏟 (c848) Stored credentials found.                                                                                             
🏟 (450e) Authorization granted.                                                                                                
🏟 (450e) Server listening on 0.0.0.0:50051                                                                                     
🏟 (8eda) Authorization granted.                                                                                                
🏟 (8eda) Server listening on 0.0.0.0:50051                                                                                     
INFO:diambra.arena.engine.interface:Trying to connect to DIAMBRA Engine server 127.0.0.1:32980 (timeout=60s)...                
INFO:diambra.arena.engine.interface:... done.
INFO:diambra.arena.engine.interface:Trying to connect to DIAMBRA Engine server 127.0.0.1:32979 (timeout=60s)...
INFO:diambra.arena.engine.interface:... done.
INFO:diambra.arena.engine.interface:Trying to connect to DIAMBRA Engine server 127.0.0.1:32981 (timeout=60s)...
INFO:diambra.arena.engine.interface:... done.
INFO:diambra.arena.engine.interface:Trying to connect to DIAMBRA Engine server 127.0.0.1:32978 (timeout=60s)...
INFO:diambra.arena.engine.interface:... done.
🏟 (c7f0) Authorization granted.                                                                                                
🏟 (c7f0) Server listening on 0.0.0.0:50051                                                                                     
🏟 (e3fc) Authorization granted.                                                                                                
🏟 (e3fc) Server listening on 0.0.0.0:50051                                                                                     
INFO:diambra.arena.engine.interface:Trying to connect to DIAMBRA Engine server 127.0.0.1:32982 (timeout=60s)...                
INFO:diambra.arena.engine.interface:... done.
🏟 (4360) Authorization granted.                                                                                                
🏟 (4360) Server listening on 0.0.0.0:50051                                                                                     
INFO:diambra.arena.engine.interface:Trying to connect to DIAMBRA Engine server 127.0.0.1:32983 (timeout=60s)...                
INFO:diambra.arena.engine.interface:... done.
INFO:diambra.arena.engine.interface:Trying to connect to DIAMBRA Engine server 127.0.0.1:32984 (timeout=60s)...
INFO:diambra.arena.engine.interface:... done.
🏟 (c848) Authorization granted.                                                                                                
🏟 (c848) Server listening on 0.0.0.0:50051                                                                                     
INFO:diambra.arena.engine.interface:Trying to connect to DIAMBRA Engine server 127.0.0.1:32985 (timeout=60s)...                
INFO:diambra.arena.engine.interface:... done.
🏟 (8eda) Environment initialization ...                                                                                        
🏟 (450e) Environment initialization ...                                                                                        
🏟 (b6b3) Environment initialization ...                                                                                        
🏟 (84c4)                                                                                                                       
-----------------------------------------------------------------------------------------------------------------------------------------------------------------                                                                                             
      .:-:-**#*#####+***+=-:.                                                                                                  
 ..-++####+#################=+=:.                                                                                              
:+*#*###########################*-.                                                                                            
   .-+#############################=.                                                                                          
      .-*###########################+.                                                                                         
        .=######++======++*#########*#=. ........ ...     .......     ...........     ........     ........ ............     ............         ...........                                                                                                 
          -*=------:---------=*########- .------..-----:. .------.   .-----------.    --------.   :-------- --------------:. --------------:.    :----------:                                                                                                 
          .:--------:---------:=#######* .------..-------..------.   :-----:-----:.   ---------. .--------- ------:..:-----: ------:..------:   .------:-----.                                                                                                
        .:------::---::--------:+#######..------. .------..------.  .------.:-----.   ----------.---------- ------:..------. ------:  :-----.  .:-----:.------.                                                                                               
        :-----::.:---:-----::---:######* .------. .------..------. .:-----: .------.  ------:-------------- ------::-----:.. ------:.-----:.   .------..------.                                                                                               
🏟 (84c4)        .--:..  . :::.:.---.-:---.#####*- .------. .------..------. .------.::------:  ------:.-----.:------ ------:  :-----: ------:.:------: .------:.:-------.                                                                                     
        ..:..:::.:::.::....:.::--:####*. .------::------:..------..------:.:::------. ------: .---. :-----: -------::------: ------:  ------: :------..::------:                                                                                              
       .-::::::---...:--. .-.:-::=###+.  .::::::::::::..  .::::::..::::::.   .::::::. ::::::.  .:.  ::::::: ::::::::::::::.  :::::::  ::::::: ::::::.    .::::::                                                                                              
       .--:.   .:--...::. ..::+####*:                                                                                          
       :--:..   .---:...:.   .:+*+:.                                                                                           
       .:---:::----.. ....    .:.                                                                                              
        .:------:..:...  :.                                                                                                    
         .......   ..:.  ..                                                                                                    

                                                                   DIAMBRA™ | Dueling AI Arena
                                                              https://diambra.ai - info@diambra.ai                             

                                   Usage of this software is subject to our Terms of Use described at https://diambra.ai/terms 

                                                               DIAMBRA, Inc. © Copyright 2018-2023

-----------------------------------------------------------------------------------------------------------------------------------------------------------------                                                                                             

Environment initialization ...                                                                                                 
🏟 (c7f0) Environment initialization ...                                                                                        
🏟 (b6b3) SHA256 check ok. Correct rom file found.                                                                              
🏟 (8eda) SHA256 check ok. Correct rom file found.                                                                              
🏟 (450e) SHA256 check ok. Correct rom file found.                                                                              
🏟 (b6b3) 1 Completed console init                                                                                              
🏟 (84c4) SHA256 check ok. Correct rom file found.                                                                              
🏟 (8eda) 1 Completed console init                                                                                              
🏟 (450e) 1 Completed console init                                                                                              
🏟 (84c4) 1 Completed console init                                                                                              
🏟 (8eda) Fontconfig error: Cannot load default config file                                                                     
🏟 (b6b3) Fontconfig error: Cannot load default config file                                                                     
🏟 (450e) Fontconfig error: Cannot load default config file                                                                     
🏟 (84c4) Fontconfig error: Cannot load default config file                                                                     
🏟 (84c4) Warning: -video none doesn't make much sense without -seconds_to_run                                                  
🏟 (8eda) Warning: -video none doesn't make much sense without -seconds_to_run                                                  
🏟 (84c4) ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf                               
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default                                                                 
🏟 (8eda) ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf                               
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default                                                                 
🏟 (b6b3) Warning: -video none doesn't make much sense without -seconds_to_run                                                  
🏟 (b6b3) ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf                               
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default                                                                 
🏟 (450e) Warning: -video none doesn't make much sense without -seconds_to_run                                                  
🏟 (450e) ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf                               
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default                                                                 
🏟 (c7f0) SHA256 check ok. Correct rom file found.                                                                              
🏟 (c7f0) 1 Completed console init                                                                                              
🏟 (c7f0) Fontconfig error: Cannot load default config file                                                                     
🏟 (c7f0) Warning: -video none doesn't make much sense without -seconds_to_run                                                  
🏟 (c7f0) ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf                               
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default                                                                 
🏟 (e3fc) Environment initialization ...                                                                                        
🏟 (4360) Environment initialization ...                                                                                        
🏟 (450e) Registering screen ... done.                                                                                          
🏟 (8eda) Registering screen ... done.                                                                                          
🏟 (84c4) Registering screen ... done.                                                                                          
🏟 (e3fc) SHA256 check ok. Correct rom file found.                                                                              
🏟 (e3fc) 1 Completed console init                                                                                              
🏟 (4360) SHA256 check ok. Correct rom file found.                                                                              
🏟 (4360) 1 Completed console init                                                                                              
🏟 (e3fc) Fontconfig error: Cannot load default config file                                                                     
🏟 (4360) Fontconfig error: Cannot load default config file                                                                     
🏟 (e3fc) Warning: -video none doesn't make much sense without -seconds_to_run                                                  
🏟 (e3fc) ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf                               
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default                                                                 
🏟 (4360) Warning: -video none doesn't make much sense without -seconds_to_run                                                  
🏟 (4360) ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf                               
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default                                                                 
🏟 (c848) Environment initialization ...                                                                                        
🏟 (8eda) Registering audio ... done.                                                                                           
🏟 (450e) Registering audio ... done.                                                                                           
🏟 (84c4) Registering audio ... done.                                                                                           
🏟 (4360) Registering screen ... done.                                                                                          
🏟 (e3fc) Registering screen ... done.                                                                                          
🏟 (c848) SHA256 check ok. Correct rom file found.                                                                              
🏟 (c848) 1 Completed console init                                                                                              
🏟 (c848) Fontconfig error: Cannot load default config file                                                                     
🏟 (c848) Warning: -video none doesn't make much sense without -seconds_to_run                                                  
🏟 (c848) ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf                               
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default                                                                 
🏟 (e3fc) Registering audio ... done.                                                                                           
🏟 (c848) Registering screen ... done.                                                                                          
🏟 (4360) Registering audio ... done.                                                                                           
🏟 (c848) Registering audio ... done.                                                                                           
🏟 (b6b3) Unable to create history.db                                                                                           
🏟 (b6b3) Unable to create history.db                                                                                           
🏟 (b6b3) Unable to create history.db                                                                                           
🏟 (450e) Unable to create history.db                                                                                           
🏟 (450e) Unable to create history.db                                                                                           
🏟 (450e) Unable to create history.db                                                                                           
🏟 (8eda) Unable to create history.db                                                                                           
🏟 (8eda) Unable to create history.db                                                                                           
🏟 (8eda) Unable to create history.db                                                                                           
🏟 (84c4) Unable to create history.db                                                                                           
Unable to create history.db                                                                                                    
Unable to create history.db                                                                                                    
🏟 (e3fc) Unable to create history.db                                                                                           
🏟 (e3fc) Unable to create history.db                                                                                           
🏟 (e3fc) Unable to create history.db                                                                                           
🏟 (c7f0) Unable to create history.db                                                                                           
🏟 (c7f0) Unable to create history.db                                                                                           
🏟 (c7f0) Unable to create history.db                                                                                           
🏟 (8eda) Registering program ... done.                                                                                         
🏟 (c848) Unable to create history.db                                                                                           
🏟 (c848) Unable to create history.db                                                                                           
Unable to create history.db                                                                                                    
🏟 (450e) Registering program ... done.                                                                                         
🏟 (e3fc) Registering program ... done.                                                                                         
🏟 (4360) Unable to create history.db                                                                                           
🏟 (4360) Unable to create history.db                                                                                           
🏟 (4360) Unable to create history.db                                                                                           
🏟 (84c4) Registering program ... done.                                                                                         
🏟 (84c4) Num. of Channels = 4                                                                                                  
Screen Dim (W x H) = 320 240                                                                                                   
🏟 (84c4) Warning: Cannot convert ENGINE_RECORDER_COMPRESSION env variable to integer, value: , using default value             
(Recorder) Frame encoding enabled.                                                                                             
(Recorder) Compression quality: 95                                                                                             
(0)Buttons configuration:                                                                                                      
(0)  SK = But4                                                                                                                 
(0)  SP = But3                                                                                                                 
(0)  WK = But2                                                                                                                 
(0)  WP = But1                                                                                                                 
🏟 (84c4) (0)Game Continue Val = 0                                                                                              
(0)Show final = 0                                                                                                              
(0)Characters = [ [Kyo, Andy, Joe], [Kyo, Andy, Joe] ]                                                                         
(0)1P Environment                                                                                                              
(0)Player side = P1                                                                                                            
(0)Number of outfits = [1, 1]                                                                                                  
🏟 (84c4) done.                                                                                                                 
Native frame shape = [240 X 320 X 4]                                                                                           
User defined frame_shape = [128 X 128 X 1]                                                                                     
Resize flag = 1                                                                                                                
Grayscale flag = 1                                                                                                             
🏟 (c848) Registering program ... done.                                                                                         
INFO:diambra.arena.arena_gym:EnvironmentSettings1P(game_id='kof98umh', step_ratio=6, disable_keyboard=True, disable_joystick=True, rank=0, seed=98, env_address='127.0.0.1:32978', grpc_timeout=60, player='P1', continue_game=0.0, show_final=False, difficulty=1, frame_shape=(128, 128, 1), tower=3, hardcore=True, characters=('Kyo', 'Andy', 'Joe'), char_outfits=1, action_space='multi_discrete', attack_but_combination=True, super_art=0, fighting_style=1, ultimate_style=(0, 0, 0))
Training a new model
Using cpu device
Wrapping the env in a VecTransposeImage.
🏟 (c848) Num. of Channels = 4                                                                                                  
Screen Dim (W x H) = 320 240                                                                                                   
🏟 (c848) Warning: Cannot convert ENGINE_RECORDER_COMPRESSION env variable to integer, value: , using default value             
(Recorder) Frame encoding enabled.                                                                                             
(Recorder) Compression quality: 95                                                                                             
(7)Buttons configuration:                                                                                                      
(7)  SK = But4                                                                                                                 
(7)  SP = But3                                                                                                                 
(7)  WK = But2                                                                                                                 
(7)  WP = But1                                                                                                                 
(7)Game Continue Val = 0                                                                                                       
(7)Show final = 0                                                                                                              
(7)Characters = [ [Kyo, Andy, Joe], [Kyo, Andy, Joe] ]                                                                         
(7)1P Environment                                                                                                              
(7)Player side = P1                                                                                                            
(7)Number of outfits = [1, 1]                                                                                                  
done.                                                                                                                          
🏟 (c848) Native frame shape = [240 X 320 X 4]                                                                                  
User defined frame_shape = [128 X 128 X 1]                                                                                     
Resize flag = 1                                                                                                                
Grayscale flag = 1                                                                                                             
INFO:diambra.arena.arena_gym:EnvironmentSettings1P(game_id='kof98umh', step_ratio=6, disable_keyboard=True, disable_joystick=True, rank=7, seed=105, env_address='127.0.0.1:32985', grpc_timeout=60, player='P1', continue_game=0.0, show_final=False, difficulty=1, frame_shape=(128, 128, 1), tower=3, hardcore=True, characters=('Kyo', 'Andy', 'Joe'), char_outfits=1, action_space='multi_discrete', attack_but_combination=True, super_art=0, fighting_style=1, ultimate_style=(0, 0, 0))
🏟 (4360) Registering program ... done.                                                                                         
🏟 (4360) Num. of Channels = 4                                                                                                  
Screen Dim (W x H) = 320 240                                                                                                   
🏟 (4360) Warning: Cannot convert ENGINE_RECORDER_COMPRESSION env variable to integer, value: , using default value             
(Recorder) Frame encoding enabled.                                                                                             
(Recorder) Compression quality: 95                                                                                             
(6)Buttons configuration:                                                                                                      
(6)  SK = But4                                                                                                                 
(6)  SP = But3                                                                                                                 
(6)  WK = But2                                                                                                                 
(6)  WP = But1                                                                                                                 
(6)Game Continue Val = 0                                                                                                       
(6)Show final = 0                                                                                                              
(6)Characters = [ [Kyo, Andy, Joe], [Kyo, Andy, Joe] ]                                                                         
(6)1P Environment                                                                                                              
(6)Player side = P1                                                                                                            
(6)Number of outfits = [1, 1]                                                                                                  
🏟 (4360) done.                                                                                                                 
Native frame shape = [240 X 320 X 4]                                                                                           
User defined frame_shape = [128 X 128 X 1]                                                                                     
Resize flag = 1                                                                                                                
Grayscale flag = 1                                                                                                             
INFO:diambra.arena.arena_gym:EnvironmentSettings1P(game_id='kof98umh', step_ratio=6, disable_keyboard=True, disable_joystick=True, rank=6, seed=104, env_address='127.0.0.1:32984', grpc_timeout=60, player='P1', continue_game=0.0, show_final=False, difficulty=1, frame_shape=(128, 128, 1), tower=3, hardcore=True, characters=('Kyo', 'Andy', 'Joe'), char_outfits=1, action_space='multi_discrete', attack_but_combination=True, super_art=0, fighting_style=1, ultimate_style=(0, 0, 0))
amit-gshe commented 1 year ago

I tried again and found that the logs differ from container to container. The engine containers produce several different outputs.

hanging containers [high cpu usage]:

(diambra-arena-sb3) ➜  ai docker logs 727                 
Stored credentials found.
Authorization granted.
Server listening on 0.0.0.0:50051
Environment initialization ...
SHA256 check ok. Correct rom file found.
1 Completed console init
Fontconfig error: Cannot load default config file
Warning: -video none doesn't make much sense without -seconds_to_run
ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default
Unable to create history.db
Unable to create history.db
Unable to create history.db

hanging containers [low cpu usage]:

(diambra-arena-sb3) ➜  ai docker logs 25a
Stored credentials found.
Authorization granted.
Server listening on 0.0.0.0:50051
Environment initialization ...
SHA256 check ok. Correct rom file found.
1 Completed console init
Fontconfig error: Cannot load default config file
Warning: -video none doesn't make much sense without -seconds_to_run
ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default
Registering screen ... done.
Registering audio ... done.
Unable to create history.db
Unable to create history.db
Unable to create history.db
Registering program ... done.

normal containers:

(diambra-arena-sb3) ➜  ai docker logs c6a
Stored credentials found.
Authorization granted.
Server listening on 0.0.0.0:50051
Environment initialization ...
SHA256 check ok. Correct rom file found.
1 Completed console init
Fontconfig error: Cannot load default config file
Warning: -video none doesn't make much sense without -seconds_to_run
ALSA lib conf.c:4553:(snd_config_update_r) Cannot access file /usr/share/alsa/alsa.conf
ALSA lib seq.c:935:(snd_seq_open_noupdate) Unknown SEQ default
Registering screen ... done.
Registering audio ... done.
Unable to create history.db
Unable to create history.db
Unable to create history.db
Registering program ... done.
Num. of Channels = 4
Screen Dim (W x H) = 320 240
Warning: Cannot convert ENGINE_RECORDER_COMPRESSION env variable to integer, value: , using default value
(Recorder) Frame encoding enabled.
(Recorder) Compression quality: 95
(2)Buttons configuration:
(2)  SK = But4
(2)  SP = But3
(2)  WK = But2
(2)  WP = But1
(2)Game Continue Val = 0
(2)Show final = 0
(2)Characters = [ [Kyo, Andy, Joe], [Kyo, Andy, Joe] ]
(2)1P Environment
(2)Player side = P1
(2)Number of outfits = [1, 1]
done.
Native frame shape = [240 X 320 X 4]
User defined frame_shape = [128 X 128 X 1]
Resize flag = 1
Grayscale flag = 1
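
The three excerpts above differ mainly in how far each container got through startup. A minimal sketch for sorting many container logs by that criterion follows. The marker strings are copied from the logs in this thread, while the function name and the idea of treating "Native frame shape" as the last startup step are assumptions made for illustration, not part of DIAMBRA.

```python
# Startup markers in the order they appear in a normal container's log.
# The strings are copied from the logs quoted in this thread.
STAGES = [
    "Server listening on",
    "Environment initialization ...",
    "1 Completed console init",
    "Registering screen ... done.",
    "Registering audio ... done.",
    "Registering program ... done.",
    "Native frame shape",
]


def last_stage_reached(log_text: str) -> str:
    """Return the latest marker in STAGES that occurs in log_text.

    Returns "none" when no marker is present. Assumption: a container
    that printed "Native frame shape" finished starting up.
    """
    reached = "none"
    for marker in STAGES:
        if marker in log_text:
            reached = marker
    return reached
```

Fed with the output of docker logs for each container, this would report "1 Completed console init" for the high-CPU excerpt, "Registering program ... done." for the low-CPU one, and "Native frame shape" for the normal one.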
alexpalms commented 1 year ago

@amit-gshe Thanks a lot for your detailed feedback. It is really hard to say what is going wrong. I tested your training script with 10 and 20 envs on my local machine, which has 6 CPUs/12 threads and 64 GB of RAM, and both worked fine (screenshot: vecenv_screenshot_17 05 2023).

Not being able to replicate the problem makes it hard to spot the issue. I would like to ask you two things for the hanging containers with high CPU load:

1) Access the hanging containers and run top there to find the process that is using the CPU (and share the results).
2) List the files in /tmp/DIAMBRA/pipes inside the container (and share the results).

The reason is that I would like to understand which process is pinning the CPU, and also to make sure the named FIFO pipes are being created, as I suspect the pipes are what is blocking it.
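
Both checks can also be driven from the host. The sketch below is only an illustration under stated assumptions: it assumes the docker CLI is on the PATH and that ls is available inside the engine container, and the helper names are hypothetical. The path /tmp/DIAMBRA/pipes is the one quoted above. In ls -l output, a named pipe is the entry whose mode string starts with p.

```python
import subprocess
from typing import List

PIPES_DIR = "/tmp/DIAMBRA/pipes"  # path quoted in this thread


def diagnostic_commands(container_id: str) -> List[List[str]]:
    """Build the host-side commands; nothing is executed here."""
    return [
        # One-shot CPU/memory snapshot for the container.
        ["docker", "stats", "--no-stream", container_id],
        # Long listing of the pipes directory (assumes `ls` exists inside).
        ["docker", "exec", container_id, "ls", "-l", PIPES_DIR],
    ]


def fifo_names(ls_output: str) -> List[str]:
    """Extract the names of named pipes from `ls -l` output.

    A FIFO is listed with a mode string starting with "p",
    for example "prw-r--r--". Names with spaces are not handled.
    """
    names = []
    for line in ls_output.splitlines():
        fields = line.split()
        if len(fields) >= 9 and fields[0].startswith("p"):
            names.append(fields[-1])
    return names


def run_diagnostics(container_id: str) -> None:
    """Run each command and print whatever it returned."""
    for argv in diagnostic_commands(container_id):
        print("$", " ".join(argv))
        result = subprocess.run(argv, capture_output=True, text=True)
        print(result.stdout or result.stderr)
```

Calling run_diagnostics("727") would print the snapshot and the listing for the container whose ID starts with 727; an empty result from fifo_names on that listing would suggest the pipes were never created.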

amit-gshe commented 1 year ago

@alexpalms I entered the abnormal container and followed your instructions. The result of the command is shown in the attached image.

alexpalms commented 1 year ago

@amit-gshe thanks for the feedback. The top (or, better, htop) command is not installed in the container, so you need to install it before running it, with apt-get install htop.

In addition, I just pushed a new image with more debug log info, named alexpalmas/engine:debug. Please give it a try and:

1) Share the logs of the hanging containers (both with high and low CPU usage).
2) Share the htop output from the hanging containers with high CPU usage.

amit-gshe commented 1 year ago

@alexpalms It's hard to install htop in the container because the bin directory does not contain the apt-get command (see attached image). Please give me some advice on how to install htop in the container. I tried docker cp htop-static-binnary container, but it didn't work and reported an error (see attached image).
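
One way to look at the processes without installing anything in the container is docker top, which lists a container's processes from the host. The sketch below is a hedged illustration, not something proposed in this thread: it assumes the docker CLI is on the PATH, the helper names are hypothetical, and the %CPU column reported by ps is an average over the process lifetime rather than an instantaneous reading.

```python
import subprocess
from typing import List, Optional, Tuple


def top_command(container_id: str) -> List[str]:
    """Build `docker top <id> aux`, which lists processes from the host side."""
    return ["docker", "top", container_id, "aux"]


def busiest_process(ps_output: str) -> Optional[Tuple[str, float, str]]:
    """Return (pid, cpu_percent, command) for the row with the highest %CPU.

    Expects `ps aux` style text with a header row containing PID and %CPU.
    Returns None when there is no usable row.
    """
    lines = [ln for ln in ps_output.splitlines() if ln.strip()]
    if not lines:
        return None
    header = lines[0].split()
    pid_col = header.index("PID")
    cpu_col = header.index("%CPU")
    best = None
    for line in lines[1:]:
        # Split so the last column (COMMAND) keeps its embedded spaces.
        fields = line.split(None, len(header) - 1)
        if len(fields) < len(header):
            continue
        cpu = float(fields[cpu_col])
        if best is None or cpu > best[1]:
            best = (fields[pid_col], cpu, fields[-1])
    return best


def inspect(container_id: str) -> Optional[Tuple[str, float, str]]:
    """Run docker top for the container and report its busiest process."""
    result = subprocess.run(top_command(container_id), capture_output=True, text=True)
    return busiest_process(result.stdout)
```

For example, inspect("727") would return the PID, CPU percentage and command line of the busiest process in that container, which could then be shared in place of the htop output.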

amit-gshe commented 1 year ago

@alexpalms Hello. When I was about to install linux-perf to analyze the 100% CPU usage, I found that there was no linux-perf package matching the kernel I was using. So I tried another kernel version, and the problem disappeared. The kernel I used before was linux-headers-5.18.17-amd64-desktop-community-hwe. When I switched to linux-image-5.15.77-amd64-desktop, I could start training smoothly. This is a bit strange.

Once I found that the problem was related to the kernel, I tried another one, linux-headers-5.18.17-amd64-desktop-hwe, and the training started normally. So I started to compare the two kernels, linux-headers-5.18.17-amd64-desktop-hwe and linux-headers-5.18.17-amd64-desktop-community-hwe.

The attached image shows the diff of the .config files of these two kernels. Several of the differing options concern CGROUP, but I don't understand them well, and it is not clear whether they prevent the env container from being created.
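
A comparison like this can be narrowed to the cgroup options with a few lines of Python. This is a minimal sketch that relies only on the standard kernel .config syntax (CONFIG_X=value, or "# CONFIG_X is not set"). The function names are hypothetical, and the location of the config files (often /boot/config-<version>) is an assumption that depends on the distribution.

```python
from typing import Dict, Optional, Tuple


def parse_kernel_config(text: str) -> Dict[str, str]:
    """Map each option name to its value; disabled options map to "n"."""
    options = {}
    suffix = " is not set"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            # "# CONFIG_FOO is not set" -> CONFIG_FOO disabled
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_configs(
    a_text: str, b_text: str, keyword: str = ""
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return options containing `keyword` whose values differ.

    Each entry is (value_in_a, value_in_b); None means the option
    does not appear in that file at all.
    """
    a, b = parse_kernel_config(a_text), parse_kernel_config(b_text)
    changed = {}
    for name in sorted(set(a) | set(b)):
        if keyword in name and a.get(name) != b.get(name):
            changed[name] = (a.get(name), b.get(name))
    return changed
```

Reading the two files with open(...).read() and calling diff_configs(a_text, b_text, "CGROUP") would list only the cgroup-related differences, which is easier to share as text than a screenshot.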

Anyway, after I switched kernels, I was able to start training without any problems. Of course I am also happy to provide further information on the above issues.

alexpalms commented 1 year ago

@amit-gshe this is so interesting! And you did a hell of a debugging job! Thanks a lot. I am not familiar with these aspects either; maybe @discordianfish can spot something about the subtle behavior you found? In the meantime I will try one last thing I have in mind that could cause problems in case of race conditions, and then I will merge the new engine into the official one. I will remove the debug-tagged image and leave the robust one until I cut a new official release.

alexpalms commented 1 year ago

@amit-gshe I did the additional test I wanted to do and everything looks fine. So this new engine seems ready to be merged. It will be released in the coming days after a few tests. Thanks a lot for your support. I will keep this issue open in case we better understand what happened, but I am happy you managed to solve the issue and can run the environments smoothly!

discordianfish commented 1 year ago

Yeah, really hard to say... but probably some kernel bug? With the cgroup stuff, things should either work or not, not cause this behavior. But good that it's fixed! Let's close this issue; it will still be around, and we can re-open it if it happens to other people as well.