Closed: yoyoyo-yo closed this 3 years ago
Thank you for trying it out!
Yes, you are right about restart_epoch. Perhaps when you restart the training, the minimum_episodes parameter should be increased a bit for stable training (i.e. so that there is more variety of episodes in the buffer when the model updates restart). But this depends on the game...
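For example, in this case the restart config might look roughly like the following (the raised minimum_episodes value is only an illustration, not a recommendation; tune it for your game):

train_args:
    restart_epoch: 690      # epoch of the last saved model, so training continues from there
    minimum_episodes: 800   # example value, raised from 400 so the refilled buffer holds enough varied episodes before updates resume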
I trained this, but training stopped at epoch 690. I don't know why this happened.
Do you have any logs from when the error occurred?
Thanks!
@ikki407 Thank you!
No error occurred, but the training process stopped. I tried the Kaggle Hungry Geese environment. Below is my config.yaml:
env_args:
    #env: 'TicTacToe'
    #source: 'handyrl.envs.tictactoe'
    #env: 'Geister'
    #source: 'handyrl.envs.geister'
    env: 'HungryGeese'
    source: 'handyrl.envs.kaggle.hungry_geese'

train_args:
    turn_based_training: False
    observation: False
    gamma: 0.8
    forward_steps: 16
    compress_steps: 4
    entropy_regularization: 1.0e-1
    entropy_regularization_decay: 0.1
    update_episodes: 200
    batch_size: 256
    minimum_episodes: 400
    maximum_episodes: 100000
    num_batchers: 2
    eval_rate: 0.1
    worker:
        num_parallel: 6
    lambda: 0.7
    policy_target: 'TD' # 'UGPO' 'VTRACE' 'TD' 'MC'
    value_target: 'TD' # 'VTRACE' 'TD' 'MC'
    seed: 0
    restart_epoch: 0

worker_args:
    server_address: ''
    num_parallel: 8
Thanks. Your config looks right.
I think this may be a connection problem between the learner and the workers. In practice, restarting the training works well enough, so could you use that approach unless the process stops frequently?
Tip: if you have time, please try server mode, i.e. --train-server and --worker. This way, you can reconnect to the server from the client again after the process stops.
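Roughly, it looks like this (the server address is a placeholder, and this assumes the standard main.py entry point):

# On the learner machine:
#     python main.py --train-server
# On each worker machine, point config.yaml at the learner and run:
#     python main.py --worker
worker_args:
    server_address: '192.168.0.10'   # placeholder: address of the machine running --train-server
    num_parallel: 8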
Thanks!
I have the same issue: I started training 3 times, and each time it stopped saving models after epoch 693. The GPU was still in use even after several days, until I killed the process.
Hi @yoyoyo-yo and @qent. We have updated the model sending procedure, which should avoid the PyTorch shared-memory error. Please try again!
Closing this PR because #149 has been merged into master.
Thank you for the nice library!
I trained this, but training stopped at epoch 690. I don't know why this happened.
When that happened, I killed the job. Then I changed train_args: restart_epoch in config.yaml as below in order to continue training.
train_args:
    restart_epoch: 690
Is this the right method?
Thanks