DeNA / HandyRL

HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.

Training stops at epoch 690 #125

Closed: yoyoyo-yo closed this issue 3 years ago

yoyoyo-yo commented 3 years ago

Thank you for the nice library!

I am training with this, but training stops at epoch 690. I don't know why this happens.

When that happens, I kill the job. Then I change train_args: restart_epoch in config.yaml as below in order to continue training.

train_args:
    restart_epoch: 690

Is this the right method?

Thanks

ikki407 commented 3 years ago

Thank you for trying it out!

Yes, you are right about restart_epoch. When you restart the training, the minimum_episodes parameter should perhaps be increased a bit for stable training (i.e., the variation of the episodes in the buffer increases when the model updates restart). But this depends on the game, though...
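For example, the restart could be configured along these lines, keeping the rest of train_args as before (the increased minimum_episodes value is only illustrative; how much to raise it depends on your setup):

train_args:
    restart_epoch: 690      # continue from the model saved at epoch 690
    minimum_episodes: 800   # illustrative increase from 400, so the buffer holds more varied episodes before updates resume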

I am training with this, but training stops at epoch 690. I don't know why this happens.

Do you have any logs from when the error occurred?

Thanks!

yoyoyo-yo commented 3 years ago

@ikki407 Thank you!

No error occurred, but the training process stopped. I tried the Kaggle Hungry Geese environment. Below is my config.yaml:

env_args:
    #env: 'TicTacToe'
    #source: 'handyrl.envs.tictactoe'
    #env: 'Geister'
    #source: 'handyrl.envs.geister'
    env: 'HungryGeese'
    source: 'handyrl.envs.kaggle.hungry_geese'

train_args:
    turn_based_training: False
    observation: False
    gamma: 0.8
    forward_steps: 16
    compress_steps: 4
    entropy_regularization: 1.0e-1
    entropy_regularization_decay: 0.1
    update_episodes: 200
    batch_size: 256
    minimum_episodes: 400
    maximum_episodes: 100000
    num_batchers: 2
    eval_rate: 0.1
    worker:
        num_parallel: 6
    lambda: 0.7
    policy_target: 'TD' # 'UPGO' 'VTRACE' 'TD' 'MC'
    value_target: 'TD' # 'VTRACE' 'TD' 'MC'
    seed: 0
    restart_epoch: 0

worker_args:
    server_address: ''
    num_parallel: 8

ikki407 commented 3 years ago

Thanks. Your config looks right.

I think this may be a connection problem between the learner and the workers. In practice, restarting the training works well enough, so could you use that approach unless the process stops frequently?

Tip: if you have time, please try server mode, that is, --train-server and --worker. That way, you can connect to the server from a client again after the process stops.
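For example, a worker-side config sketch would point server_address at the machine running the training server (the address below is just a placeholder):

worker_args:
    server_address: '192.168.0.10'  # placeholder: address of the machine running the training server
    num_parallel: 8

Roughly, the learner side is then started with python main.py --train-server and each worker machine with python main.py --worker.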

Thanks!

qent commented 3 years ago

I have the same issue: I started training 3 times, and all 3 times it stopped saving models after epoch 693, but the GPU was actually still in use even after several days, until I killed the process.

YuriCat commented 3 years ago

Hi, @yoyoyo-yo and @qent. We have updated the model sending procedure, and it will avoid the PyTorch shared-memory error. Please try again!

ikki407 commented 3 years ago

Closing this issue because #149 has been merged into master.