Ladun / OffPolicy-PPO

Off-policy Proximal Policy Optimization implementation

Run the Pendulum-v0 environment with this off-policy PPO #1

Closed: Kkeirl closed this issue 3 months ago

Kkeirl commented 3 months ago

Hi, what should I do to run the Pendulum-v0 environment with this off-policy PPO code of yours? Could you please share a simple modified example?

Ladun commented 3 months ago

I apologize for the delayed response. To run Pendulum-v0, you don't need to modify the code separately. You can proceed by creating a config for Pendulum-v0, similar to configs/Ant-v4.yaml.

Example:

device: "cpu"
seed: 77
env:
    env_name: "InvertedPendulum-v4"
    num_envs: 8
    is_continuous: True
    state_dim: 4
    action_dim: 1
checkpoint_path: "checkpoints/InvertedPendulum"
network:
    action_std_init: 0.4
    action_std_decay_rate: 0.03
    min_action_std: 0.1
    action_std_decay_freq: 1e5
    shared_layer: False
    optimizer:
        lr: 3e-4
train: 
    total_timesteps: 1000000
    max_episode_len: 1024
    gamma: 0.99
    tau: 0.95
    ppo:
        loss_type: clip
        optim_epochs: 10
        batch_size: 256    
        eps_clip: 0.2
        coef_value_function: 0.5
        coef_entropy_penalty: 0
        value_clipping: True
    reward_scaler: True
    observation_normalizer: False
    clipping_gradient: True
    scheduler: True
    average_interval: 100
    max_ckpt_count: 3
    advantage_type: 'gae'
    off_policy_buffer_size: 0
    fraction: 0
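
One thing worth noting: the example above targets InvertedPendulum-v4, whose observation is 4-dimensional. The classic pendulum task instead has a 3-dimensional observation (cos θ, sin θ, angular velocity) and a 1-dimensional action bounded to [-2, 2], so env_name, state_dim, and action_dim in the config would change accordingly. A quick way to confirm the dimensions, assuming a Gymnasium install (where the classic pendulum is registered as Pendulum-v1 rather than Pendulum-v0):

import gymnasium as gym

# Inspect the classic pendulum's spaces to fill in state_dim / action_dim.
# Recent Gymnasium releases register it as "Pendulum-v1"; the "Pendulum-v0"
# id only exists in older gym versions.
env = gym.make("Pendulum-v1")
print(env.observation_space.shape)                   # (3,)  -> state_dim: 3
print(env.action_space.shape)                        # (1,)  -> action_dim: 1
print(env.action_space.low, env.action_space.high)   # [-2.] [2.]
env.close()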
Kkeirl commented 3 months ago

Thank you very much for your reply.