AlignmentResearch / vlmrm

Unable to reproduce results #7

Open ruziniuuuuu opened 3 weeks ago

ruziniuuuuu commented 3 weeks ago

I've successfully set up the environment following the repository instructions, but I'm unable to reproduce the results reported in the paper: the policy learns nothing when trained with your ClipReward.

CLIP model used: ViT-g-14/laion2b_s34b_b88k
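
For reference, a minimal standalone check (using open_clip and gymnasium directly rather than vlmrm's ClipReward; the prompt strings are the ones from my Humanoid config below) can confirm that the checkpoint loads and produces sensible prompt similarities for a rendered frame:

import gymnasium as gym
import open_clip
import torch
from PIL import Image

# Load the same CLIP checkpoint as in the configs below.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-g-14", pretrained="laion2b_s34b_b88k", cache_dir=".cache"
)
tokenizer = open_clip.get_tokenizer("ViT-g-14")
model.eval()

# Render one frame from the environment.
env = gym.make("Humanoid-v4", render_mode="rgb_array")
env.reset(seed=42)
frame = env.render()  # H x W x 3 uint8 array

image = preprocess(Image.fromarray(frame)).unsqueeze(0)
text = tokenizer(["a humanoid robot running", "a humanoid robot"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    print(image_features @ text_features.T)  # cosine similarity to each prompt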

I've tested with two environments:

  1. Humanoid-v4
  2. CartPole-v1

Configuration used for Humanoid-v4:

env_name: Humanoid-v4 # RL environment name
base_path: runs/training # Base path to save logs and checkpoints
seed: 42 # Seed for reproducibility
description: Humanoid training using CLIP reward
tags: # Wandb tags
  - training
  - humanoid
  - CLIP
reward:
  name: clip
  pretrained_model: ViT-g-14/laion2b_s34b_b88k # CLIP model name
  # CLIP batch size per synchronous inference step. It must be divisible by
  # n_workers (GPU count) so that it can be shared among workers, and it must
  # be a divisor of n_envs * episode_length so that all batches have the same
  # size (variable batch sizes are not supported yet); see the sanity-check
  # sketch after this config.
  batch_size: 1600
  alpha: 0.5 # Alpha value of Baseline CLIP (CO-RELATE)
  target_prompts: # Description of the goal state
    - a humanoid robot running
  baseline_prompts: # Description of the environment
    - a humanoid robot
  # Path to pre-saved model weights. When executing multiple runs,
  # mount a volume to this path to avoid downloading the model
  # weights multiple times.
  cache_dir: .cache
rl:
  policy_name: MlpPolicy
  n_steps: 100000 # Total number of simulation steps to be collected.
  n_envs_per_worker: 2 # Number of environments per worker (GPU)
  episode_length: 200 # Desired episode length
  learning_starts: 100 # Number of env steps to collect before training
  train_freq: 200 # Number of collected env steps between training iterations
  batch_size: 64 # SAC buffer sample size per gradient step
  gradient_steps: 1 # Number of gradient steps per training iteration
  tau: 0.005 # SAC target network update rate
  gamma: 0.99 # SAC discount factor
  learning_rate: 3e-4 # SAC optimizer learning rate
logging:
  checkpoint_freq: 10000 # Number of env steps between checkpoints
  video_freq: 10000 # Number of env steps between videos
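
As a quick sanity check on the batch-size constraint described in the reward comment above, here is a standalone sketch; the worker count is an assumption (it depends on how many GPUs the run actually sees), and I am assuming n_envs means the total number of environments across workers:

# Encode the batch-size constraints stated in the config comments.
n_workers = 1            # assumed GPU count for a single-GPU run
n_envs_per_worker = 2    # from the config
episode_length = 200     # from the config
batch_size = 1600        # from the config

n_envs = n_workers * n_envs_per_worker
frames_per_step = n_envs * episode_length

assert batch_size % n_workers == 0, "batch_size must be divisible by n_workers"
assert frames_per_step % batch_size == 0, (
    f"batch_size must divide n_envs * episode_length = {frames_per_step}"
)

With a single worker these values give n_envs * episode_length = 400, so batch_size = 1600 only satisfies the stated divisibility constraint once there are at least 8 environments in total (e.g. 4 workers x 2 envs); maybe this matters for my setup, or maybe I am misreading the constraint.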

Configuration used for CartPole-v1:

env_name: CartPole-v1 # RL environment name
base_path: runs/training # Base path to save logs and checkpoints
seed: 42 # Seed for reproducibility
description: CartPole training using CLIP reward
tags: # Wandb tags
  - training
  - cartpole
  - CLIP
reward:
  name: clip
  pretrained_model: ViT-g-14/laion2b_s34b_b88k # CLIP model name
  # pretrained_model: CLIP-ViT-bigG-14-laion2B-39B-b160k # CLIP model name
  # CLIP batch size per synchronous inference step. It must be divisible by
  # n_workers (GPU count) so that it can be shared among workers, and it must
  # be a divisor of n_envs * episode_length so that all batches have the same
  # size (variable batch sizes are not supported yet).
  batch_size: 1600
  alpha: 0.5 # Alpha value of Baseline CLIP (CO-RELATE)
  target_prompts: # Description of the goal state
    - a pole vertically upright on top of the cart
  baseline_prompts: # Description of the environment
    - a pole
  # Path to pre-saved model weights. When executing multiple runs,
  # mount a volume to this path to avoid downloading the model
  # weights multiple times.
  cache_dir: .cache
rl:
  policy_name: MlpPolicy
  n_steps: 100000 # Total number of simulation steps to be collected.
  n_envs_per_worker: 2 # Number of environments per worker (GPU)
  episode_length: 200 # Desired episode length
  learning_starts: 100 # Number of env steps to collect before training
  train_freq: 200 # Number of collected env steps between training iterations
  batch_size: 64 # SAC buffer sample size per gradient step
  gradient_steps: 1 # Number of gradient steps per training iteration
  tau: 0.005 # SAC target network update rate
  gamma: 0.99 # SAC discount factor
  learning_rate: 3e-4 # SAC optimizer learning rate
logging:
  checkpoint_freq: 10000 # Number of env steps between checkpoints
  video_freq: 10000 # Number of env steps between videos
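
For completeness, this is how I currently understand alpha, target_prompts, and baseline_prompts to interact (goal-baseline regularization as I read it in the paper); the function below is only an illustrative sketch, not the actual code path inside your ClipReward, so please correct me if the real computation differs:

import torch

def goal_baseline_reward(state_emb, target_emb, baseline_emb, alpha):
    """Illustrative goal-baseline-regularized CLIP reward (my reading of the
    paper, not necessarily vlmrm's exact implementation).

    state_emb: (B, D) L2-normalized image embeddings.
    target_emb, baseline_emb: (D,) L2-normalized text embeddings of the
    target and baseline prompts. alpha in [0, 1] controls the projection.
    """
    direction = target_emb - baseline_emb
    direction = direction / direction.norm()
    # Project each state embedding onto the line through baseline and target.
    projection = baseline_emb + ((state_emb - baseline_emb) @ direction)[:, None] * direction
    # alpha = 0 keeps the raw embedding, alpha = 1 uses the full projection.
    regularized = alpha * projection + (1 - alpha) * state_emb
    # 1 minus half the squared distance to the target; with alpha = 0 and
    # unit-norm inputs this reduces to plain cosine similarity.
    return 1 - 0.5 * ((regularized - target_emb) ** 2).sum(dim=-1)

If that reading is roughly correct, the configs above should at least produce a non-degenerate reward signal, which is why I am surprised that training stays completely flat.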

Could you please share some example configs so that I can reproduce your results?