Training questions and LBC stage

jbarciv commented 5 months ago

Dear Simar et al.,

First of all, I would like to thank you for your research. I believe it is very well done and deserves to be studied carefully to learn from your perspectives, methods, and insights.

I am writing to ask you about some details and questions that seem to arise after a first reading of the README and reviewing the code:

In the third training phase, should python legged_gym/scripts/lbc.py --task=aliengo_lbc be used instead of python legged_gym/scripts/train.py --task=aliengo_lbc?
Additionally, if the "teacher policy" is to be defined from the console, should alt_ckpt be used? If I later want to use play to visualize and measure the training results, should I continue using alt_ckpt? I understand that it will be necessary to use --resume, as well as --load_run and --checkpoint, but I am not sure what to do regarding the training policy. I will try everything anyway, but could you confirm this?
Finally, although the scope of the research did not originally involve implementing it in hardware and therefore it is not mentioned in the README, could you give me any pointers on how the locomotion policy should be exported? And how to make it work outside of Isaac Gym?

That's all. I appreciate your time and dedication in advance.

Best regards,

jbarciv

SimarKareer commented 4 months ago

Hi @jbarciv thanks for your questions (also following up on your email)! So for the lbc training, it's important that we initialize the student from the teacher's weights. As you noted in your email, the student has a slightly different architecture than the teacher. Namely, the student has a vision based encoder, whereas the teacher has a depth map based encoder. Besides this difference, the remaining model weights can be shared. We load the appropriate weights from the teacher into the student here. Also, alt_ckpt is just used to test the baseline depth map policy here, so you shouldn't use that when training the lbc stage

SimarKareer commented 4 months ago

Also we began doing some real hardware experiments which didn't make the paper. Our codebase for that is here.

jbarciv commented 4 months ago

Hello Simar,

Thanks a lot for your mails and answers.

The problem remains in the LBC stage. I have cloned the current repo in Ubuntu 24.04 with Cuda 12.2 and Drivers 535.138.01. I am using a laptop GPU RTX 4060 .

I have followed all the setup.sh steps without any relevant error.

I have set the aliengo_lbc_config as follows:

class runner(LeggedRobotCfgPPO.runner):
        alg = "lbc"
        run_name = "debug"
        experiment_name = "lbc_aliengo"
        max_iterations = 10000  
        num_test_envs = 30
        resume = False
        resume_path = "weights/lbc.pt"
        teacher_policy = "experiments/obstacles.pt"

I am using this obstacles.pt teacher model which visually performs very well (see this video). And now I want to train a student with your incredible LBC approach!

The first try results in a small bug:

Traceback (most recent call last):
  File "legged_gym/scripts/lbc.py", line 54, in <module>
    train(args)
  File "legged_gym/scripts/lbc.py", line 42, in train
    env, env_cfg = task_registry.make_env(name=args.task, args=args)
  File "/home/josep-barbera/Documents/nViNL/legged_gym/utils/task_registry.py", line 123, in make_env
    record=record
  File "/home/josep-barbera/Documents/nViNL/legged_gym/envs/aliengo/aliengo.py", line 124, in __init__
    self.envs[env_idx], self.actor_handles[env_idx], "base"
NameError: name 'env_idx' is not defined

that is here in the ViNL repo. Changing the env_idx by i solves the problem.

The second try runs perfectly and starts training (see the command line outputs before training info that follows):

python legged_gym/scripts/lbc.py --task=aliengo_lbc --headless
Importing module 'gym_37' (/home/josep-barbera/Documents/nViNL/submodules/isaacgym/python/isaacgym/_bindings/linux-x86_64/gym_37.so)
Setting GYM_USD_PLUG_INFO_PATH to /home/josep-barbera/Documents/nViNL/submodules/isaacgym/python/isaacgym/_bindings/linux-x86_64/usd/plugInfo.json
PyTorch version 1.13.1
Device count 1
/home/josep-barbera/Documents/nViNL/submodules/isaacgym/python/isaacgym/_bindings/src/gymtorch
Using /home/josep-barbera/.cache/torch_extensions/py37_cu116 as PyTorch extensions root...
Emitting ninja build file /home/josep-barbera/.cache/torch_extensions/py37_cu116/gymtorch/build.ninja...
Building extension module gymtorch...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module gymtorch...
Setting seed: 1
Not connected to PVD
+++ Using GPU PhysX
Physics Engine: PhysX
Physics Device: cuda:0
GPU Pipeline: enabled
HEIGHT SAMPLES:  [ 0 28 29]
Hor scale:  0.1
vertical scale:  0.005
any_contacts True
crash_freq True
feet_step True
feet_stumble True
ALIENGO INIT
USEDM in LBC Runner:  False
(Student) Actor MLP: Sequential(
  (0): Linear(in_features=80, out_features=512, bias=True)
  (1): ELU(alpha=1.0)
  (2): Linear(in_features=512, out_features=256, bias=True)
  (3): ELU(alpha=1.0)
  (4): Linear(in_features=256, out_features=128, bias=True)
  (5): ELU(alpha=1.0)
  (6): Linear(in_features=128, out_features=12, bias=True)
)
(Student) Train Type:  lbc
(Student) ENCODER MLP: CNNRNN(
  (cnn): SimpleCNN(
    (cnn): Sequential(
      (0): Conv2d(1, 32, kernel_size=(8, 8), stride=(4, 4))
      (1): ReLU(inplace=True)
      (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
      (3): ReLU(inplace=True)
      (4): Conv2d(64, 32, kernel_size=(3, 3), stride=(1, 1))
      (5): Flatten(start_dim=1, end_dim=-1)
      (6): Linear(in_features=21888, out_features=32, bias=True)
      (7): ReLU(inplace=True)
    )
  )
  (rnn): LSTMStateEncoder(
    (rnn): LSTM(80, 64, num_layers=2)
  )
  (rnn_linear): Linear(in_features=64, out_features=32, bias=True)
)
################################################################################
                       Learning iteration 0/10000                       
...

Then the weird behavior begins: it is usual that for the first 100 iterations the mean_reward increases... but around iteration 200 it starts to decrease and decrease... to really small values. I have let it run up to 10000 iterations but the mean_reward stays close to 0.

Please, could someone try to replicate this? Just clone the repo, use my weights from the previous training step (obstacles) and try to train a student with the lbc approach.

On the other hand, if you detect any error in my procedure or have any idea of the possible error... Let me know!!!

Thanks in advance,

jbarciv

SimarKareer commented 4 months ago

That's certainly very odd. Here's a couple tips to help debug.

First can you ensure that the correct weights are being loaded here. Make sure that the student is loading all the weights from the teacher except the depth map.
Next, try to just run the teacher as a baseline in the lbc.py. I believe if you set alt_ckpt ="" it will do that, based off of this code

jbarciv commented 4 months ago

Hello again.

Some things:

1) What is the purpose of the baseline? How should I use it to debug? 2) I have printed out the items() within the Teacher using the following code:

  print(50*"=")

  for i in torch.load(train_cfg["runner"]["teacher_policy"]):
      print(i)

  print(50*"=")

  for i, j in torch.load(train_cfg["runner"]["teacher_policy"])["model_state_dict"].items():
      print(i)

  print(50*"=")

which results in this output:

  ==================================================
  model_state_dict
  optimizer_state_dict
  iter
  infos
  ==================================================
  std
  actor.0.weight
  actor.0.bias
  actor.2.weight
  actor.2.bias
  actor.4.weight
  actor.4.bias
  actor.6.weight
  actor.6.bias
  critic.0.weight
  critic.0.bias
  critic.2.weight
  critic.2.bias
  critic.4.weight
  critic.4.bias
  critic.6.weight
  critic.6.bias
  encoder.encoder.0.weight
  encoder.encoder.0.bias
  encoder.encoder.2.weight
  encoder.encoder.2.bias
  encoder.encoder.4.weight
  encoder.encoder.4.bias
  ==================================================

3) Based on the previous point... How should the Student load the Teacher's weights? The current code loads them this way:

  if not (alt_ckpt is not None and alt_ckpt != ""):
        actorDict = torch.load(train_cfg["runner"]["teacher_policy"])["model_state_dict" ]
        dmDict = torch.load(train_cfg["runner"]["teacher_policy"])["model_state_dict"]

        newDmDict = {}
        for k, v in dmDict.items():
            if "encoder" in k:
                newk = k[8:]
                newDmDict[newk] = v

        actor.load_state_dict(actorDict, strict=False)
        dm_encoder.load_state_dict(newDmDict)

As you can see the Actor_Critic from the Teacher is loaded directly (all: the actor, the critic, the std and the encoder) to the Actor of the Student. Meanwhile, in the Baseline test loads only the actor and the std.

# For baseline testing, overwrite the DM encoder and policy
self.use_dm = alt_ckpt is not None and alt_ckpt != ""
if self.use_dm:
    print("!!!!!!!! USING A DM BASELINE !!!!!!!!")
    print("Loading DM baseline:", alt_ckpt)
    loaded_dict = torch.load(alt_ckpt, map_location="cuda")
    encoder_weights = {
        k[len("encoder.") :]: v
        for k, v in loaded_dict["model_state_dict"].items()
        if k.startswith("encoder.")
    }
    policy_weights = {
        k: v
        for k, v in loaded_dict["model_state_dict"].items()
        if k.startswith("actor.") or k == "std"
    }
    self.dm_encoder.load_state_dict(encoder_weights)
    self.actor.load_state_dict(policy_weights)

4) I have run the simulations trying some possible variations in the loading process but all of them results in bad performance around the iteration 200~350.

5) I have compared the lbc_runner.py with the on_policy_runner.py and there are (of course) differences that I would like to point out just in case... On the left the OnPolicyRunner on the right the LbcRunner. Numbers 1, 2 and 3 points some differences that maybe someone could help me to understand. They are:

The use of the inference_mode in the OnPolicyRunner but not in the LbcRunner
The use of the .to(self.device) after the step in the OnPolicyRunner but not in the LbcRunner
The identation within the if "episode" in infos: in the LbcRunner but not in the OnPolicyRunner.

Sorry, I know this is too long. But I wanted to summarize all my progress. If anyone besides @SimarKareer (who has already done a lot for me) could help me with this... (maybe @naokiyokoyama or @yhykid :innocent:...?) please let me know, I will be very grateful. Thanks!

jbarciv

SimarKareer commented 4 months ago

What is the purpose of the baseline? How should I use it to debug?

Let's start with this. Just try to load your teacher model during the lbc stage, and make sure it has the same good performance that you showed in the video. This will help us make sure there's no issue with the environment, training procedure etc.

jbarciv commented 4 months ago

Hi there!

Regarding Baseline: Thanks for the clarification on the Baseline!

I have set it up with alt_ckpt:


class runner(LeggedRobotCfgPPO.runner):
   alg = "lbc"
   run_name = "debug"
   experiment_name = "lbc_aliengo"
   max_iterations = 10000  # number of policy updates
   num_test_envs = 1
   resume = False  # True for eval, False for train
   resume_path = "weights/model_150_9.2.pt"
   teacher_policy = "weights/obs.pt"
   alt_ckpt = "weights/obs.pt"
I plotted the mean reward vs iterations, and this is the output:
![image](https://github.com/user-attachments/assets/50798281-ef88-4fa4-9c5a-0686a1eb0528)

Hyperparameters Experimentation: Someone suggested experimenting with hyperparameters. Here are the different behaviors observed:
Weight Loading Verification: I carefully checked the weight loading, and everything seemed correct. Here’s my approach: In the lbc_runner.py, I printed the actor.state_dict() before and after loading. I observed that the actor only loads the teacher_actor and the std, and those weights were updated properly.

Any feedback is welcome!

jbarciv

SimarKareer commented 4 months ago

Wait, it seems that for the baseline where you load the teacher, the initial reward is ~0. If we correctly loaded both the actor and the depth map encoder, the initial reward should be higher (since you said this teacher policy had good performance). Something seems slightly off about that to me

SimarKareer / ViNL

Training questions and LBC stage #7