jbarciv opened this issue 5 months ago
Hi @jbarciv, thanks for your questions (also following up on your email)! For the lbc training, it's important that we initialize the student from the teacher's weights. As you noted in your email, the student has a slightly different architecture than the teacher: the student has a vision-based encoder, whereas the teacher has a depth-map-based encoder. Aside from this difference, the remaining model weights can be shared. We load the appropriate weights from the teacher into the student here. Also, alt_ckpt is only used to test the baseline depth-map policy here, so you shouldn't use it when training the lbc stage.
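Roughly, the idea looks like this (just a sketch of the intent, not the exact code — see the linked line for the real implementation; `student` and `teacher_ckpt_path` are stand-in names):

```python
import torch

# Sketch: initialize the student from the teacher's checkpoint, skipping the
# encoder weights, since the student's vision (CNN+RNN) encoder has a different
# architecture than the teacher's depth-map encoder.
teacher_sd = torch.load(teacher_ckpt_path)["model_state_dict"]
shared_sd = {k: v for k, v in teacher_sd.items() if not k.startswith("encoder.")}
# strict=False so the student's own encoder keeps its fresh initialization
student.load_state_dict(shared_sd, strict=False)
```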
Also we began doing some real hardware experiments which didn't make the paper. Our codebase for that is here.
Hello Simar,
Thanks a lot for your emails and answers.
The problem remains in the LBC stage. I have cloned the current repo on Ubuntu 24.04 with CUDA 12.2 and driver 535.138.01, and I am using a laptop RTX 4060 GPU.
I have followed all the setup.sh steps without any relevant errors.
I have set the aliengo_lbc_config as follows:
```python
class runner(LeggedRobotCfgPPO.runner):
    alg = "lbc"
    run_name = "debug"
    experiment_name = "lbc_aliengo"
    max_iterations = 10000
    num_test_envs = 30
    resume = False
    resume_path = "weights/lbc.pt"
    teacher_policy = "experiments/obstacles.pt"
```
I am using this obstacles.pt teacher model, which visually performs very well (see this video), and now I want to train a student with your incredible LBC approach!
The first try results in a small bug:
Traceback (most recent call last):
File "legged_gym/scripts/lbc.py", line 54, in <module>
train(args)
File "legged_gym/scripts/lbc.py", line 42, in train
env, env_cfg = task_registry.make_env(name=args.task, args=args)
File "/home/josep-barbera/Documents/nViNL/legged_gym/utils/task_registry.py", line 123, in make_env
record=record
File "/home/josep-barbera/Documents/nViNL/legged_gym/envs/aliengo/aliengo.py", line 124, in __init__
self.envs[env_idx], self.actor_handles[env_idx], "base"
NameError: name 'env_idx' is not defined
which is here in the ViNL repo. Changing env_idx to i solves the problem.
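For reference, a minimal sketch of the fix (the enclosing call is paraphrased from the traceback; `some_gym_call` is a placeholder for whatever Isaac Gym helper actually receives these arguments):

```python
# The enclosing loop iterates with `i`, so `env_idx` is undefined inside it;
# using the loop variable resolves the NameError.
for i in range(self.num_envs):
    # ...
    some_gym_call(self.envs[i], self.actor_handles[i], "base")  # was env_idx
```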
The second try runs and starts training (see the command-line output below, before the training info):
python legged_gym/scripts/lbc.py --task=aliengo_lbc --headless
Importing module 'gym_37' (/home/josep-barbera/Documents/nViNL/submodules/isaacgym/python/isaacgym/_bindings/linux-x86_64/gym_37.so)
Setting GYM_USD_PLUG_INFO_PATH to /home/josep-barbera/Documents/nViNL/submodules/isaacgym/python/isaacgym/_bindings/linux-x86_64/usd/plugInfo.json
PyTorch version 1.13.1
Device count 1
/home/josep-barbera/Documents/nViNL/submodules/isaacgym/python/isaacgym/_bindings/src/gymtorch
Using /home/josep-barbera/.cache/torch_extensions/py37_cu116 as PyTorch extensions root...
Emitting ninja build file /home/josep-barbera/.cache/torch_extensions/py37_cu116/gymtorch/build.ninja...
Building extension module gymtorch...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module gymtorch...
Setting seed: 1
Not connected to PVD
+++ Using GPU PhysX
Physics Engine: PhysX
Physics Device: cuda:0
GPU Pipeline: enabled
HEIGHT SAMPLES: [ 0 28 29]
Hor scale: 0.1
vertical scale: 0.005
any_contacts True
crash_freq True
feet_step True
feet_stumble True
ALIENGO INIT
USEDM in LBC Runner: False
(Student) Actor MLP: Sequential(
(0): Linear(in_features=80, out_features=512, bias=True)
(1): ELU(alpha=1.0)
(2): Linear(in_features=512, out_features=256, bias=True)
(3): ELU(alpha=1.0)
(4): Linear(in_features=256, out_features=128, bias=True)
(5): ELU(alpha=1.0)
(6): Linear(in_features=128, out_features=12, bias=True)
)
(Student) Train Type: lbc
(Student) ENCODER MLP: CNNRNN(
(cnn): SimpleCNN(
(cnn): Sequential(
(0): Conv2d(1, 32, kernel_size=(8, 8), stride=(4, 4))
(1): ReLU(inplace=True)
(2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
(3): ReLU(inplace=True)
(4): Conv2d(64, 32, kernel_size=(3, 3), stride=(1, 1))
(5): Flatten(start_dim=1, end_dim=-1)
(6): Linear(in_features=21888, out_features=32, bias=True)
(7): ReLU(inplace=True)
)
)
(rnn): LSTMStateEncoder(
(rnn): LSTM(80, 64, num_layers=2)
)
(rnn_linear): Linear(in_features=64, out_features=32, bias=True)
)
################################################################################
Learning iteration 0/10000
...
Then the weird behavior begins: for the first ~100 iterations the mean_reward increases as usual, but around iteration 200 it starts to decrease, down to really small values. I have let it run up to 10000 iterations, but the mean_reward stays close to 0.
Please, could someone try to replicate this? Just clone the repo, use my weights from the previous training step (obstacles), and try to train a student with the lbc approach.
On the other hand, if you detect any error in my procedure or have any idea what the problem might be, please let me know!
Thanks in advance,
jbarciv
That's certainly very odd. Here are a couple of tips to help debug.
Hello again.
Some things:
1) What is the purpose of the baseline? How should I use it to debug?
2) I have printed out the items() within the Teacher checkpoint using the following code:
```python
print(50 * "=")
for i in torch.load(train_cfg["runner"]["teacher_policy"]):
    print(i)
print(50 * "=")
for i, j in torch.load(train_cfg["runner"]["teacher_policy"])["model_state_dict"].items():
    print(i)
print(50 * "=")
```
which results in this output:
==================================================
model_state_dict
optimizer_state_dict
iter
infos
==================================================
std
actor.0.weight
actor.0.bias
actor.2.weight
actor.2.bias
actor.4.weight
actor.4.bias
actor.6.weight
actor.6.bias
critic.0.weight
critic.0.bias
critic.2.weight
critic.2.bias
critic.4.weight
critic.4.bias
critic.6.weight
critic.6.bias
encoder.encoder.0.weight
encoder.encoder.0.bias
encoder.encoder.2.weight
encoder.encoder.2.bias
encoder.encoder.4.weight
encoder.encoder.4.bias
==================================================
3) Based on the previous point... how should the Student load the Teacher's weights? The current code loads them this way:
```python
if not (alt_ckpt is not None and alt_ckpt != ""):
    actorDict = torch.load(train_cfg["runner"]["teacher_policy"])["model_state_dict"]
    dmDict = torch.load(train_cfg["runner"]["teacher_policy"])["model_state_dict"]
    newDmDict = {}
    for k, v in dmDict.items():
        if "encoder" in k:
            newk = k[8:]
            newDmDict[newk] = v
    actor.load_state_dict(actorDict, strict=False)
    dm_encoder.load_state_dict(newDmDict)
```
As you can see, the Teacher's Actor_Critic is loaded directly (all of it: the actor, the critic, the std, and the encoder) into the Student's Actor. A small check of what actually gets copied is sketched below.
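A quick way to see exactly what gets copied in this call (a sketch, not code from the repo; names are taken from the snippet above) is to inspect the return value of load_state_dict:

```python
import torch

# load_state_dict(strict=False) returns the keys it could not match, which
# shows what was actually copied into the student's actor and what was ignored.
teacher_sd = torch.load(train_cfg["runner"]["teacher_policy"])["model_state_dict"]
result = actor.load_state_dict(teacher_sd, strict=False)
print("missing keys (left at their initialization):", result.missing_keys)
print("unexpected keys (silently ignored):", result.unexpected_keys)
```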
Meanwhile, the Baseline test loads only the actor and the std:
```python
# For baseline testing, overwrite the DM encoder and policy
self.use_dm = alt_ckpt is not None and alt_ckpt != ""
if self.use_dm:
    print("!!!!!!!! USING A DM BASELINE !!!!!!!!")
    print("Loading DM baseline:", alt_ckpt)
    loaded_dict = torch.load(alt_ckpt, map_location="cuda")
    encoder_weights = {
        k[len("encoder.") :]: v
        for k, v in loaded_dict["model_state_dict"].items()
        if k.startswith("encoder.")
    }
    policy_weights = {
        k: v
        for k, v in loaded_dict["model_state_dict"].items()
        if k.startswith("actor.") or k == "std"
    }
    self.dm_encoder.load_state_dict(encoder_weights)
    self.actor.load_state_dict(policy_weights)
```
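For comparison, making the LBC-side loading explicit in the same way the baseline filters by prefix would look roughly like this (a sketch using names from the snippets above, not code from the repo; with strict=False the effect should be the same, it just makes the copied keys explicit):

```python
# Sketch: filter the teacher checkpoint the way the baseline does, so that
# only actor.* and std are handed to the student's actor.
teacher_sd = torch.load(train_cfg["runner"]["teacher_policy"])["model_state_dict"]
policy_weights = {
    k: v for k, v in teacher_sd.items() if k.startswith("actor.") or k == "std"
}
actor.load_state_dict(policy_weights, strict=False)
```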
4) I have run the simulations trying some possible variations in the loading process, but all of them result in bad performance around iterations 200~350.
5) I have compared lbc_runner.py with on_policy_runner.py, and there are (of course) differences that I would like to point out just in case (on the left the OnPolicyRunner, on the right the LbcRunner). Numbers 1, 2, and 3 mark some differences that maybe someone could help me understand. They are (a rough sketch of what they look like follows the list):
1. inference_mode in the OnPolicyRunner but not in the LbcRunner
2. .to(self.device) after the step in the OnPolicyRunner but not in the LbcRunner
3. if "episode" in infos: in the LbcRunner but not in the OnPolicyRunner
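For concreteness, here is roughly what those three differences look like in a generic rollout loop (not the actual runner code; `env`, `policy`, and `device` are stand-ins for the runners' attributes):

```python
import torch

def collect_rollout(env, policy, device, num_steps):
    obs = env.get_observations().to(device)
    with torch.inference_mode():  # (1) used in OnPolicyRunner, absent in LbcRunner
        for _ in range(num_steps):
            actions = policy(obs)
            obs, rewards, dones, infos = env.step(actions)
            # (2) OnPolicyRunner moves the step outputs to the training device
            obs, rewards, dones = obs.to(device), rewards.to(device), dones.to(device)
            # (3) LbcRunner guards its episode bookkeeping with this check
            if "episode" in infos:
                pass  # accumulate infos["episode"] statistics here
```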
Sorry, I know this is too long, but I wanted to summarize all my progress. If anyone besides @SimarKareer (who has already done a lot for me) could help me with this (maybe @naokiyokoyama or @yhykid :innocent:...?), please let me know; I will be very grateful. Thanks!
jbarciv
> What is the purpose of the baseline? How should I use it to debug?
Let's start with this. Just try to load your teacher model during the lbc stage, and make sure it has the same good performance that you showed in the video. This will help us make sure there's no issue with the environment, training procedure, etc.
Hi there!
Regarding the Baseline: thanks for the clarification! I have set it up with alt_ckpt:
```python
class runner(LeggedRobotCfgPPO.runner):
    alg = "lbc"
    run_name = "debug"
    experiment_name = "lbc_aliengo"
    max_iterations = 10000  # number of policy updates
    num_test_envs = 1
    resume = False  # True for eval, False for train
    resume_path = "weights/model_150_9.2.pt"
    teacher_policy = "weights/obs.pt"
    alt_ckpt = "weights/obs.pt"
```
I plotted the mean reward vs iterations, and this is the output:
![image](https://github.com/user-attachments/assets/50798281-ef88-4fa4-9c5a-0686a1eb0528)
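For reference, such a curve can also be read directly out of the Tensorboard logs (a sketch, assuming the runner logs through an rsl_rl-style SummaryWriter; the tag name and log path below are assumptions):

```python
import matplotlib.pyplot as plt
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Load the scalar events written during training (path and tag are assumptions)
acc = EventAccumulator("logs/lbc_aliengo/<run_dir>")
acc.Reload()
events = acc.Scalars("Train/mean_reward")

plt.plot([e.step for e in events], [e.value for e in events])
plt.xlabel("iteration")
plt.ylabel("mean reward")
plt.show()
```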
Hyperparameter Experimentation: someone suggested experimenting with hyperparameters. Here are the different behaviors observed:
Weight Loading Verification: I carefully checked the weight loading, and everything seemed correct. Here's my approach: in lbc_runner.py, I printed actor.state_dict() before and after loading, and I observed that the actor only loads the teacher actor and the std, and that those weights were updated properly (sketched below).
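Roughly, the check looked like this (a sketch, not the exact code; `actor` and the teacher path are taken from the config and snippets above):

```python
import torch

# Snapshot the student's actor parameters before loading
before = {k: v.clone() for k, v in actor.state_dict().items()}

teacher_sd = torch.load(train_cfg["runner"]["teacher_policy"])["model_state_dict"]
actor.load_state_dict(teacher_sd, strict=False)

# Report which parameters actually changed
for k, v in actor.state_dict().items():
    status = "updated" if not torch.equal(before[k], v) else "unchanged"
    print(f"{k}: {status}")
```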
Any feedback is welcome!
jbarciv
Wait, it seems that for the baseline where you load the teacher, the initial reward is ~0. If we correctly loaded both the actor and the depth-map encoder, the initial reward should be higher (since you said this teacher policy had good performance). Something seems slightly off about that to me.
Dear Simar et al.,
First of all, I would like to thank you for your research. I believe it is very well done and deserves to be studied carefully to learn from your perspectives, methods, and insights.
I am writing to ask you about some details and questions that seem to arise after a first reading of the README and reviewing the code:
1) In the third training phase, should python legged_gym/scripts/lbc.py --task=aliengo_lbc be used instead of python legged_gym/scripts/train.py --task=aliengo_lbc?
2) Additionally, if the "teacher policy" is to be defined from the console, should alt_ckpt be used? If I later want to use play to visualize and measure the training results, should I continue using alt_ckpt? I understand that it will be necessary to use --resume, as well as --load_run and --checkpoint, but I am not sure what to do regarding the training policy. I will try everything anyway, but could you confirm this?
3) Finally, although the scope of the research did not originally involve implementing it on hardware and therefore it is not mentioned in the README, could you give me any pointers on how the locomotion policy should be exported? And how to make it work outside of Isaac Gym?
That's all. I appreciate your time and dedication in advance.
Best regards,
jbarciv