Div99 / LISA

(NeurIPS '22) LISA: Learning Interpretable Skill Abstractions - A framework for unsupervised skill learning using Imitation
https://div99.github.io/LISA/

Action loss is not declining on LISA training #1

Open dljzx opened 1 year ago

dljzx commented 1 year ago

Thanks for your work! While I was training the LISA model on the LOReL dataset with 50,000 episodes, I found that the action loss did not decline effectively, while the commitment loss declined successfully. As a result, the total loss stays at about 0.93 and stops decreasing after 30 steps, and the evaluation success rate stays at zero after 2500 iterations. Can you help me?

dljzx commented 1 year ago

After training with LISA, I tried training Lang DT on the same dataset. Although the action prediction loss declined successfully, the success rate still remains below 0.2 with no upward trend through 2500 iterations. Is the error in my dataset or in the code?

[image: success rate by Lang DT]

[image: success rate by LISA]

Here are my hyperparameters: main.py model=traj_option batch_size=1024 option_selector.use_vq=True seed=0 train_dataset.num_trajectories=50000 model.horizon=10 model.K=10 option_selector.num_options=20 env=lorel_sawyer_state warmup_steps=2500 max_iters=2500 trainer.eval_every=100 option_selector.commitment_weight=0.25 option_selector.kmeans_init=True save_interval=100

skandavaidyanath commented 1 year ago

Hi, thanks so much for your interest in our work and apologies that you're having trouble with reproducing the results. A couple of pointers that might help you debug this issue:

  1. The success rates in Table 1 of the paper are for the 6 original Lorl instructions without any kind of rephrasing. The evaluation done by the code is over all kinds of rephrasing so this number will be lower than reported in the paper. There are other scripts like viz-lorl.py and viz-lorl.ipynb that evaluate on only the 6 main instructions or you can easily modify the code to get this.
  2. When we used the Lorl data, we cleaned it by removing all the "Do nothing" labeled instructions (a rough filtering sketch follows this list). This resulted in us having only 40108 trajectories and not 50k. I'm not sure if this should lead to such a big disparity in results though.
  3. Getting a success rate of 0 on this environment is weird because even a random policy should get some success on this tabletop environment. Maybe there's a bug you've introduced somehow?
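
For reference, the cleanup in point 2 was essentially dropping every trajectory whose language label was "Do nothing". Here is a minimal sketch of that kind of filtering; the `instruction` field name is just an assumption about how the annotations are stored, not our exact preprocessing code:

```python
# Rough sketch of the "Do nothing" cleanup, not the exact preprocessing script.
# Assumes each trajectory is a dict carrying its language annotation under "instruction".
def filter_do_nothing(trajectories):
    kept = [
        traj for traj in trajectories
        if traj.get("instruction", "").strip().lower() != "do nothing"
    ]
    print(f"Kept {len(kept)} of {len(trajectories)} trajectories")  # ~40108 of 50000 in our case
    return kept
```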

Here are some wandb plots attached for Lorl with states to get an indication of what the curves should look like:

For Lang DT: [two training-curve screenshots]

For LISA: [two training-curve screenshots]

From the curves and your numbers, it seems like the Lang DT success rates are similar and the loss numbers are similar. Maybe there's a bug in the evaluation script for LISA that's crept into the code?

Please let us know if this helps and if there's anything else we can do to resolve this issue!

dljzx commented 1 year ago


Thanks for your reply. The training curves of LISA are similar, so the bug is probably in the evaluation script. The only difference between LISA and Lang DT is in the get_action function. Can you recheck the code of get_action and get_options?
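
For context, this is roughly how I understand the evaluation flow (a hypothetical sketch with placeholder names, not the actual functions in the repo); the only LISA-specific part compared to Lang DT should be choosing an option code before predicting each action:

```python
# Hypothetical rollout sketch; `select_option` and `predict_action` are placeholders
# standing in for whatever get_options/get_action actually do in the repo.
def evaluate_episode(env, select_option, predict_action, lang_embedding,
                     horizon=10, max_steps=200):
    state = env.reset()
    states, actions, option = [state], [], None
    for t in range(max_steps):
        if t % horizon == 0:
            # LISA-only step: pick a skill code from language + the states seen so far
            option = select_option(lang_embedding, states)
        # Same as Lang DT except for the extra option input
        action = predict_action(states, actions, option)
        state, reward, done, info = env.step(action)
        states.append(state)
        actions.append(action)
        if done:
            break
    return info
```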

dljzx commented 1 year ago

Additionally, I found that the Word Freq matrix output by my LISA training is terrible: throughout training it stays fixed on skill code 11 for every task. [image: Word Freq matrix]

So I think the VQ-VAE may not have been trained successfully. Maybe checking the option selector training would fix this problem.
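
This is the quick check I am using for codebook collapse (a rough diagnostic sketch, not code from the repo): if the perplexity of the selected codes stays near 1, the option selector has collapsed onto a single skill code, like code 11 above.

```python
import torch

def codebook_usage(option_indices: torch.Tensor, num_options: int = 20):
    """option_indices: tensor of VQ code ids selected for a batch of trajectories."""
    counts = torch.bincount(option_indices.flatten(), minlength=num_options).float()
    probs = counts / counts.sum().clamp(min=1)
    # Perplexity near 1 => collapse onto one code; healthy runs spread over many codes.
    perplexity = torch.exp(-(probs * torch.log(probs + 1e-10)).sum())
    return counts, perplexity.item()
```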

skandavaidyanath commented 1 year ago

Ah, this picture helps a lot, thanks! I think the code looks fine. Can you try a commitment weight of 1 for LoRL please?
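
For intuition, the commitment weight only scales the commitment term in the VQ-style objective, roughly like this (illustrative names, not our exact loss code); a larger weight pushes the option encoder's outputs to stay close to their chosen codebook vectors:

```python
import torch.nn.functional as F

def combined_loss(pred_actions, target_actions, encoder_out, quantized, commitment_weight=1.0):
    # Behavior-cloning / action prediction term
    action_loss = F.mse_loss(pred_actions, target_actions)
    # Commitment term: keep the encoder output near the selected (detached) code
    commitment_loss = F.mse_loss(encoder_out, quantized.detach())
    return action_loss + commitment_weight * commitment_loss
```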

dljzx commented 1 year ago

It seems that a commitment weight of 1 helps a bit. The success rate is no longer zero, but it keeps declining during training: at step 3 it reaches about 0.3, but as training continues it stays under 0.1. Also, the Word Freq matrix shows only two skills, 8 and 18. [image: Word Freq matrix at step 250] Would that get better after more steps?

Also, here is my command, which I think might not match your advice: main.py env=lorel_sawyer_obs method=traj_option dt.n_layer=1 dt.n_head=4 option_selector.option_transformer.n_layer=1 option_selector.option_transformer.n_head=4 option_selector.commitment_weight=1.0 option_selector.option_transformer.hidden_size=128 batch_size=1024 seed=2 warmup_steps=5000 trainer.eval_every=50 save_interval=50 max_iters=2500

skandavaidyanath commented 1 year ago

Yeah we ran it for 500 iterations but our options were more spread out by this time. Maybe run it for longer and see how it goes? I think your command looks fine to me. You should get an overall success rate of about 40% at the end. Also just to clarify, in my earlier images, those were for lorl_state since I thought you were running those. It seems like you're running lorl_observation now.

dljzx commented 1 year ago


Both obs and state are training on my machine. lorel_obs trains much slower than state, so there are not many results for lorel_obs yet. lorel_state gets bad results, with a low success rate and a poor Word Freq matrix. [images: success rate and Word Freq matrix]

And here is my command: main.py env=lorel_sawyer_state method=traj_option dt.n_layer=1 dt.n_head=4 option_selector.option_transformer.n_layer=1 option_selector.option_transformer.n_head=4 option_selector.commitment_weight=1.0 option_selector.option_transformer.hidden_size=128 batch_size=256 seed=19 warmup_steps=2500 max_iters=2500

dljzx commented 1 year ago

If there is no bug in the code, then maybe there is a difference between our LOReL Sawyer datasets. Can you share your dataset?