UT-Austin-RPL / Lotus


Request for Detailed Hyperparameters in LOTUS Experiments #6

Open BrightMoonStar opened 3 months ago

BrightMoonStar commented 3 months ago

Dear Dr. Weikang Wan and Team,

I recently came across your fascinating work on the LOTUS algorithm, as detailed in your paper "LOTUS: Continual Imitation Learning for Robot Manipulation Through Unsupervised Skill Discovery."

Your approach to lifelong robot learning through unsupervised skill discovery is truly impressive and offers significant insights into continual imitation learning for robot manipulation. I am particularly interested in replicating and building upon your experiments as part of my research.

However, I noticed that the paper does not provide specific details on some of the experimental hyperparameters, such as the learning rate, number of epochs, and batch size used during training. These details are crucial for ensuring that my replication is as accurate as possible.

Could you kindly provide the following details:

1. The learning rate(s) used for training the models.
2. The number of epochs each model was trained for.
3. The batch size used during training.
4. Any other relevant hyperparameters or settings that were critical to the performance of the LOTUS algorithm.

I greatly appreciate your time and assistance. Your work is a significant contribution to the field, and having these details would be immensely helpful for my research.

Thank you very much for your support and I look forward to your response.

Best regards

wkwan7 commented 3 months ago

Hi @BrightMoonStar, thanks for your interest in our work! The hyperparameters you mentioned are all specified in the configs, and the provided training scripts are the ones we used for the experiments in the paper.
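A quick way to dump everything in the training config is to load the YAML directly. This is only a sketch: it assumes the hyperparameters live in lotus/configs/train/default.yaml (the config file referenced below in this thread) and that the exact key names follow the repo's layout.

# Illustrative sketch for inspecting the training hyperparameters;
# the actual key names depend on the repo's config layout.
import yaml

with open("lotus/configs/train/default.yaml") as f:
    train_cfg = yaml.safe_load(f)

# Expect entries such as n_epochs and batch_size, plus optimizer settings.
for key, value in sorted(train_cfg.items()):
    print(f"{key}: {value}")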

BrightMoonStar commented 3 months ago

Thank you for your reply. I strictly followed each step of the instructions in your project and repeated them many times, but all evaluation metrics, including the success rate and AUC, are always 0. I also found that LIBERO has the same problem, as shown here: https://github.com/Lifelong-Robot-Learning/LIBERO/issues/21. I really can't find where the problem is.

BrightMoonStar commented 3 months ago

At first I thought n_epochs: 50 in lotus/configs/train/default.yaml might not be enough, so I set n_epochs=1000. The training and evaluation logs are attached below, but the success rate and AUC are still all 0. Thank you again. output.log

pengzhi1998 commented 3 months ago

Dear Weikang,

Thank you for open-sourcing this great repo!

I'm wondering how you tackled the challenge of opening hdf5 files with multiprocessing in LIBERO? It seems many people have encountered this same issue: https://github.com/Lifelong-Robot-Learning/LIBERO/issues/19#issue-2237952918. May we have your suggestions? @wkwan7

Thank you for your attention and precious time. I look forward to your reply!

Best regards, Pengzhi

wkwan7 commented 3 months ago

Hi @BrightMoonStar, can you try the default parameters (e.g., n_epochs: 50) and post your output log here? By the way, I recommend using wandb, which can show more detailed logs.

wkwan7 commented 3 months ago

Hi @pengzhi1998, if the default dataloader setting does not work for you, you can try this:

# num_workers=0 keeps data loading in the main process, so the dataset's
# open h5py handle never has to be pickled into worker subprocesses.
train_dataloader = DataLoader(
    dataset,
    batch_size=self.cfg.train.batch_size,
    num_workers=0,  # self.cfg.train.num_workers,
    sampler=RandomSampler(dataset),
    # persistent_workers=True,
)

I don't think this will significantly increase the training time.
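If keeping num_workers > 0 matters for throughput, another common workaround for the pickling error is to open the hdf5 file lazily inside the dataset rather than in __init__, so that each worker opens its own handle. Below is a minimal sketch of that pattern with a hypothetical class; it is not the actual Lotus/LIBERO dataset implementation.

import h5py
import torch
from torch.utils.data import Dataset

class LazyHDF5Dataset(Dataset):
    def __init__(self, hdf5_path, key="obs"):
        self.hdf5_path = hdf5_path
        self.key = key
        self._file = None  # opened lazily, so the dataset object stays picklable
        with h5py.File(hdf5_path, "r") as f:
            self._length = len(f[self.key])

    def _ensure_open(self):
        # Each DataLoader worker opens its own file handle on first access.
        if self._file is None:
            self._file = h5py.File(self.hdf5_path, "r")
        return self._file

    def __len__(self):
        return self._length

    def __getitem__(self, idx):
        f = self._ensure_open()
        return torch.as_tensor(f[self.key][idx])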

pengzhi1998 commented 3 months ago

Thank you Weikang for your reply!

Yes, I have tried it, and it worked well when training and evaluating on the first task in LIBERO. However, when training on the second task, the same problem occurred, but for a different reason:

Traceback (most recent call last):
  File "lifelong/main.py", line 219, in main
    s_fwd, l_fwd = algo.learn_one_task(
  File "/workspace/LIBERO/libero/lifelong/algos/base.py", line 170, in learn_one_task
    loss = self.observe(data)
  File "/workspace/LIBERO/libero/lifelong/algos/er.py", line 73, in observe
    buf_data = next(self.buffer)
  File "/workspace/LIBERO/libero/lifelong/algos/er.py", line 17, in cycle
    for data in dl:
  File "/opt/conda/envs/libero/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 363, in __iter__
    self._iterator = self._get_iterator()
  File "/opt/conda/envs/libero/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 314, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/conda/envs/libero/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 927, in __init__
    w.start()
  File "/opt/conda/envs/libero/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/conda/envs/libero/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/opt/conda/envs/libero/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/envs/libero/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/envs/libero/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/conda/envs/libero/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/envs/libero/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/opt/conda/envs/libero/lib/python3.8/site-packages/h5py/_hl/base.py", line 370, in __getnewargs__
    raise TypeError("h5py objects cannot be pickled")
TypeError: h5py objects cannot be pickled

It seems that when running the experience replay algorithm, this cycle function also creates multiple processes to load data from the previous task (even though I have set num_workers to 0, which confuses me the most).

Besides, I noticed that Robomimic also uses multiprocessing for training with hdf5 data. Their implementation is very similar to LIBERO's, yet it doesn't encounter this error, which is also confusing.

May I have some of your insights about these issues? Thank you so much again!!
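For reference, the error itself is easy to reproduce outside the repo: any Dataset holding an open h5py handle fails as soon as a DataLoader spawns worker processes, because spawn-based workers pickle the dataset (the traceback above goes through popen_spawn_posix). A self-contained, hypothetical sketch:

import h5py
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class OpenHandleDataset(Dataset):
    def __init__(self, path):
        self.f = h5py.File(path, "r")  # handle kept open in the parent process
        self.x = self.f["x"]

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return torch.as_tensor(self.x[idx])

if __name__ == "__main__":
    with h5py.File("demo.hdf5", "w") as f:
        f.create_dataset("x", data=np.arange(8, dtype=np.float32))

    ds = OpenHandleDataset("demo.hdf5")
    # num_workers=0 iterates fine; with num_workers > 0 and a spawn context,
    # creating the iterator raises "TypeError: h5py objects cannot be pickled".
    loader = DataLoader(ds, batch_size=4, num_workers=2,
                        multiprocessing_context="spawn")
    for batch in loader:
        print(batch)

So if the error persists after setting num_workers=0 for the main training loader, it may be worth checking whether the replay-buffer DataLoader that the cycle function in er.py iterates over is still constructed with a nonzero num_workers.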

wkwan7 commented 3 months ago

Hi @pengzhi1998, when I run ER using the Lotus codebase, it doesn't seem to have the issue you mentioned. The command I used is as follows:

export CUDA_VISIBLE_DEVICES=0 && \
export MUJOCO_EGL_DEVICE_ID=0 && \
python lotus/lifelong/main_old.py seed=0 \
                               benchmark_name=LIBERO_OBJECT \
                               policy=bc_transformer_policy \
                               lifelong=er

Maybe you can try using the Lotus codebase to see if you still have the issue.

BrightMoonStar commented 3 months ago

> Hi @BrightMoonStar, can you try the default parameters (e.g., n_epochs: 50) and post your output log here? By the way, I recommend using wandb, which can show more detailed logs.

Hi, this is the wandb log with n_epochs=50: offline-run-20240702_001934-vaark4lu.zip

pengzhi1998 commented 2 weeks ago

@BrightMoonStar Hi, did you eventually solve this problem (success rate always around 0)?