Lifelong-Robot-Learning / LIBERO

Benchmarking Knowledge Transfer in Lifelong Robot Learning

TypeError: h5py objects cannot be pickled #19

Open BrightMoonStar opened 3 months ago

BrightMoonStar commented 3 months ago

Hi! When I set num_workers to a number greater than 0, the following error appears. How can I solve this? I know the code runs with num_workers=0, but I want to further improve training speed. Thank you very much!

[info] start training on task 0
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/a/Videos/bin/boot/LIBERO/libero/lifelong/main.py", line 223, in main
    s_fwd, l_fwd = algo.learn_one_task(
  File "/home/a/Videos/bin/boot/LIBERO/libero/lifelong/algos/base.py", line 175, in learn_one_task
    for (idx, data) in enumerate(train_dataloader):
  File "/home/a/anaconda3/envs/LIBERO/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 439, in __iter__
    return self._get_iterator()
  File "/home/a/anaconda3/envs/LIBERO/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 387, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/a/anaconda3/envs/LIBERO/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1040, in __init__
    w.start()
  File "/home/a/anaconda3/envs/LIBERO/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/a/anaconda3/envs/LIBERO/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/a/anaconda3/envs/LIBERO/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/a/anaconda3/envs/LIBERO/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/a/anaconda3/envs/LIBERO/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/a/anaconda3/envs/LIBERO/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/a/anaconda3/envs/LIBERO/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/home/a/anaconda3/envs/LIBERO/lib/python3.8/site-packages/h5py/_hl/base.py", line 368, in __getnewargs__
    raise TypeError("h5py objects cannot be pickled")
TypeError: h5py objects cannot be pickled

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Cranial-XIX commented 3 months ago

Hi, could you please provide the full command to reproduce this error?

raymondyu5 commented 3 months ago

Hi, I am also running into this problem when I run:

export CUDA_VISIBLE_DEVICES=0 && export MUJOCO_EGL_DEVICE_ID=0 && python lifelong/main.py seed=0 benchmark_name=libero_90 policy=bc_rnn_policy lifelong=base

I receive this:

=================== Lifelong Benchmark Information  ===================
 Name: libero_90
 # Tasks: 90
    - Task 1:
        close the top drawer of the cabinet
    - Task 2:
        close the top drawer of the cabinet and put the black bowl on top of it
    - Task 3:
        put the black bowl in the top drawer of the cabinet
    - Task 4:
        put the butter at the back in the top drawer of the cabinet and close it
    - Task 5:
        put the butter at the front in the top drawer of the cabinet and close it
    - Task 6:
        put the chocolate pudding in the top drawer of the cabinet and close it
    - Task 7:
        open the bottom drawer of the cabinet
    - Task 8:
        open the top drawer of the cabinet
    - Task 9:
        open the top drawer of the cabinet and put the bowl in it
    - Task 10:
        put the black bowl on the plate
    - Task 11:
        put the black bowl on top of the cabinet
    - Task 12:
        open the top drawer of the cabinet
    - Task 13:
        put the black bowl at the back on the plate
    - Task 14:
        put the black bowl at the front on the plate
    - Task 15:
        put the middle black bowl on the plate
    - Task 16:
        put the middle black bowl on top of the cabinet
    - Task 17:
        stack the black bowl at the front on the black bowl in the middle
    - Task 18:
        stack the middle black bowl on the back black bowl
    - Task 19:
        put the frying pan on the stove
    - Task 20:
        put the moka pot on the stove
    - Task 21:
        turn on the stove
    - Task 22:
        turn on the stove and put the frying pan on it
    - Task 23:
        close the bottom drawer of the cabinet
    - Task 24:
        close the bottom drawer of the cabinet and open the top drawer
    - Task 25:
        put the black bowl in the bottom drawer of the cabinet
    - Task 26:
        put the black bowl on top of the cabinet
    - Task 27:
        put the wine bottle in the bottom drawer of the cabinet
    - Task 28:
        put the wine bottle on the wine rack
    - Task 29:
        close the top drawer of the cabinet
    - Task 30:
        put the black bowl in the top drawer of the cabinet
    - Task 31:
        put the black bowl on the plate
    - Task 32:
        put the black bowl on top of the cabinet
    - Task 33:
        put the ketchup in the top drawer of the cabinet
    - Task 34:
        close the microwave
    - Task 35:
        put the yellow and white mug to the front of the white mug
    - Task 36:
        open the microwave
    - Task 37:
        put the white bowl on the plate
    - Task 38:
        put the white bowl to the right of the plate
    - Task 39:
        put the right moka pot on the stove
    - Task 40:
        turn off the stove
    - Task 41:
        put the frying pan on the cabinet shelf
    - Task 42:
        put the frying pan on top of the cabinet
    - Task 43:
        put the frying pan under the cabinet shelf
    - Task 44:
        put the white bowl on top of the cabinet
    - Task 45:
        turn on the stove
    - Task 46:
        turn on the stove and put the frying pan on it
    - Task 47:
        pick up the alphabet soup and put it in the basket
    - Task 48:
        pick up the cream cheese box and put it in the basket
    - Task 49:
        pick up the ketchup and put it in the basket
    - Task 50:
        pick up the tomato sauce and put it in the basket
    - Task 51:
        pick up the alphabet soup and put it in the basket
    - Task 52:
        pick up the butter and put it in the basket
    - Task 53:
        pick up the milk and put it in the basket
    - Task 54:
        pick up the orange juice and put it in the basket
    - Task 55:
        pick up the tomato sauce and put it in the basket
    - Task 56:
        pick up the alphabet soup and put it in the tray
    - Task 57:
        pick up the butter and put it in the tray
    - Task 58:
        pick up the cream cheese and put it in the tray
    - Task 59:
        pick up the ketchup and put it in the tray
    - Task 60:
        pick up the tomato sauce and put it in the tray
    - Task 61:
        pick up the black bowl on the left and put it in the tray
    - Task 62:
        pick up the chocolate pudding and put it in the tray
    - Task 63:
        pick up the salad dressing and put it in the tray
    - Task 64:
        stack the left bowl on the right bowl and place them in the tray
    - Task 65:
        stack the right bowl on the left bowl and place them in the tray
    - Task 66:
        put the red mug on the left plate
    - Task 67:
        put the red mug on the right plate
    - Task 68:
        put the white mug on the left plate
    - Task 69:
        put the yellow and white mug on the right plate
    - Task 70:
        put the chocolate pudding to the left of the plate
    - Task 71:
        put the chocolate pudding to the right of the plate
    - Task 72:
        put the red mug on the plate
    - Task 73:
        put the white mug on the plate
    - Task 74:
        pick up the book and place it in the front compartment of the caddy
    - Task 75:
        pick up the book and place it in the left compartment of the caddy
    - Task 76:
        pick up the book and place it in the right compartment of the caddy
    - Task 77:
        pick up the yellow and white mug and place it to the right of the caddy
    - Task 78:
        pick up the book and place it in the back compartment of the caddy
    - Task 79:
        pick up the book and place it in the front compartment of the caddy
    - Task 80:
        pick up the book and place it in the left compartment of the caddy
    - Task 81:
        pick up the book and place it in the right compartment of the caddy
    - Task 82:
        pick up the book and place it in the front compartment of the caddy
    - Task 83:
        pick up the book and place it in the left compartment of the caddy
    - Task 84:
        pick up the book and place it in the right compartment of the caddy
    - Task 85:
        pick up the red mug and place it to the right of the caddy
    - Task 86:
        pick up the white mug and place it to the right of the caddy
    - Task 87:
        pick up the book in the middle and place it on the cabinet shelf
    - Task 88:
        pick up the book on the left and place it on top of the shelf
    - Task 89:
        pick up the book on the right and place it on the cabinet shelf
    - Task 90:
        pick up the book on the right and place it under the cabinet shelf
 # demonstrations: (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50) (50)
 # sequences: (3828) (10662) (6208) (9440) (9070) (8969) (7157) (4736) (10010) (6628) (6811) (3765) (5538) (6910) (5393) (6821) (6415) (5730) (9979) (7563) (4645) (14013) (5832) (11773) (6643) (7910) (6376) (12093) (3762) (6107) (6665) (7576) (10539) (10160) (6282) (7414) (9702) (6946) (10588) (8795) (9457) (9272) (8612) (7519) (5871) (13047) (6939) (7426) (8665) (8088) (7034) (6609) (5875) (8383) (5306) (6004) (5069) (7697) (7986) (5571) (6176) (7660) (5970) (10771) (11734) (7239) (7032) (5327) (5377) (6195) (4473) (6668) (7907) (8742) (7938) (7831) (6405) (7336) (7582) (6104) (7115) (8738) (7310) (6774) (5896) (5158) (8055) (8547) (5111) (5988)
=======================================================================

/home/raymond/anaconda3/envs/libero/lib/python3.8/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
[info] start lifelong learning with algo Sequential
/home/raymond/anaconda3/envs/libero/lib/python3.8/site-packages/robomimic/utils/dataset.py:516: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  pad_mask = pad_mask[:, None].astype(np.bool)
/home/raymond/anaconda3/envs/libero/lib/python3.8/site-packages/torch/nn/modules/rnn.py:761: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters(). (Triggered internally at  ../aten/src/ATen/native/cudnn/RNN.cpp:926.)
  result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
[info] policy has 13.5 GFLOPs and 19.1 MParams

[info] start training on task 0
Error executing job with overrides: ['seed=0', 'benchmark_name=libero_90', 'policy=bc_rnn_policy', 'lifelong=base']
Traceback (most recent call last):
  File "lifelong/main.py", line 219, in main
    s_fwd, l_fwd = algo.learn_one_task(
  File "/home/raymond/LIBERO/libero/lifelong/algos/base.py", line 175, in learn_one_task
    for (idx, data) in enumerate(train_dataloader):
  File "/home/raymond/anaconda3/envs/libero/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 363, in __iter__
    self._iterator = self._get_iterator()
  File "/home/raymond/anaconda3/envs/libero/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 314, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/raymond/anaconda3/envs/libero/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 927, in __init__
    w.start()
  File "/home/raymond/anaconda3/envs/libero/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/raymond/anaconda3/envs/libero/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/raymond/anaconda3/envs/libero/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/raymond/anaconda3/envs/libero/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/raymond/anaconda3/envs/libero/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/raymond/anaconda3/envs/libero/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/raymond/anaconda3/envs/libero/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/home/raymond/anaconda3/envs/libero/lib/python3.8/site-packages/h5py/_hl/base.py", line 370, in __getnewargs__
    raise TypeError("h5py objects cannot be pickled")
TypeError: h5py objects cannot be pickled

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Please let me know, thanks!

bit0123 commented 2 months ago

Hi @Cranial-XIX, thanks for the interesting paper and for releasing the code. I am facing the same issue ("TypeError: h5py objects cannot be pickled") when running with num_workers=4. The code does work with num_workers=0 (after some additional transformation of the image input in the policy file), but the loss becomes negative after the first epoch for different algorithms, and consequently the accuracy is 0. Could you please comment on that? Thanks.

modanesh commented 2 months ago

@Cranial-XIX Any solutions?

modanesh commented 2 months ago

Tried all configurations of algo, policy, and benchmark and still got the same error.

raymondyu5 commented 2 months ago

Try adding these two lines to venv.py. This took me like two days to figure out, so I hope this helps!

if multiprocessing.get_start_method(allow_none=True) != "spawn":
    multiprocessing.set_start_method("spawn", force=True)
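In context, with the import it needs (just a sketch; exactly where in venv.py it should go is up to you, it only has to run before any worker or environment subprocess is started):

import multiprocessing

# Force the "spawn" start method once, early, before any DataLoader workers
# or vectorized environments are created.
if multiprocessing.get_start_method(allow_none=True) != "spawn":
    multiprocessing.set_start_method("spawn", force=True)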

modanesh commented 2 months ago

@raymondyu5 Thanks for the response. Where exactly should I add these lines?

Btw, from my understanding, the venv.py file is not used when running training (libero/lifelong/main.py), which is what I want to do; it is only used during evaluation. However, I can try it and see whether it actually helps.

bit0123 commented 2 months ago

Hi @BrightMoonStar, @raymondyu5 and @Cranial-XIX, could you please comment on the following observation with the ER algorithm: "The code works with num_workers=0, however the loss becomes negative after the first epoch for different algorithms. Consequently, the success rate and the AoC are 0." Thanks for your response.

wangair commented 2 months ago

Hi! I met the same problem. I set num_workers=0, but it didn't work for me. How can I solve this error? Thank you very much. When I set num_workers=0 I received this:

ValueError: persistent_workers option needs num_workers > 0

BrightMoonStar commented 1 month ago

Hi @BrightMoonStar, @raymondyu5 and @Cranial-XIX, could you please comment on the following observation with the ER algorithm: "The code works with num_workers=0, however the loss becomes negative after the first epoch for different algorithms. Consequently, the success rate and the AoC are 0." Thanks for your response.

I am facing the same problem.

BrightMoonStar commented 1 month ago

Hi @BrightMoonStar, @raymondyu5 and @Cranial-XIX, could you please comment on the following observation with the ER algorithm: "The code works with num_workers=0, however the loss becomes negative after the first epoch for different algorithms. Consequently, the success rate and the AoC are 0." Thanks for your response.

Have you solved it? Thanks for your reply!

pengzhi1998 commented 1 month ago

Hi! I met the same problem. I set num_workers=0, but it didn't work for me. How can I solve this error? Thank you very much. When I set num_workers=0 I received this:

ValueError: persistent_workers option needs num_workers > 0

Hi, I commented two lines when creating the dataloader in base.py:

        train_dataloader = DataLoader(
            dataset,
            batch_size=self.cfg.train.batch_size,
            # num_workers=self.cfg.train.num_workers,
            sampler=RandomSampler(dataset),
            # persistent_workers=False,
        )
pengzhi1998 commented 4 weeks ago


Dear all,

I tried removing the multiprocessing for training the first task in LIBERO, and it worked well. But when training the second task, the same problem occurred, this time for a different reason:

Traceback (most recent call last):
  File "lifelong/main.py", line 219, in main
    s_fwd, l_fwd = algo.learn_one_task(
  File "/workspace/LIBERO/libero/lifelong/algos/base.py", line 170, in learn_one_task
    loss = self.observe(data)
  File "/workspace/LIBERO/libero/lifelong/algos/er.py", line 73, in observe
    buf_data = next(self.buffer)
  File "/workspace/LIBERO/libero/lifelong/algos/er.py", line 17, in cycle
    for data in dl:
  File "/opt/conda/envs/libero/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 363, in __iter__
    self._iterator = self._get_iterator()
  File "/opt/conda/envs/libero/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 314, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/conda/envs/libero/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 927, in __init__
    w.start()
  File "/opt/conda/envs/libero/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/conda/envs/libero/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/opt/conda/envs/libero/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/envs/libero/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/envs/libero/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/conda/envs/libero/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/envs/libero/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/opt/conda/envs/libero/lib/python3.8/site-packages/h5py/_hl/base.py", line 370, in __getnewargs__
    raise TypeError("h5py objects cannot be pickled")
TypeError: h5py objects cannot be pickled

It seems that when running the experience replay algorithm, this cycle function (shown below in generic form) also creates worker processes for loading data from the previous tasks (even though I have set num_workers to 0, which confuses me the most).
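A generic sketch of that helper (the actual er.py implementation may differ slightly); every time the inner loop restarts, iter() is called on the DataLoader again, which re-launches its worker processes whenever that loader was built with num_workers > 0:

# Generic form of the cycle() helper from the traceback (er.py, line 17).
# Each pass of the inner loop calls iter(dl) again, which re-creates the
# DataLoader's worker processes when num_workers > 0.
def cycle(dl):
    while True:
        for data in dl:
            yield data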

Besides, I noticed that Robomimic also uses multiprocessing for training with data stored as HDF5 files. Their implementation is very similar to LIBERO's, yet it does not run into this error, which is also confusing.

May I have your insights about this issue? Look forward to your reply! @Cranial-XIX @zhuyifengzju

Cranial-XIX commented 4 weeks ago

Can you comment out the

if multiprocessing.get_start_method(allow_none=True) != "spawn":
    multiprocessing.set_start_method("spawn", force=True) 

from lines 270-271 in libero/lifelong/main.py and try again? Keep everything else the same as in HEAD first. It works fine on my side.
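For background (my understanding, not a definitive diagnosis): under the "spawn" start method each worker process receives a pickled copy of the dataset, and any open h5py handle it holds raises exactly this error, whereas the default "fork" start method on Linux lets workers inherit the dataset without pickling it. A minimal, LIBERO-independent illustration of the pickling failure:

import os, pickle, tempfile

import h5py
import numpy as np

# Any object that holds an open h5py.File handle cannot be pickled, which is
# what a DataLoader with num_workers > 0 tries to do under "spawn".
path = os.path.join(tempfile.mkdtemp(), "demo.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("x", data=np.arange(10))

class Holder:
    def __init__(self, p):
        self.f = h5py.File(p, "r")   # open handle kept on the object

try:
    pickle.dumps(Holder(path))
except TypeError as e:
    print(e)   # -> h5py objects cannot be pickled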

pengzhi1998 commented 4 weeks ago

Thank you so much for your reply!

Your solution worked! However, with those two lines commented out, the issue from #3 reappears (https://github.com/Lifelong-Robot-Learning/LIBERO/issues/3).

I will first try downgrading the CUDA/driver version on the server and see how it goes.

Thank you so much again!