NVlabs / RVT

Official Code for RVT-2 and RVT
https://robotic-view-transformer-2.github.io/

KeyError: 'lang_goal_tokens' #11

Closed: LemonWade closed this issue 1 year ago

LemonWade commented 1 year ago

Thank you very much for your work. Below is a bug I encountered while reproducing it.


I downloaded and decompressed the data and replay files for a single task. Because np.bool is deprecated in NumPy, I replaced every instance of np.bool with np.bool_. When I ran the training command again (python train.py --exp_cfg_path configs/all_100.yaml --device 0), I got KeyError: 'lang_goal_tokens'. Did I do something wrong?
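For readers making the same NumPy change: the np.bool alias was deprecated in NumPy 1.20 and removed in 1.24, so older code that uses it fails on those releases. Below is a minimal sketch of the replacement described above, assuming a plain regex rewrite over source text is acceptable. The helper name replace_np_bool is hypothetical and not part of RVT.

```python
import re

# Matches the standalone alias `np.bool` only. The trailing `\b` needs a
# non-word character (or end of text) after "bool", so `np.bool_` and
# `np.bool8` are left alone; the leading `\b` skips names such as `jnp.bool`.
_NP_BOOL = re.compile(r"\bnp\.bool\b")


def replace_np_bool(source: str) -> str:
    """Return `source` with every standalone `np.bool` rewritten to `np.bool_`."""
    return _NP_BOOL.sub("np.bool_", source)
```

Applied to the text of each .py file, this avoids accidentally turning an existing np.bool_ into np.bool__, which a naive string replace would do. It is still worth reviewing the resulting diff before committing.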

configs/all_100.yaml

exp_id: rvt
tasks: slide_block_to_color_target
bs: 3
num_workers: 3
epochs: 15
sample_distribution_mode: task_uniform
peract:
  lr: 1e-4
  warmup_steps: 2000
  optimizer_type: lamb
  lr_cos_dec: True
  transform_augmentation_xyz: [0.125, 0.125, 0.125]
  transform_augmentation_rpy: [0.0, 0.0, 45.0]
rvt:
  place_with_mean: False

logs

(rvt-zzy) root@7708b7cca4e2:/data/zzy/RVT/rvt# python train.py --exp_cfg_path configs/all_100.yaml --device 0              
dict(exp_cfg)={'agent': 'our', 'tasks': 'slide_block_to_color_target', 'exp_id': 'rvt', 'resume': '', 'bs': 3, 'epochs': 15, 'num_workers': 3, 'sample_distribution_mode': 'task_uniform', 'peract': CfgNode({'lambda_weight_l2': 1e-06, 'lr': 0.00030000000000000003, 'optimizer_type': 'lamb', 'warmup_steps': 2000, 'lr_cos_dec': True, 'add_rgc_loss': True, 'num_rotation_classes': 72, 'transform_augmentation': True, 'transform_augmentation_xyz': [0.125, 0.125, 0.125], 'transform_augmentation_rpy': [0.0, 0.0, 45.0]}), 'rvt': CfgNode({'gt_hm_sigma': 1.5, 'img_aug': 0.1, 'place_with_mean': False, 'move_pc_in_bound': True}), 'peract_official': CfgNode({'cfg_path': 'configs/peract_official_config.yaml'})}
Training on 1 tasks: ['slide_block_to_color_target']
[Info] Replay dataset already exists in the disk: replay/replay_train/slide_block_to_color_target
Created Dataset. Time Cost: 0.21861758629480998 minutes
MVT Vars: {'training': True, '_parameters': OrderedDict(), '_buffers': OrderedDict(), '_non_persistent_buffers_set': set(), '_backward_hooks': OrderedDict(), '_is_full_backward_hook': None, '_forward_hooks': OrderedDict(), '_forward_pre_hooks': OrderedDict(), '_state_dict_hooks': OrderedDict(), '_load_state_dict_pre_hooks': OrderedDict(), '_load_state_dict_post_hooks': OrderedDict(), '_modules': OrderedDict(), 'depth': 8, 'img_feat_dim': 3, 'img_size': 220, 'add_proprio': True, 'proprio_dim': 4, 'add_lang': True, 'lang_dim': 512, 'lang_len': 77, 'im_channels': 64, 'img_patch_size': 11, 'final_dim': 64, 'attn_dropout': 0.1, 'decoder_dropout': 0.0, 'self_cross_ver': 1, 'add_corr': True, 'add_pixel_loc': True, 'add_depth': True, 'pe_fix': True}
Start training ...
Rank [0], Epoch [0]: Training on train dataset
  0%|                                                                                                                                                         | 0/53333 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 300, in <module>
    mp.spawn(experiment, args=(cmd_args, devices, port), nprocs=len(devices), join=True)
  File "/root/anaconda3/envs/rvt-zzy/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/anaconda3/envs/rvt-zzy/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/root/anaconda3/envs/rvt-zzy/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/rvt-zzy/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data/zzy/RVT/rvt/train.py", line 260, in experiment
    out = train(agent, train_dataset, TRAINING_ITERATIONS, rank)
  File "/data/zzy/RVT/rvt/train.py", line 54, in train
    raw_batch = next(data_iter)
  File "/root/anaconda3/envs/rvt-zzy/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/root/anaconda3/envs/rvt-zzy/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
    return self._process_data(data)
  File "/root/anaconda3/envs/rvt-zzy/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
    data.reraise()
  File "/root/anaconda3/envs/rvt-zzy/lib/python3.8/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/root/anaconda3/envs/rvt-zzy/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/root/anaconda3/envs/rvt-zzy/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 39, in fetch
    data = next(self.dataset_iter)
  File "/data/zzy/RVT/rvt/libs/YARR/yarr/replay_buffer/wrappers/pytorch_replay_buffer.py", line 41, in _generator
    yield self._replay_buffer.sample_transition_batch(pack_in_dict=True, distribution_mode = self._sample_distribution_mode)
  File "/data/zzy/RVT/rvt/libs/YARR/yarr/replay_buffer/uniform_replay_buffer.py", line 803, in sample_transition_batch
    store = self._get_from_disk(
  File "/data/zzy/RVT/rvt/libs/YARR/yarr/replay_buffer/uniform_replay_buffer.py", line 456, in _get_from_disk
    store[k][i] = v # NOTE: potential bug here, should % self._replay_capacity
KeyError: 'lang_goal_tokens'

imankgoyal commented 1 year ago

Hi @LemonWade,

Happy to help. I don't think you did anything wrong, but I am unable to reproduce the issue on my end. The code looks fine, so my hunch is that the problem lies with the data.
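The traceback above fails at store[k][i] = v inside _get_from_disk, which means the store being filled has no entry named 'lang_goal_tokens'. One way to test the data hypothesis is to compare the keys in the on-disk replay files against the keys the buffer is expected to hold. The sketch below is not part of RVT or YARR. It assumes each transition is saved as a pickled dict in a file matching *.replay (an assumption about the on-disk format), and the reader must supply expected_keys. The function name audit_replay_keys is hypothetical.

```python
import pickle
from pathlib import Path


def audit_replay_keys(replay_dir, expected_keys, pattern="*.replay"):
    """Map each problematic file name to a short description of the problem.

    Assumes (not confirmed by the thread) that every file matching `pattern`
    holds one pickled dict. Only run this on data you trust, because
    unpickling can execute arbitrary code.
    """
    expected = set(expected_keys)
    problems = {}
    for path in sorted(Path(replay_dir).glob(pattern)):
        try:
            with open(path, "rb") as f:
                record = pickle.load(f)
        except Exception as exc:  # e.g. an empty or truncated file
            problems[path.name] = f"unreadable ({type(exc).__name__})"
            continue
        if not isinstance(record, dict):
            problems[path.name] = "not a dict"
            continue
        found = set(record)
        missing = sorted(expected - found, key=str)
        unexpected = sorted(found - expected, key=str)
        if missing or unexpected:
            problems[path.name] = f"missing={missing} unexpected={unexpected}"
    return problems
```

An empty result suggests the files are readable and consistent with the expected keys. Any entry reported as unreadable or carrying unexpected keys points to stale, mismatched, or corrupted replay data rather than a code problem.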

I did the following steps:

Here is my folder structure:

tree -L 3
.
├── config.py
├── configs
│   ├── all.yaml
│   └── peract_official_config.yaml
├── data
│   └── train
│       ├── slide_block_to_color_target
│       └── slide_block_to_color_target.zip
├── eval_internal.py
├── eval.py
├── libs
│   ├── peract
│   │   ├── agents
│   │   ├── ARM_LICENSE
│   │   ├── conf
│   │   ├── eval.py
│   │   ├── helpers
│   │   ├── LICENSE
│   │   ├── media
│   │   ├── model-card.md
│   │   ├── README.md
│   │   ├── requirements.txt
│   │   ├── run_seed_fn.py
│   │   ├── scripts
│   │   ├── setup.py
│   │   ├── train.py
│   │   └── voxel
│   ├── peract_colab
│   │   ├── peract_colab
│   │   ├── peract_colab.egg-info
│   │   └── setup.py
│   ├── PyRep
│   │   ├── build
│   │   ├── cffi_build
│   │   ├── docs
│   │   ├── examples
│   │   ├── LICENSE
│   │   ├── pyrep
│   │   ├── PyRep.egg-info
│   │   ├── README.md
│   │   ├── requirements.txt
│   │   ├── robot_ttms
│   │   ├── setup.py
│   │   ├── system
│   │   ├── tests
│   │   ├── tools
│   │   └── tutorials
│   ├── RLBench
│   │   ├── examples
│   │   ├── LICENSE
│   │   ├── readme_files
│   │   ├── README.md
│   │   ├── requirements.txt
│   │   ├── rlbench
│   │   ├── rlbench.egg-info
│   │   ├── setup.py
│   │   ├── tests
│   │   ├── tools
│   │   ├── travisci_generate_index.py
│   │   ├── travisci_run_tests.py
│   │   └── tutorials
│   └── YARR
│       ├── LICENSE
│       ├── logo.png
│       ├── README.md
│       ├── requirements.txt
│       ├── setup.py
│       ├── yarr
│       └── yarr.egg-info
├── models
│   ├── peract_official.py
│   ├── __pycache__
│   │   ├── peract_official.cpython-38.pyc
│   │   └── rvt_agent.cpython-38.pyc
│   └── rvt_agent.py
├── mvt
│   ├── attn.py
│   ├── augmentation.py
│   ├── aug_utils.py
│   ├── config.py
│   ├── __init__.py
│   ├── mvt.py
│   ├── mvt_single.py
│   ├── __pycache__
│   │   ├── attn.cpython-38.pyc
│   │   ├── augmentation.cpython-38.pyc
│   │   ├── aug_utils.cpython-38.pyc
│   │   ├── config.cpython-38.pyc
│   │   ├── __init__.cpython-38.pyc
│   │   ├── mvt.cpython-38.pyc
│   │   ├── mvt_single.cpython-38.pyc
│   │   ├── renderer.cpython-38.pyc
│   │   └── utils.cpython-38.pyc
│   ├── renderer.py
│   └── utils.py
├── __pycache__
│   └── config.cpython-38.pyc
├── replay
│   └── replay_train
│       ├── slide_block_to_color_target
│       └── slide_block_to_color_target.tar.xz
├── runs
│   └── rvt_tasks_slide_block_to_color_target
│       ├── args.yaml
│       ├── events.out.tfevents.1691690765.neil
│       ├── events.out.tfevents.1691690946.neil
│       ├── exp_cfg.yaml
│       └── mvt_cfg.yaml
├── train.py
└── utils
    ├── custom_rlbench_env.py
    ├── dataset.py
    ├── ddp_utils.py
    ├── get_dataset.py
    ├── __init__.py
    ├── lr_sched_utils.py
    ├── peract_utils.py
    ├── __pycache__
    │   ├── custom_rlbench_env.cpython-38.pyc
    │   ├── dataset.cpython-38.pyc
    │   ├── ddp_utils.cpython-38.pyc
    │   ├── get_dataset.cpython-38.pyc
    │   ├── __init__.cpython-38.pyc
    │   ├── lr_sched_utils.cpython-38.pyc
    │   ├── peract_utils.cpython-38.pyc
    │   ├── rlbench_planning.cpython-38.pyc
    │   └── rvt_utils.cpython-38.pyc
    ├── rlbench_planning.py
    └── rvt_utils.py

Here is the log:

╰─$ python3 train.py --exp_cfg_path configs/all.yaml --device 0 --exp_cfg_opts "tasks slide_block_to_color_target"                                                                                      
dict(exp_cfg)={'agent': 'our', 'tasks': 'slide_block_to_color_target', 'exp_id': 'rvt_tasks_slide_block_to_color_target', 'resume': '', 'bs': 3, 'epochs': 15, 'num_workers': 3, 'sample_distribution_mode': 'task_uniform', 'peract': CfgNode({'lambda_weight_l2': 1e-06, 'lr': 0.00030000000000000003, 'optimizer_type': 'lamb', 'warmup_steps': 2000, 'lr_cos_dec': True, 'add_rgc_loss': True, 'num_rotation_classes': 72, 'transform_augmentation': True, 'transform_augmentation_xyz': [0.125, 0.125, 0.125], 'transform_augmentation_rpy': [0.0, 0.0, 45.0]}), 'rvt': CfgNode({'gt_hm_sigma': 1.5, 'img_aug': 0.1, 'place_with_mean': False, 'move_pc_in_bound': True}), 'peract_official': CfgNode({'cfg_path': 'configs/peract_official_config.yaml'})}
Training on 1 tasks: ['slide_block_to_color_target']
[Info] Replay dataset already exists in the disk: replay/replay_train/slide_block_to_color_target
Created Dataset. Time Cost: 0.08458356459935507 minutes
MVT Vars: {'training': True, '_parameters': OrderedDict(), '_buffers': OrderedDict(), '_non_persistent_buffers_set': set(), '_backward_hooks': OrderedDict(), '_is_full_backward_hook': None, '_forward_hooks': OrderedDict(), '_forward_pre_hooks': OrderedDict(), '_state_dict_hooks': OrderedDict(), '_load_state_dict_pre_hooks': OrderedDict(), '_load_state_dict_post_hooks': OrderedDict(), '_modules': OrderedDict(), 'depth': 8, 'img_feat_dim': 3, 'img_size': 220, 'add_proprio': True, 'proprio_dim': 4, 'add_lang': True, 'lang_dim': 512, 'lang_len': 77, 'im_channels': 64, 'img_patch_size': 11, 'final_dim': 64, 'attn_dropout': 0.1, 'decoder_dropout': 0.0, 'self_cross_ver': 1, 'add_corr': True, 'add_pixel_loc': True, 'add_depth': True, 'pe_fix': True}
Start training ...
Rank [0], Epoch [0]: Training on train dataset
  0%|                                                                                                                                                                                   | 0/53333 [00:00<?, ?it/s]/home/angoyal/RVT/rvt/models/rvt_agent.py:518: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  trans_aug_range=torch.tensor(self._transform_augmentation_xyz),
  0%|▏                                                                                                                                                                       | 76/53333 [00:41<7:34:39,  1.95it/s]

LemonWade commented 1 year ago

After following your steps and re-downloading the data, I trained the model successfully. The earlier error was most likely caused by file corruption during my original download. Thank you very much for reproducing the process for me. The evaluation also ran normally.
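Since the root cause here was most likely a corrupted download, a quick integrity check on the downloaded archives before extracting them can save a debugging round. The following is a minimal sketch using only the Python standard library. The function name archive_is_intact is hypothetical, and the check reads every member in full, so it can be slow on large archives.

```python
import lzma
import tarfile
import zipfile
import zlib


def archive_is_intact(path):
    """Return True if every member of a zip or tar archive can be read in full."""
    path = str(path)
    try:
        if zipfile.is_zipfile(path):
            with zipfile.ZipFile(path) as zf:
                # testzip() returns the first bad member name, or None if all pass.
                return zf.testzip() is None
        # Mode "r" auto-detects plain, gzip, bzip2, and xz compressed tar files.
        with tarfile.open(path) as tf:
            for member in tf:
                if member.isfile():
                    with tf.extractfile(member) as fh:
                        # Read through the member so decompression errors surface.
                        while fh.read(1 << 20):
                            pass
        return True
    except (zipfile.BadZipFile, tarfile.TarError, lzma.LZMAError,
            zlib.error, EOFError, OSError):
        return False
```

For the layout shown earlier in this thread, this could be called on data/train/slide_block_to_color_target.zip and replay/replay_train/slide_block_to_color_target.tar.xz. A False result means the file should be downloaded again.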