
[CoRL 2024] Open-TeleVision: Teleoperation with Immersive Active Visual Feedback
https://robot-tv.github.io/

Unable to Save JIT after training ACT #29

Open · razataiab opened this issue 3 months ago

razataiab commented 3 months ago

After setting up and cross-checking teleop_hand.py, we confirmed that it streams the 3D hands as expected.

Moving on to the Training Guide, we set up the dataset from the provided drive and processed it successfully.

When we trained ACT with the command below, this is the output we got:

```
python imitate_episodes.py --policy_class ACT --kl_weight 10 --chunk_size 60 --hidden_dim 512 --batch_size 45 --dim_feedforward 3200 --num_epochs 50000 --lr 5e-5 --seed 0 --taskid 00 --exptid 01-sample-expt
```


```
Task name: 00-can-sorting

wandb: Currently logged in as: ayaans1804 (ayaans1804-nottingham-trent-university). Use `wandb login --relogin` to force relogin
wandb: wandb version 0.17.7 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.17.3
wandb: Run data is saved locally in ../data/logs/wandb/run-20240821_202622-8p5d9thq
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run 01-sample-expt
wandb: ⭐️ View project at https://wandb.ai/ayaans1804-nottingham-trent-university/television
wandb: 🚀 View run at https://wandb.ai/ayaans1804-nottingham-trent-university/television/runs/8p5d9thq

Data from: /home/robot/Desktop/TeleVision/data/recordings/00-can-sorting/processed

Train episodes: 9, Val episodes: 1
/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/utils/data/dataloader.py:558: UserWarning: This DataLoader will create 24 worker processes in total. Our suggested max number of worker in current system is 20, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
Using cache found in /home/robot/.cache/torch/hub/facebookresearch_dinov2_main
/home/robot/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/layers/swiglu_ffn.py:51: UserWarning: xFormers is not available (SwiGLU)
  warnings.warn("xFormers is not available (SwiGLU)")
/home/robot/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/layers/attention.py:33: UserWarning: xFormers is not available (Attention)
  warnings.warn("xFormers is not available (Attention)")
/home/robot/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/layers/block.py:40: UserWarning: xFormers is not available (Block)
  warnings.warn("xFormers is not available (Block)")
number of parameters: 94.75M
KL Weight 10
0%| | 0/50000 [00:00<?, ?it/s]
Epoch 0
Val loss: 83.19358
val/l1: 0.878 val/kl: 8.232 val/loss: 83.194
/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/utils/data/dataloader.py:558: UserWarning: This DataLoader will create 24 worker processes in total. Our suggested max number of worker in current system is 20, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
0%| | 0/50000 [00:27<?, ?it/s]
Traceback (most recent call last):
  File "imitate_episodes.py", line 367, in <module>
    main(args)
  File "imitate_episodes.py", line 131, in main
    best_ckpt_info = train_bc(train_dataloader, val_dataloader, config)
  File "imitate_episodes.py", line 241, in train_bc
    forward_dict = forward_pass(data, policy)
  File "imitate_episodes.py", line 173, in forward_pass
    return policy(qpos_data, image_data, action_data, is_pad)  # TODO remove None
  File "/home/robot/Desktop/TeleVision/act/policy.py", line 58, in __call__
    a_hat, is_pad_hat, (mu, logvar) = self.model(qpos, image, env_state, actions, is_pad)
  File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/robot/Desktop/TeleVision/act/detr/models/detr_vae.py", line 149, in forward
    hs = self.transformer(src, None, self.query_embed.weight, pos, latent_input, proprio_input, self.additional_pos_embed.weight)[0]
  File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/robot/Desktop/TeleVision/act/detr/models/transformer.py", line 73, in forward
    memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)
  File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/robot/Desktop/TeleVision/act/detr/models/transformer.py", line 94, in forward
    output = layer(output, src_mask=mask,
  File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/robot/Desktop/TeleVision/act/detr/models/transformer.py", line 201, in forward
    return self.forward_post(src, src_mask, src_key_padding_mask, pos)
  File "/home/robot/Desktop/TeleVision/act/detr/models/transformer.py", line 176, in forward_post
    src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
  File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/nn/functional.py", line 1500, in relu
    result = torch.relu(input)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 388.00 MiB. GPU
wandb: | 0.016 MB of 0.016 MB uploaded
wandb: Run history:
wandb: val/kl ▁
wandb: val/l1 ▁
wandb: val/loss ▁
wandb:
wandb: Run summary:
wandb: val/kl 8.23156
wandb: val/l1 0.87797
wandb: val/loss 83.19358
wandb:
wandb: 🚀 View run 01-sample-expt at: https://wandb.ai/ayaans1804-nottingham-trent-university/television/runs/8p5d9thq
wandb: ⭐️ View project at: https://wandb.ai/ayaans1804-nottingham-trent-university/television
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ../data/logs/wandb/run-20240821_202622-8p5d9thq/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.
```
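So the run dies with a CUDA OOM in the very first epoch, before any policy_epoch_*.ckpt is written. We suspect our settings (--batch_size 45, --chunk_size 60) are simply too large for our GPU rather than this being a bug. Before rerunning, a minimal sketch we can use to see how much GPU memory is actually free (plain PyTorch, nothing TeleVision-specific) would be:

```python
# Minimal sketch: report free GPU memory before launching training.
# torch.cuda.mem_get_info returns (free_bytes, total_bytes) for the
# current CUDA device.
import torch

free, total = torch.cuda.mem_get_info()
print(f"GPU memory: {free / 1e9:.2f} GB free of {total / 1e9:.2f} GB total")
```

If only a few GB are free, lowering --batch_size (and possibly --chunk_size) seems like the first thing to try; the 24-worker DataLoader warning concerns CPU processes and should be unrelated to the GPU OOM.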

And this is the output when we try to save the JIT policy:

```
python imitate_episodes.py --policy_class ACT --kl_weight 10 --chunk_size 60 --hidden_dim 512 --batch_size 45 --dim_feedforward 3200 --num_epochs 50000 --lr 5e-5 --seed 0 --taskid 00 --exptid 01-sample-expt \
    --save_jit --resume_ckpt 25000
```


```
Task name: 00-can-sorting

Data from: /home/robot/Desktop/TeleVision/data/recordings/00-can-sorting/processed

Train episodes: 9, Val episodes: 1
/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/utils/data/dataloader.py:558: UserWarning: This DataLoader will create 24 worker processes in total. Our suggested max number of worker in current system is 20, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
Using cache found in /home/robot/.cache/torch/hub/facebookresearch_dinov2_main
/home/robot/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/layers/swiglu_ffn.py:51: UserWarning: xFormers is not available (SwiGLU)
  warnings.warn("xFormers is not available (SwiGLU)")
/home/robot/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/layers/attention.py:33: UserWarning: xFormers is not available (Attention)
  warnings.warn("xFormers is not available (Attention)")
/home/robot/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/layers/block.py:40: UserWarning: xFormers is not available (Block)
  warnings.warn("xFormers is not available (Block)")
number of parameters: 94.75M
KL Weight 10

Resuming from /home/robot/Desktop/TeleVision/data/logs/00-can-sorting/01-sample-expt/policy_epoch_25000_seed_0.ckpt

Traceback (most recent call last):
  File "imitate_episodes.py", line 367, in <module>
    main(args)
  File "imitate_episodes.py", line 128, in main
    save_jit(config)
  File "imitate_episodes.py", line 317, in save_jit
    policy, ckpt_name, epoch = load_ckpt(policy, exp_dir, config['resume_ckpt'])
  File "imitate_episodes.py", line 304, in load_ckpt
    policy.load_state_dict(torch.load(resume_path))
  File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/serialization.py", line 997, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/serialization.py", line 444, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/robot/anaconda3/envs/tv/lib/python3.8/site-packages/torch/serialization.py", line 425, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/home/robot/Desktop/TeleVision/data/logs/00-can-sorting/01-sample-expt/policy_epoch_25000_seed_0.ckpt'
```
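Our guess is that the checkpoint is missing simply because the training run above OOMed at epoch 0, so imitate_episodes.py never reached a save point. A minimal sanity-check sketch to list which checkpoints actually exist before picking a --resume_ckpt epoch (the directory and filename pattern are copied from the error message above; this script is not part of the repo) would be:

```python
# Minimal sketch: list the checkpoints that actually exist in the
# experiment directory before choosing an epoch for --resume_ckpt.
from pathlib import Path

exp_dir = Path("/home/robot/Desktop/TeleVision/data/logs/00-can-sorting/01-sample-expt")
ckpts = sorted(exp_dir.glob("policy_epoch_*_seed_0.ckpt"))

if not ckpts:
    print(f"No checkpoints in {exp_dir} -- training has not saved anything yet.")
else:
    for ckpt in ckpts:
        print(ckpt.name)
```

If the glob comes back empty, we presumably need a training run that survives long enough to save a checkpoint before --save_jit can work. Is that the intended behavior, or should save_jit fail more gracefully here?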