If capturable=True, params and state_steps must be CUDA tensors.

RenieWell commented 11 months ago

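Thank you very much to your team for open-sourcing this work! When I tried to test streetsurf on the segment-10061305430875486848_1080_000_1100_000_with_camera_labels data downloaded from this project, I ran into the following error: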

    python -u /data/temp_dev/code_single/tools/train.py --config code_single/configs/waymo/streetsurf/withmask_nolidar.230814.yaml
    2023-09-08 16:15:28,178-rk0-utils.py#20:kaolin is not installed. OctreeAS / ForestAS disabled.
    2023-09-08 16:15:28,945-rk0-occgrid_accel.py#30:vedo not installed. Some of the visualizations are disabled.
    2023-09-08 16:15:28,945-rk0-occgrid_forest_accel.py#29:vedo not installed. Some of the visualizations are disabled.
    => Use cuda devices: [0]
    => Init Env @ single process: use device_ids = [0]
    2023-09-08 16:15:29,100-rk0-train.py#887:=> Experiments dir: logs/streetsurf/seg100613.withmask_nolidar_exp1
    2023-09-08 16:15:29,101-rk0-utils.py#840:=> Backing up from ./ to logs/streetsurf/seg100613.withmask_nolidar_exp1/backup...
    2023-09-08 16:15:29,157-rk0-utils.py#848:done.
    2023-09-08 16:15:29,178-rk0-train.py#909:=> Creating scene_bank...
    => scenario file saved to logs/streetsurf/seg100613.withmask_nolidar_exp1/scenarios/segment-10061305430875486848_1080_000_1100_000_with_camera_labels.pt
    => scene bank metadata saved to logs/streetsurf/seg100613.withmask_nolidar_exp1/scenarios/metadata.json
    2023-09-08 16:15:30,719-rk0-train.py#917:=> Done creating scene_bank.
    2023-09-08 16:15:30,730-rk0-lotd_encoding.py#35:tensorly is not installed.
    => street using cuboid space
    2023-09-08 16:15:30,775-rk0-lotd_cfg.py#129:NGP auto-computed config: layer resolutions: [[87, 67, 26], [120, 93, 37], [166, 128, 51], [230, 178, 70], [318, 246, 97], [440, 340, 134], [608, 470, 186], [841, 650, 257], [1163, 898, 356], [1607, 1241, 492], [2221, 1715, 680], [3069, 2371, 940], [4242, 3276, 1299], [5863, 4528, 1795], [8103, 6258, 2482], [11198, 8649, 3430], [15476, 11953, 4740]]
    2023-09-08 16:15:30,775-rk0-lotd_cfg.py#130:NGP auto-computed config: layer types: ['Dense', 'Dense', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash']
    2023-09-08 16:15:30,775-rk0-lotd_cfg.py#131:NGP auto-computed config: layer n_feats: [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
    2023-09-08 16:15:30,775-rk0-lotd_cfg.py#132:NGP auto-computed config: expected num_params=33554432; generated: 32586228 [0.97x]
    2023-09-08 16:15:30,780-rk0-lotd_cfg.py#189:NGP-4D auto-computed config: layer resolutions: [[53, 41, 16, 4], [73, 56, 23, 6], [100, 78, 31, 8], [138, 107, 43, 11], [191, 148, 59, 15], [264, 204, 81, 21], [364, 282, 112, 28], [503, 389, 155, 39], [696, 537, 213, 54], [961, 742, 295, 74], [1328, 1026, 407, 102], [1835, 1418, 562, 141], [2536, 1959, 777, 195], [3505, 2707, 1074, 269], [4843, 3741, 1484, 371], [6693, 5170, 2050, 513]]
    2023-09-08 16:15:30,780-rk0-lotd_cfg.py#190:NGP-4D auto-computed config: layer types: ['Dense', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash', 'Hash']
    2023-09-08 16:15:30,780-rk0-lotd_cfg.py#191:NGP-4D auto-computed config: layer n_feats: [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
    2023-09-08 16:15:30,780-rk0-lotd_cfg.py#192:NGP-4D auto-computed config: totally 16006784 parameters [0.95x]
    2023-09-08 16:15:30,784-rk0-image_embeddings.py#65:segment-10061305430875486848_1080_000_1100_000_with_camera_labels create image embeddings for ['camera_FRONT', 'camera_FRONT_LEFT', 'camera_FRONT_RIGHT']
    2023-09-08 16:15:30,792-rk0-train.py#944:=> Model structure saved to logs/streetsurf/seg100613.withmask_nolidar_exp1/model.txt
    2023-09-08 16:15:30,792-rk0-train.py#962:=> Start loading data, for exp: logs/streetsurf/seg100613.withmask_nolidar_exp1
    2023-09-08 16:15:30,793-rk0-train.py#965:=> Done loading data.
    2023-09-08 16:15:30,796-rk0-checkpoint.py#74:=> Found ckpts: []
    2023-09-08 16:15:30,799-rk0-train.py#182:=> Start initialize prepcess...
    => Pretraining SDF...: 100%|██| 1000/1000 [00:20<00:00, 49.20it/s, loss=2.13e-5]
    2023-09-08 16:15:51,137-rk0-train.py#204:=> Done initialize prepcess.
    2023-09-08 16:15:51,137-rk0-checkpoint.py#41:=> Saving ckpt to logs/streetsurf/seg100613.withmask_nolidar_exp1/ckpts/0.pt
    2023-09-08 16:15:51,450-rk0-checkpoint.py#46:Done.
    2023-09-08 16:15:51,450-rk0-train.py#1078:=> Start [train], it=0, lr=1e-05, in logs/streetsurf/seg100613.withmask_nolidar_exp1
    0%| | 0/12000 [00:00<?, ?it/s]
    Init OCC: 0%| | 0/4 [00:00<?, ?it/s]
    /data/miniconda3/envs/nerf2mesh/lib/python3.8/site-packages/torch/utils/tensorboard/summary.py:446: RuntimeWarning: invalid value encountered in cast
      tensor = (tensor * scale_factor).astype(np.uint8)
    Error occurred in: logs/streetsurf/seg100613.withmask_nolidar_exp1
    0%| | 0/12000 [00:10<?, ?it/s]

    Traceback (most recent call last)

    /data/weirong/temp_dev/code_single/tools/train.py:1322 in <module>

      1319
      1320 if __name__ == "__main__":
      1321     bc = make_parser()
    ❱ 1322     main_function(bc.parse(print_config=False))
      1323

    /data/weirong/temp_dev/code_single/tools/train.py:1305 in main_function

      1302                 sys.exit()
      1303             except Exception as e:
      1304                 print(f"Error occurred in: {exp_dir}")
    ❱ 1305                 raise e
      1306
      1307     if is_master():
      1308         checkpoint_io.save(filename=f'final_{it:08d}.pt', global_step

    /data/weirong/temp_dev/code_single/tools/train.py:1297 in main_function

      1294         while it <= args.training.num_iters and not end:
      1295             try:
      1296                 # iter_timestamps.append(f"{time.time() - total_start
    ❱ 1297                 train_step()
      1298             except KeyboardInterrupt:
      1299                 if is_master():
      1300                     checkpoint_io.save(filename='latest.pt', global_s

    /data/weirong/temp_dev/code_single/tools/train.py:1140 in train_step

      1137                 grad_norms = calc_grad_norm(asset_bank) if log_grad
      1138
      1139                 # optimizer.step()
    ❱ 1140                 scaler_pixel.step(optimizer)
      1141                 scaler_pixel.update()
      1142                 scheduler.step(it) # NOTE: important! when world_siz
      1143

    /data/miniconda3/envs/nerf2mesh/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py:341 in step

      338
      339         assert len(optimizer_state["found_inf_per_device"]) > 0, "No i
      340
    ❱ 341         retval = self._maybe_opt_step(optimizer, optimizer_state, arg
      342
      343         optimizer_state["stage"] = OptState.STEPPED
      344

    /data/miniconda3/envs/nerf2mesh/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py:288 in _maybe_opt_step

      285     def _maybe_opt_step(self, optimizer, optimizer_state, *args, **kwa
      286         retval = None
      287         if not sum(v.item() for v in optimizer_state["found_inf_per_de
    ❱ 288             retval = optimizer.step(*args, **kwargs)
      289         return retval
      290
      291     def step(self, optimizer, *args, **kwargs):

    /data/miniconda3/envs/nerf2mesh/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:68 in wrapper

       65                 instance = instance_ref()
       66                 instance._step_count += 1
       67                 wrapped = func.__get__(instance, cls)
    ❱  68                 return wrapped(*args, **kwargs)
       69
       70             # Note that the returned function here is no longer a bou
       71             # so attributes like __func__ and __self__ no longer

    /data/miniconda3/envs/nerf2mesh/lib/python3.8/site-packages/torch/optim/optimizer.py:140 in wrapper

      137                 obj, _ = args
      138                 profile_name = "Optimizer.step#{}.step".format(obj.__c
      139                 with torch.autograd.profiler.record_function(profile_n
    ❱ 140                     out = func(*args, **kwargs)
      141                     obj._optimizer_step_code()
      142                     return out
      143

    /data/miniconda3/envs/nerf2mesh/lib/python3.8/site-packages/torch/optim/optimizer.py:23 in _use_grad

       20         prev_grad = torch.is_grad_enabled()
       21         try:
       22             torch.set_grad_enabled(self.defaults['differentiable'])
    ❱  23             ret = func(self, *args, **kwargs)
       24         finally:
       25             torch.set_grad_enabled(prev_grad)
       26         return ret

    /data/miniconda3/envs/nerf2mesh/lib/python3.8/site-packages/torch/optim/adam.py:234 in step

      231                         raise RuntimeError('requires_grad is not sup
      232                     state_steps.append(state['step'])
      233
    ❱ 234             adam(params_with_grad,
      235                  grads,
      236                  exp_avgs,
      237                  exp_avg_sqs,

    /data/miniconda3/envs/nerf2mesh/lib/python3.8/site-packages/torch/optim/adam.py:300 in adam

      297     else:
      298         func = _single_tensor_adam
      299
    ❱ 300     func(params,
      301          grads,
      302          exp_avgs,
      303          exp_avg_sqs,

    /data/miniconda3/envs/nerf2mesh/lib/python3.8/site-packages/torch/optim/adam.py:348 in _single_tensor_adam

      345         step_t = state_steps[i]
      346
      347         if capturable:
    ❱ 348             assert param.is_cuda and step_t.is_cuda, "If capturable=Tr
      349
      350         # update step
      351         step_t += 1

    AssertionError: If capturable=True, params and state_steps must be CUDA tensors.

Process finished with exit code 1

I'm not sure what went wrong. Could you give me some pointers?

liuxinhai commented 11 months ago

I solved this problem by following this comment. Good luck!

zzzxxxttt commented 5 months ago

For those who use pytorch>=1.12.1, the problem can be solved by adding capturable=True to line 204 in nr3d_lib/models/utils.py:

    if optimizer_type == 'adam':
        optimizer = optim.Adam(param_groups, capturable=True, **kwargs)
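Before changing the optimizer, it may help to check which tensors would actually trip the assertion shown in the traceback (`param.is_cuda and step_t.is_cuda`). The following is only a sketch: `find_non_cuda_entries` is a hypothetical helper, not part of PyTorch or neuralsim. It assumes the optimizer exposes `param_groups` and `state` the way `torch.optim.Adam` does and that tensors expose `is_cuda`. `FakeTensor` and `FakeOptimizer` are stand-ins so the example runs without a GPU.

```python
def find_non_cuda_entries(optimizer):
    """List (group_index, param_index, kind) for entries that are not on CUDA.

    Hypothetical diagnostic helper. Only groups with capturable=True are
    inspected, because the assertion in the traceback is guarded by
    `if capturable:`.
    """
    problems = []
    for gi, group in enumerate(optimizer.param_groups):
        if not group.get("capturable", False):
            continue
        for pi, p in enumerate(group["params"]):
            if not p.is_cuda:
                problems.append((gi, pi, "param"))
            # `step` is absent until the optimizer has created state for `p`.
            step = optimizer.state.get(p, {}).get("step")
            # A step that is not a tensor (no `is_cuda`) is flagged as well.
            if step is not None and not getattr(step, "is_cuda", False):
                problems.append((gi, pi, "step"))
    return problems


class FakeTensor:
    """Stand-in for a tensor; only carries the `is_cuda` flag."""

    def __init__(self, is_cuda):
        self.is_cuda = is_cuda


class FakeOptimizer:
    """Stand-in exposing `param_groups` and `state` like an Adam optimizer."""

    def __init__(self, param_groups, state):
        self.param_groups = param_groups
        self.state = state


p_ok, p_cpu, p_cpu_step = FakeTensor(True), FakeTensor(False), FakeTensor(True)
demo_optimizer = FakeOptimizer(
    param_groups=[{"params": [p_ok, p_cpu, p_cpu_step], "capturable": True}],
    state={
        p_ok: {"step": FakeTensor(True)},
        p_cpu_step: {"step": FakeTensor(False)},  # step counter left on CPU
    },
)
demo_problems = find_non_cuda_entries(demo_optimizer)
print(demo_problems)
```

With a real optimizer, the same call could be placed just before `scaler_pixel.step(optimizer)`; an empty list would suggest that the capturable groups hold no CPU parameters or CPU step counters.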

PJLab-ADG / neuralsim

If capturable=True, params and state_steps must be CUDA tensors. #21