ahayler / s4c


AssertionError: Gather function not implemented for CPU tensors #6

Closed PuTeGuo closed 1 month ago

PuTeGuo commented 2 months ago

Hello! Thank you very much for this new and excellent work! I have a question: when I run the training file train.py, I get the error "AssertionError: Gather function not implemented for CPU tensors".

However, I don't know what went wrong; I followed the steps in your README exactly. I am using a dual-GPU machine with NVIDIA RTX 4090 GPUs, and each card has 24GB of memory. Hope to get your reply!

ahayler commented 2 months ago

Hey, thank you for your interest in our work. It is a little hard to debug your error with the limited information you provided. Could you check whether the GPUs are actually available in your conda/micromamba environment? (https://stackoverflow.com/questions/48152674/how-do-i-check-if-pytorch-is-using-the-gpu)
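
For example, a quick check along these lines (a minimal sketch; run it inside the activated environment):

```python
# Sanity check: can PyTorch in this environment actually see the GPUs?
import torch

print(torch.__version__)             # e.g. 1.13.0+cu117
print(torch.cuda.is_available())     # should print True
print(torch.cuda.device_count())     # should print 2 on a dual-GPU machine
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```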

All the best, Adrian

PuTeGuo commented 2 months ago
[screenshot]

Hi! Thank you very much for replying to my question!

I checked again whether my environment can use the GPUs normally. As shown in the screenshot, GPU usage appears to be normal.

I will also provide more detailed error information below:

```
(QDiffusionPy3.7) gpu@gpu-Server:/data/GPT/s4c-main$ python train.py -cn exp_kitti_360
Tue Aug 27 15:21:27 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:1A:00.0 Off |                  N/A |
|  0%   49C    P8    26W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:68:00.0  On |                  N/A |
|  0%   48C    P8    36W / 350W |     40MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1180      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1180      G   /usr/lib/xorg/Xorg                 38MiB |
+-----------------------------------------------------------------------------+
+++ Inference Setup Complete +++
2024-08-27 15:21:28,121 ignite.distributed.launcher.Parallel INFO: - Run '<function training at 0x7ff267a6c830>' in 1 processes
2024-08-27 15:21:28,124 kitti_360 INFO: Run kitti_360
2024-08-27 15:21:28,124 kitti_360 INFO: - PyTorch version: 1.13.0+cu117
2024-08-27 15:21:28,124 kitti_360 INFO: - Ignite version: 0.5.0.post2
2024-08-27 15:21:28,142 kitti_360 INFO: - GPU Device: NVIDIA GeForce RTX 3090
2024-08-27 15:21:28,142 kitti_360 INFO: - CUDA version: 11.7
2024-08-27 15:21:28,143 kitti_360 INFO: - CUDNN version: 8500
2024-08-27 15:21:28,143 kitti_360 INFO:
2024-08-27 15:21:28,144 kitti_360 INFO: Configuration:
2024-08-27 15:21:28,145 kitti_360 INFO: name: kitti_360
2024-08-27 15:21:28,145 kitti_360 INFO: model: bts
2024-08-27 15:21:28,145 kitti_360 INFO: seed: 0
2024-08-27 15:21:28,145 kitti_360 INFO: output_path: out/kitti_360
2024-08-27 15:21:28,145 kitti_360 INFO: batch_size: 1
2024-08-27 15:21:28,145 kitti_360 INFO: num_workers: 4
2024-08-27 15:21:28,145 kitti_360 INFO: eval_use_iters: True
2024-08-27 15:21:28,145 kitti_360 INFO: vis_use_iters: True
2024-08-27 15:21:28,145 kitti_360 INFO: validate_every: 1000
2024-08-27 15:21:28,145 kitti_360 INFO: visualize_every: 500
2024-08-27 15:21:28,145 kitti_360 INFO: log_every_iters: 1
2024-08-27 15:21:28,145 kitti_360 INFO: log_tb_train_every_iters: -1
2024-08-27 15:21:28,145 kitti_360 INFO: log_tb_val_every_iters: -1
2024-08-27 15:21:28,145 kitti_360 INFO: log_tb_vis_every_iters: 1
2024-08-27 15:21:28,145 kitti_360 INFO: checkpoint_every: 500
2024-08-27 15:21:28,145 kitti_360 INFO: resume_from: None
2024-08-27 15:21:28,145 kitti_360 INFO: loss_during_validation: False
2024-08-27 15:21:28,145 kitti_360 INFO: num_epochs: 60
2024-08-27 15:21:28,145 kitti_360 INFO: stop_iteration: None
2024-08-27 15:21:28,145 kitti_360 INFO: learning_rate: 0.0001
2024-08-27 15:21:28,145 kitti_360 INFO: warmup_steps: 10000
2024-08-27 15:21:28,145 kitti_360 INFO: decay_rate: 0.5
2024-08-27 15:21:28,145 kitti_360 INFO: decay_steps: 100000
2024-08-27 15:21:28,145 kitti_360 INFO: num_steps: 100000
2024-08-27 15:21:28,145 kitti_360 INFO: backend: None
2024-08-27 15:21:28,145 kitti_360 INFO: nproc_per_node: None
2024-08-27 15:21:28,145 kitti_360 INFO: with_amp: False
2024-08-27 15:21:28,145 kitti_360 INFO: data: {'type': 'KITTI_360', 'data_path': '/data/GPT/semantic scene completion/KITTI-360', 'data_segmentation_path': '/data/GPT/s4c-main/panoptic_deeplab_R101_os32_cityscapes_hr', 'pose_path': '/data/GPT/semantic scene completion/KITTI-360/data_poses', 'split_path': 'datasets/kitti_360/splits/sscbench', 'image_size': [192, 640], 'data_stereo': True, 'data_fc': 2, 'fisheye_rotation': [0, -15], 'fisheye_offset': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40], 'is_preprocessed': True, 'constrain_to_datapoints': True, 'data_segmentation': True, 'color_aug': True, 'segmentation_mode': 'panoptic_deeplab'}
2024-08-27 15:21:28,146 kitti_360 INFO: use_backbone: True
2024-08-27 15:21:28,146 kitti_360 INFO: segmentation_mode: panoptic_deeplab
2024-08-27 15:21:28,146 kitti_360 INFO: save_best: {'metric': 'abs_rel', 'sign': -1}
2024-08-27 15:21:28,146 kitti_360 INFO: model_conf: {'arch': 'BTSNet', 'use_code': True, 'prediction_mode': 'default', 'code': {'num_freqs': 6, 'freq_factor': 1.5, 'include_input': True}, 'encoder': {'type': 'monodepth2', 'freeze': False, 'pretrained': True, 'resnet_layers': 50, 'num_ch_dec': [32, 32, 64, 128, 256], 'd_out': 64}, 'mlp_coarse': {'type': 'resnet', 'n_blocks': 0, 'd_hidden': 64}, 'mlp_fine': {'type': 'empty', 'n_blocks': 1, 'd_hidden': 128}, 'mlp_segmentation': {'type': 'resnet', 'n_blocks': 0, 'd_hidden': 64}, 'z_near': 3, 'z_far': 80, 'inv_z': True, 'n_frames_encoder': 1, 'n_frames_render': 2, 'frame_sample_mode': 'kitti360-mono', 'sample_mode': 'patch', 'patch_size': 8, 'ray_batch_size': 4096, 'flip_augmentation': True, 'learn_empty': False, 'code_mode': 'z', 'segmentation_mode': 'panoptic_deeplab'}
2024-08-27 15:21:28,146 kitti_360 INFO: loss: {'criterion': 'l1+ssim', 'invalid_policy': 'weight_guided', 'lambda_edge_aware_smoothness': 0.001, 'lambda_segmentation': 0.02, 'lambda_density_entropy': 0, 'segmentation_class_weights': {0: 1, 1: 10, 2: 1, 3: 1, 4: 1, 5: 10, 6: 5, 7: 10, 8: 1, 9: 1, 10: 1, 11: 5, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1}}
2024-08-27 15:21:28,146 kitti_360 INFO: scheduler: {'type': 'step', 'step_size': 120000, 'gamma': 0.1}
2024-08-27 15:21:28,146 kitti_360 INFO: renderer: {'n_coarse': 64, 'n_fine': 0, 'n_fine_depth': 0, 'depth_std': 1.0, 'sched': [], 'white_bkgd': False, 'lindisp': True, 'hard_alpha_cap': True, 'eval_batch_size': 200000}
2024-08-27 15:21:28,146 kitti_360 INFO:
2024-08-27 15:21:28,147 kitti_360 INFO: Output path: out/kitti_360/kitti_360_backend-None-1_20240827-152128
Using maximum datapoint as last image of sequence.
Using maximum datapoint as last image of sequence.
2024-08-27 15:21:31,491 ignite.distributed.auto.auto_dataloader INFO: Use data loader kwargs for dataset '<datasets.kitti_360.': {'batch_size': 1, 'num_workers': 4, 'shuffle': True, 'drop_last': True, 'pin_memory': True}
2024-08-27 15:21:31,492 ignite.distributed.auto.auto_dataloader INFO: Use data loader kwargs for dataset '<datasets.kitti_360.': {'batch_size': 1, 'num_workers': 4, 'shuffle': False, 'pin_memory': True}
2024-08-27 15:21:31,492 ignite.distributed.auto.auto_dataloader INFO: Use data loader kwargs for dataset '<datasets.kitti_360.': {'batch_size': 1, 'num_workers': 4, 'shuffle': False, 'pin_memory': True}
2024-08-27 15:21:31,492 kitti_360 INFO: Dataset length: Train: 96902, Test: 256
/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torchvision/models/_utils.py:136: UserWarning: Using 'weights' as positional parameter(s) is deprecated since 0.13 and may be removed in the future. Please use keyword parameter(s) instead.
  f"Using {sequence_to_str(tuple(keyword_only_kwargs.keys()), separate_last='and ')} as positional "
/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=ResNet50_Weights.IMAGENET1K_V1. You can also use weights=ResNet50_Weights.DEFAULT to get the most up-to-date weights.
  warnings.warn(msg)
Using linear displacement rays
2024-08-27 15:21:34,264 ignite.distributed.auto.auto_model INFO: Apply torch DataParallel on model
2024-08-27 15:21:34,268 kitti_360 INFO: Model parameters: 34899644
2024-08-27 15:21:36,103 kitti_360 INFO: Engine run starting with max_epochs=60.
2024-08-27 15:21:41,754 kitti_360 ERROR: Current run is terminating due to exception: Gather function not implemented for CPU tensors
2024-08-27 15:21:41,808 kitti_360 ERROR: Engine run is terminating due to exception: Gather function not implemented for CPU tensors
2024-08-27 15:21:41,808 kitti_360 ERROR: Traceback (most recent call last):
  File "/data/GPT/s4c-main/utils/base_trainer.py", line 221, in base_training
    trainer.run(train_loader, max_epochs=config["num_epochs"])
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 889, in run
    return self._internal_run()
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 932, in _internal_run
    return next(self._internal_run_generator)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 990, in _internal_run_as_gen
    self._handle_exception(e)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 644, in _handle_exception
    raise e
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 956, in _internal_run_as_gen
    epoch_time_taken += yield from self._run_once_on_dataset_as_gen()
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 1096, in _run_once_on_dataset_as_gen
    self._handle_exception(e)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 644, in _handle_exception
    raise e
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 1077, in _run_once_on_dataset_as_gen
    self.state.output = self._process_function(self, self.state.batch)
  File "/data/GPT/s4c-main/utils/base_trainer.py", line 294, in train_step
    data = model(data)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 172, in forward
    return self.gather(outputs, self.output_device)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 184, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 86, in gather
    res = gather_map(outputs)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 78, in gather_map
    for k in out)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 78, in <genexpr>
    for k in out)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 71, in gather_map
    return Gather.apply(target_device, dim, outputs)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 57, in forward
    'Gather function not implemented for CPU tensors'
AssertionError: Gather function not implemented for CPU tensors
Error executing job with overrides: []
Traceback (most recent call last):
  File "train.py", line 36, in <module>
    main()
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/hydra/main.py", line 99, in decorated_main
    config_name=config_name,
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/hydra/_internal/utils.py", line 401, in _run_hydra
    overrides=overrides,
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/hydra/_internal/utils.py", line 458, in _run_app
    lambda: hydra.run(
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/hydra/_internal/utils.py", line 461, in <lambda>
    overrides=overrides,
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "train.py", line 32, in main
    parallel.run(training, config)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/distributed/launcher.py", line 316, in run
    func(local_rank, *args, **kwargs)
  File "/data/GPT/s4c-main/models/bts/trainer.py", line 442, in training
    return base_training(local_rank, config, get_dataflow, initialize, get_metrics, visualize)
  File "/data/GPT/s4c-main/utils/base_trainer.py", line 224, in base_training
    raise e
  File "/data/GPT/s4c-main/utils/base_trainer.py", line 221, in base_training
    trainer.run(train_loader, max_epochs=config["num_epochs"])
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 889, in run
    return self._internal_run()
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 932, in _internal_run
    return next(self._internal_run_generator)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 990, in _internal_run_as_gen
    self._handle_exception(e)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 644, in _handle_exception
    raise e
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 956, in _internal_run_as_gen
    epoch_time_taken += yield from self._run_once_on_dataset_as_gen()
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 1096, in _run_once_on_dataset_as_gen
    self._handle_exception(e)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 644, in _handle_exception
    raise e
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 1077, in _run_once_on_dataset_as_gen
    self.state.output = self._process_function(self, self.state.batch)
  File "/data/GPT/s4c-main/utils/base_trainer.py", line 294, in train_step
    data = model(data)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 172, in forward
    return self.gather(outputs, self.output_device)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 184, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 86, in gather
    res = gather_map(outputs)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 78, in gather_map
    for k in out)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 78, in <genexpr>
    for k in out)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 71, in gather_map
    return Gather.apply(target_device, dim, outputs)
  File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 57, in forward
    'Gather function not implemented for CPU tensors'
AssertionError: Gather function not implemented for CPU tensors
```

Hope to get your reply! Thank you very much!

Best wishes

PuTeGuo commented 2 months ago

To make it easier to view, here is a screenshot of the terminal output:

[Screenshot: CleanShot 2024-08-27 at 15 43 19@2x]

Best wishes!

ahayler commented 2 months ago

Given the error, it seems like some tensor that should be on the GPU isn't. I would try, e.g., setting a breakpoint here:

```
File "/data/GPT/s4c-main/utils/base_trainer.py", line 294, in train_step
    data = model(data)
```

and checking which tensors are not on the GPU (and then try moving them to the GPU).
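
For example, a minimal sketch of such a check (this assumes, as the traceback suggests, that `data` is a dict whose values may be tensors; the helper name is just illustrative):

```python
import torch

def report_cpu_tensors(data: dict) -> None:
    """Print every tensor entry in the dict that is still on the CPU."""
    for key, value in data.items():
        if isinstance(value, torch.Tensor) and not value.is_cuda:
            print(f"{key}: shape={tuple(value.shape)}, device={value.device}")
```

Calling this on `data` right before `data = model(data)`, and on whatever the model returns, should show which entries never made it onto the GPU.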

I could also imagine that this is caused by your dual-GPU setup. You could try using only one GPU by setting

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```

at the top of your training script.

PuTeGuo commented 2 months ago

Thank you very much for your reply!

I just tried adding the single-GPU setting at the top of train.py as you suggested, but I still get the same error.

I will also try your other suggestion and set a breakpoint!

Thank you sincerely!

PuTeGuo commented 2 months ago

Hello!

I went through the README step by step again and still got the same error. Then, following your suggestion, I restricted training to a single GPU:

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```

The code then runs normally, but only a single card is used for training, which affects the training speed. (In addition, my single GPU has only 24GB of memory, and the training code couldn't run even with the batch size set to 1.) I would like to train with multiple GPUs, so I switched to a dual-GPU setting:

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
```

As a result, I got the same error as before!

I noticed that I had no problem with multi-GPU training using the code from BTS.

Can you help me solve the error that occurs with the multi-GPU run? Thank you very much!

Hope to get your reply!

Best wishes!

JSestak commented 2 months ago

Hello! I had the same problem as you (one of them), and I solved it by adding the following lines at the end of the BTSWrapper forward() function in bts/trainer.py:

```python
cpu_keys = [k for k, v in data.items() if isinstance(v, torch.Tensor) and v.device == torch.device("cpu")]
for k in cpu_keys:
    data[k] = data[k].cuda()
return data
```
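
Presumably this works because DataParallel gathers the per-replica outputs back onto the output device, and that gather step is only implemented for CUDA tensors, so any CPU tensor left in the returned dictionary triggers exactly this assertion.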

PuTeGuo commented 2 months ago

Hello! Thank you very much for your reply!

This really makes the code work! However, I noticed that after adding the code you suggested, training only uses one GPU (it seems to default to GPU 0) and cannot use multiple GPUs. Do you have any idea how to solve this?

[screenshot]

Best wishes!

JSestak commented 2 months ago

I decided to just reduce the complexity of the model so that it can run on a single GPU (I have a similar setup to yours, 2x 24 GB). At the start of training it seems like both of my GPUs are doing work, but shortly afterwards all the work goes to one GPU. Good luck with your work.

PuTeGuo commented 2 months ago

Thank you very much! Your help means a lot to me!

Good luck with your work!