Closed: PuTeGuo closed this issue 1 month ago.
Hey, thank you for your interest in our work. It is a little hard to debug your error with the limited information you provided. Could you check whether the GPUs are really available in your conda/micromamba environment? (https://stackoverflow.com/questions/48152674/how-do-i-check-if-pytorch-is-using-the-gpu).
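For reference, a minimal check along the lines of the linked answer (just a sketch, assuming PyTorch is already installed in the environment):

import torch

# Quick sanity check that PyTorch can see the GPUs in this environment.
print(torch.cuda.is_available())       # expected: True
print(torch.cuda.device_count())       # expected: 2 on a dual-GPU machine
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 3090"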
All the best, Adrian
Hi! Thank you very much for replying to my question!
I checked again whether my environment can use the GPUs normally. As shown in the figure, GPU usage appears to be normal.
I will provide more detailed error information below:
(QDiffusionPy3.7) gpu@gpu-Server:/data/GPT/s4c-main$ python train.py -cn exp_kitti_360
Tue Aug 27 15:21:27 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:1A:00.0 Off | N/A |
| 0% 49C P8 26W / 350W | 5MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:68:00.0 On | N/A |
| 0% 48C P8 36W / 350W | 40MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1180      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1180      G   /usr/lib/xorg/Xorg                 38MiB |
+-----------------------------------------------------------------------------+
+++ Inference Setup Complete +++
2024-08-27 15:21:28,121 ignite.distributed.launcher.Parallel INFO: - Run '<function training at 0x7ff267a6c830>' in 1 processes
2024-08-27 15:21:28,124 kitti_360 INFO: Run kitti_360
2024-08-27 15:21:28,124 kitti_360 INFO: - PyTorch version: 1.13.0+cu117
2024-08-27 15:21:28,124 kitti_360 INFO: - Ignite version: 0.5.0.post2
2024-08-27 15:21:28,142 kitti_360 INFO: - GPU Device: NVIDIA GeForce RTX 3090
2024-08-27 15:21:28,142 kitti_360 INFO: - CUDA version: 11.7
2024-08-27 15:21:28,143 kitti_360 INFO: - CUDNN version: 8500
2024-08-27 15:21:28,143 kitti_360 INFO:
2024-08-27 15:21:28,144 kitti_360 INFO: Configuration:
2024-08-27 15:21:28,145 kitti_360 INFO: name: kitti_360
2024-08-27 15:21:28,145 kitti_360 INFO: model: bts
2024-08-27 15:21:28,145 kitti_360 INFO: seed: 0
2024-08-27 15:21:28,145 kitti_360 INFO: output_path: out/kitti_360
2024-08-27 15:21:28,145 kitti_360 INFO: batch_size: 1
2024-08-27 15:21:28,145 kitti_360 INFO: num_workers: 4
2024-08-27 15:21:28,145 kitti_360 INFO: eval_use_iters: True
2024-08-27 15:21:28,145 kitti_360 INFO: vis_use_iters: True
2024-08-27 15:21:28,145 kitti_360 INFO: validate_every: 1000
2024-08-27 15:21:28,145 kitti_360 INFO: visualize_every: 500
2024-08-27 15:21:28,145 kitti_360 INFO: log_every_iters: 1
2024-08-27 15:21:28,145 kitti_360 INFO: log_tb_train_every_iters: -1
2024-08-27 15:21:28,145 kitti_360 INFO: log_tb_val_every_iters: -1
2024-08-27 15:21:28,145 kitti_360 INFO: log_tb_vis_every_iters: 1
2024-08-27 15:21:28,145 kitti_360 INFO: checkpoint_every: 500
2024-08-27 15:21:28,145 kitti_360 INFO: resume_from: None
2024-08-27 15:21:28,145 kitti_360 INFO: loss_during_validation: False
2024-08-27 15:21:28,145 kitti_360 INFO: num_epochs: 60
2024-08-27 15:21:28,145 kitti_360 INFO: stop_iteration: None
2024-08-27 15:21:28,145 kitti_360 INFO: learning_rate: 0.0001
2024-08-27 15:21:28,145 kitti_360 INFO: warmup_steps: 10000
2024-08-27 15:21:28,145 kitti_360 INFO: decay_rate: 0.5
2024-08-27 15:21:28,145 kitti_360 INFO: decay_steps: 100000
2024-08-27 15:21:28,145 kitti_360 INFO: num_steps: 100000
2024-08-27 15:21:28,145 kitti_360 INFO: backend: None
2024-08-27 15:21:28,145 kitti_360 INFO: nproc_per_node: None
2024-08-27 15:21:28,145 kitti_360 INFO: with_amp: False
2024-08-27 15:21:28,145 kitti_360 INFO: data: {'type': 'KITTI_360', 'data_path': '/data/GPT/semantic scene completion/KITTI-360', 'data_segmentation_path': '/data/GPT/s4c-main/panoptic_deeplab_R101_os32_cityscapes_hr', 'pose_path': '/data/GPT/semantic scene completion/KITTI-360/data_poses', 'split_path': 'datasets/kitti_360/splits/sscbench', 'image_size': [192, 640], 'data_stereo': True, 'data_fc': 2, 'fisheye_rotation': [0, -15], 'fisheye_offset': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40], 'is_preprocessed': True, 'constrain_to_datapoints': True, 'data_segmentation': True, 'color_aug': True, 'segmentation_mode': 'panoptic_deeplab'}
2024-08-27 15:21:28,146 kitti_360 INFO: use_backbone: True
2024-08-27 15:21:28,146 kitti_360 INFO: segmentation_mode: panoptic_deeplab
2024-08-27 15:21:28,146 kitti_360 INFO: save_best: {'metric': 'abs_rel', 'sign': -1}
2024-08-27 15:21:28,146 kitti_360 INFO: model_conf: {'arch': 'BTSNet', 'use_code': True, 'prediction_mode': 'default', 'code': {'num_freqs': 6, 'freq_factor': 1.5, 'include_input': True}, 'encoder': {'type': 'monodepth2', 'freeze': False, 'pretrained': True, 'resnet_layers': 50, 'num_ch_dec': [32, 32, 64, 128, 256], 'd_out': 64}, 'mlp_coarse': {'type': 'resnet', 'n_blocks': 0, 'd_hidden': 64}, 'mlp_fine': {'type': 'empty', 'n_blocks': 1, 'd_hidden': 128}, 'mlp_segmentation': {'type': 'resnet', 'n_blocks': 0, 'd_hidden': 64}, 'z_near': 3, 'z_far': 80, 'inv_z': True, 'n_frames_encoder': 1, 'n_frames_render': 2, 'frame_sample_mode': 'kitti360-mono', 'sample_mode': 'patch', 'patch_size': 8, 'ray_batch_size': 4096, 'flip_augmentation': True, 'learn_empty': False, 'code_mode': 'z', 'segmentation_mode': 'panoptic_deeplab'}
2024-08-27 15:21:28,146 kitti_360 INFO: loss: {'criterion': 'l1+ssim', 'invalid_policy': 'weight_guided', 'lambda_edge_aware_smoothness': 0.001, 'lambda_segmentation': 0.02, 'lambda_density_entropy': 0, 'segmentation_class_weights': {0: 1, 1: 10, 2: 1, 3: 1, 4: 1, 5: 10, 6: 5, 7: 10, 8: 1, 9: 1, 10: 1, 11: 5, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1}}
2024-08-27 15:21:28,146 kitti_360 INFO: scheduler: {'type': 'step', 'step_size': 120000, 'gamma': 0.1}
2024-08-27 15:21:28,146 kitti_360 INFO: renderer: {'n_coarse': 64, 'n_fine': 0, 'n_fine_depth': 0, 'depth_std': 1.0, 'sched': [], 'white_bkgd': False, 'lindisp': True, 'hard_alpha_cap': True, 'eval_batch_size': 200000}
2024-08-27 15:21:28,146 kitti_360 INFO:
2024-08-27 15:21:28,147 kitti_360 INFO: Output path: out/kitti_360/kitti_360_backend-None-1_20240827-152128
Using maximum datapoint as last image of sequence.
Using maximum datapoint as last image of sequence.
2024-08-27 15:21:31,491 ignite.distributed.auto.auto_dataloader INFO: Use data loader kwargs for dataset '<datasets.kitti_360.':
{'batch_size': 1, 'num_workers': 4, 'shuffle': True, 'drop_last': True, 'pin_memory': True}
2024-08-27 15:21:31,492 ignite.distributed.auto.auto_dataloader INFO: Use data loader kwargs for dataset '<datasets.kitti_360.':
{'batch_size': 1, 'num_workers': 4, 'shuffle': False, 'pin_memory': True}
2024-08-27 15:21:31,492 ignite.distributed.auto.auto_dataloader INFO: Use data loader kwargs for dataset '<datasets.kitti_360.':
{'batch_size': 1, 'num_workers': 4, 'shuffle': False, 'pin_memory': True}
2024-08-27 15:21:31,492 kitti_360 INFO: Dataset length: Train: 96902, Test: 256
/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torchvision/models/_utils.py:136: UserWarning: Using 'weights' as positional parameter(s) is deprecated since 0.13 and may be removed in the future. Please use keyword parameter(s) instead.
f"Using {sequence_to_str(tuple(keyword_only_kwargs.keys()), separate_last='and ')} as positional "
/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=ResNet50_Weights.IMAGENET1K_V1. You can also use weights=ResNet50_Weights.DEFAULT to get the most up-to-date weights.
warnings.warn(msg)
Using linear displacement rays
2024-08-27 15:21:34,264 ignite.distributed.auto.auto_model INFO: Apply torch DataParallel on model
2024-08-27 15:21:34,268 kitti_360 INFO: Model parameters: 34899644
2024-08-27 15:21:36,103 kitti_360 INFO: Engine run starting with max_epochs=60.
2024-08-27 15:21:41,754 kitti_360 ERROR: Current run is terminating due to exception: Gather function not implemented for CPU tensors
2024-08-27 15:21:41,808 kitti_360 ERROR: Engine run is terminating due to exception: Gather function not implemented for CPU tensors
2024-08-27 15:21:41,808 kitti_360 ERROR:
Traceback (most recent call last):
File "/data/GPT/s4c-main/utils/base_trainer.py", line 221, in base_training
trainer.run(train_loader, max_epochs=config["num_epochs"])
File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 889, in run
return self._internal_run()
File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 932, in _internal_run
return next(self._internal_run_generator)
File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 990, in _internal_run_as_gen
self._handle_exception(e)
File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 644, in _handle_exception
raise e
File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 956, in _internal_run_as_gen
epoch_time_taken += yield from self._run_once_on_dataset_as_gen()
File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 1096, in _run_once_on_dataset_as_gen
self._handle_exception(e)
File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 644, in _handle_exception
raise e
File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/ignite/engine/engine.py", line 1077, in _run_once_on_dataset_as_gen
self.state.output = self._process_function(self, self.state.batch)
File "/data/GPT/s4c-main/utils/base_trainer.py", line 294, in train_step
data = model(data)
File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 172, in forward
return self.gather(outputs, self.output_device)
File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 184, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 86, in gather
res = gather_map(outputs)
File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 78, in gather_map
for k in out)
File "/home/gpu/anaconda3/envs/QDiffusionPy3.7/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 78, in
Hope to get your reply! Thank you very much!
Best wishes!
To make it easier to view, I took a screenshot of the terminal output:
Best wishes!
Given the error, it seems like some tensor that should be mapped to the GPU isn't. I would try, e.g., setting a breakpoint here:
File "/data/GPT/s4c-main/utils/base_trainer.py", line 294, in train_step data = model(data)
and checking which tensors are not on the GPU (and try mapping them to the GPU).
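For example, a minimal inspection sketch (assuming data is a dict of tensors, as the traceback suggests), placed just above the model(data) call in train_step:

import torch

# List every tensor in the batch together with its device; any "cpu" entries
# are the ones DataParallel's gather will fail on.
for k, v in data.items():
    if isinstance(v, torch.Tensor):
        print(k, v.device)
        # data[k] = v.cuda()  # optionally move the offending tensor to the GPU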
I could also imagine that this is caused by your dual-GPU setup. You could try using only one GPU by setting:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
at the top of your training script.
Thank you very much for your reply!
I just tried adding the single-GPU setting at the top of the train.py file as you suggested, but I still get the same error.
I will also try setting a breakpoint as you suggested!
Thank you sincerely!
Hello!
I went through the README step by step again and still got this error. Then, following your suggestion, I restricted the run to a single GPU:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
With this, the code runs normally, but only a single card is used for training, which affects the training speed (in addition, my single GPU has only 24 GB of memory, and the training could not run even with the batch size set to 1). I hope to be able to train with multiple GPUs, so I switched to a dual-GPU setting:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
As a result, I got the same error as before!
I noticed that I had no problem with multi-GPU training using the code from BTS.
Can you help me solve the error that occurs when running on multiple cards? Thank you very much!
Hope to get your reply!
Best wishes!
Hello! I had the same problem as you (one of them), and I solved it by adding the following lines at the end of the BTSWrapper forward() function in bts/trainer.py:
cpu_keys = [k for k, v in data.items() if isinstance(v, torch.Tensor) and v.device == torch.device("cpu")]
for k in cpu_keys:
    data[k] = data[k].cuda()
return data
Hello! Thank you very much for your reply!
This really makes the code work! However, I noticed that after adding the code you suggested, training only uses one GPU (it seems to default to GPU 0) and cannot use multiple GPUs. Do you have any way to solve this?
Best wishes!
I decided just to reduce the complexity of the model so that it can run with only one GPU (similar setup to yours, 2x 24 GB). At the start of training it seems like both of my GPUs are doing work, but shortly afterwards all the work goes to one GPU. Good luck with your work.
Thank you very much! Your help means a lot to me!
Good luck with your work!
Hello! Thank you very much for such new and excellent work! I have a question for you: when I run the training file train.py, there is an error: "AssertionError: Gather function not implemented for CPU tensors".
However, I don't know what went wrong; I followed the steps in your README exactly. I am using a dual-card machine; the GPUs are NVIDIA RTX 4090s, and a single card has 24 GB of memory. Hope to get your reply!