Closed · madanmch closed this issue 1 year ago
As per https://github.com/JonasSchult/Mask3D/issues/36#issuecomment-1375005213 I removed the benchmark_03 folder and training started. After a few seconds, I got a CUDA out-of-memory error and training aborted.
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 314.00 MiB (GPU 0; 7.80 GiB total capacity; 5.83 GiB already allocated; 78.12 MiB free; 6.03 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I also reduced OMP_NUM_THREADS to 1 and CURR_QUERY to 50, but I still hit the same CUDA error. I also tried exporting PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb with values from 512 up to 2048, but it did not help.
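For reference, here is a minimal sketch of the environment I set before relaunching the training command from my original post below (the values are the ones mentioned above; paths and the rest of the command are unchanged):

export OMP_NUM_THREADS=1
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512   # also tried values up to 2048
CURR_QUERY=50   # picked up by the launch command via model.num_queries=${CURR_QUERY}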
Hi!
Which GPU do you use?
Training on 2cm voxels is quite memory intensive (48GB VRAM recommended). You can reduce the memory footprint by setting data.voxel_size=0.05.
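As a rough example (not the exact benchmark script), the override is passed on the command line next to the flags you already use; only the experiment name is shown here for brevity:

python main_instance_segmentation.py \
  general.experiment_name="benchmark_03" \
  data.voxel_size=0.05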
Best, Jonas
Thanks Jonas, for the response. Using "glxinfo | grep -E -i 'device|memory'" in my Ubuntu terminal, I found that my GPU has only 3GB of video memory, which is very low compared to the recommended 48GB of VRAM. I will look into options for a machine with a larger GPU.
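As a side note for anyone checking the same thing: assuming the NVIDIA driver is installed, nvidia-smi reports total and used VRAM directly, e.g.:

# query the GPU name plus total and used memory
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv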
Btw, @JonasSchult, what is the minimum VRAM needed to run inference with the Mask3D model? Thanks
For the best model, we trained with a voxel size of 2cm on an A40 GPU with 48GB RAM.
Best, Jonas
Hi @JonasSchult, excellent work, and thank you for helping others run their models.
I installed the Mask3D code on my Ubuntu 20.04 system with CUDA 11.6 support and successfully installed all the required Python packages.
When I ran the command below (part of stpls3d_benchmark.sh):
export OMP_NUM_THREADS=3
export HYDRA_FULL_ERROR=1

CURR_DBSCAN=12.5
CURR_TOPK=200
CURR_QUERY=160
CURR_SIZE=54
CURR_THRESHOLD=0.01

# TRAIN network 1 with voxel size 0.333
python main_instance_segmentation.py \
  general.experiment_name="benchmark_03" \
  general.project_name="stpls3d" \
  data/datasets=stpls3d \
  general.num_targets=15 \
  data.num_labels=15 \
  data.voxel_size=0.333 \
  data.num_workers=10 \
  data.cache_data=true \
  data.cropping_v1=false \
  general.reps_per_epoch=100 \
  general.checkpoint="checkpoints/stpls3d/stpls3d_benchmark_03.ckpt" \
  model.num_queries=${CURR_QUERY} \
  general.on_crops=true \
  model.config.backbone._target_=models.Res16UNet18B \
  data.crop_length=${CURR_SIZE} \
  general.eval_inner_core=50.0 \
  data.train_mode=train_validation
I am getting the stack trace below (captured on the wandb.ai site).
Traceback (most recent call last):
File "/home/spartan/GitHubCode/Mask3D/main_instance_segmentation.py", line 105, in <module>
main()
File "/home/spartan/anaconda3/envs/mask3ddCl/lib/python3.10/site-packages/hydra/main.py", line 32, in decorated_main
_run_hydra(
File "/home/spartan/anaconda3/envs/mask3ddCl/lib/python3.10/site-packages/hydra/_internal/utils.py", line 346, in _run_hydra
run_and_report(
File "/home/spartan/anaconda3/envs/mask3ddCl/lib/python3.10/site-packages/hydra/_internal/utils.py", line 201, in run_and_report
raise ex
File "/home/spartan/anaconda3/envs/mask3ddCl/lib/python3.10/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
return func()
File "/home/spartan/anaconda3/envs/mask3ddCl/lib/python3.10/site-packages/hydra/_internal/utils.py", line 347, in
lambda: hydra.run(
File "/home/spartan/anaconda3/envs/mask3ddCl/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 107, in run
return run_job(
File "/home/spartan/anaconda3/envs/mask3ddCl/lib/python3.10/site-packages/hydra/core/utils.py", line 128, in run_job
ret.return_value = task_function(task_cfg)
File "/home/spartan/GitHubCode/Mask3D/main_instance_segmentation.py", line 98, in main
train(cfg)
File "/home/spartan/anaconda3/envs/mask3ddCl/lib/python3.10/site-packages/hydra/main.py", line 27, in decorated_main
return task_function(cfg_passthrough)
File "/home/spartan/GitHubCode/Mask3D/main_instance_segmentation.py", line 78, in train
runner.fit(model)
File "/home/spartan/anaconda3/envs/mask3ddCl/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File "/home/spartan/anaconda3/envs/mask3ddCl/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/spartan/anaconda3/envs/mask3ddCl/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/spartan/anaconda3/envs/mask3ddCl/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1164, in _run
self._checkpoint_connector.restore_training_state()
File "/home/spartan/anaconda3/envs/mask3ddCl/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 202, in restore_training_state
self.restore_loops()
File "/home/spartan/anaconda3/envs/mask3ddCl/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 274, in restore_loops
batch_loop.optimizer_loop.optim_progress.optimizer.step.total.completed = self._loaded_checkpoint[
KeyError: 'global_step'
_raise(ex, cause)
File "/home/spartan/anaconda3/envs/Mask3d/lib/python3.10/site-packages/omegaconf/_utils.py", line 719, in _raise
raise ex.with_traceback(sys.exc_info()[2]) # set end OC_CAUSE=1 for full backtrace
File "/home/spartan/anaconda3/envs/Mask3d/lib/python3.10/site-packages/omegaconf/omegaconf.py", line 851, in _create_impl
raise ValidationError(
omegaconf.errors.ValidationError: Error instantiating 'models.mask3d.Mask3D' : Object of unsupported type: 'Res16UNet18B'
full_key:
object_type=None
I have hydra-core, iopath, and omegaconf installed at the versions required by the Mask3D requirements.txt. How can I fix this omegaconf error?