EPFL-VILAB / MultiMAE

MultiMAE: Multi-modal Multi-task Masked Autoencoders, ECCV 2022
https://multimae.epfl.ch

Facing issues in pretraining the code on custom dataset #12

Closed hellfire504 closed 1 year ago

hellfire504 commented 1 year ago

Hi,

I am trying to pretrain the model on the Celeb-HQ dataset, and I successfully created the corresponding grayscale depth maps (PNG) and grayscale segmentation maps (PNG) for pretraining.
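
For reference, a rough way to inspect the generated maps before training (the file names below are placeholders, not my actual paths):

from PIL import Image
import numpy as np

# Placeholder file names -- substitute any generated depth / segmentation PNG.
for name in ["depth_example.png", "semseg_example.png"]:
    arr = np.array(Image.open(name))
    # Both maps load as single-channel grayscale arrays; printing the dtype,
    # shape and value range shows what the dataloader will actually see.
    print(name, arr.dtype, arr.shape, arr.min(), arr.max())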

However, when I try to train with

OMP_NUM_THREADS=1 torchrun --nproc_per_node=8 run_pretraining_multimae.py --config cfgs/pretrain/multimae-b_98_rgb+-depth-semseg_1600e.yaml --data_path /home/gargatik/gargatik/Datasets/copy/multimae/train

I am facing the following issue:


____Start___

../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [92,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [92,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "run_pretraining_multimae.py", line 585, in <module>
    main(opts)
  File "run_pretraining_multimae.py", line 414, in main
    train_stats = train_one_epoch(
  File "run_pretraining_multimae.py", line 501, in train_one_epoch
    preds, masks = model(
  File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/train-data-3-ssd/gargatik/inpaint_proj/MultiMAE/multimae/multimae.py", line 312, in forward
    input_task_tokens = {
  File "/mnt/train-data-3-ssd/gargatik/inpaint_proj/MultiMAE/multimae/multimae.py", line 313, in <dictcomp>
    domain: self.input_adapters[domain](tensor)
  File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/train-data-3-ssd/gargatik/inpaint_proj/MultiMAE/multimae/input_adapters.py", line 232, in forward
    x_patch = rearrange(self.proj(x), 'b d nh nw -> b (nh nw) d')
  File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from createEvent at ../aten/src/ATen/cuda/CUDAEvent.h:166 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f1c685031ee in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0xf3c2d (0x7f1caad91c2d in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: + 0xf6f6e (0x7f1caad94f6e in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: + 0x463418 (0x7f1cba0f6418 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f1c684ea7a5 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: + 0x35f2f5 (0x7f1cb9ff22f5 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x679288 (0x7f1cba30c288 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f1cba30c655 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x5ccad3]
frame #9: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x5d270c]
frame #10: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x5ec780]
frame #11: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x5441f8]
frame #12: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x54424a]
frame #13: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x54424a]
frame #14: PyDict_SetItemString + 0x536 (0x5d1686 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)
frame #15: PyImport_Cleanup + 0x79 (0x684619 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)
frame #16: Py_FinalizeEx + 0x7f (0x67f8af in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)
frame #17: Py_RunMain + 0x32d (0x6b70fd in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)
frame #18: Py_BytesMain + 0x2d (0x6b736d in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)
frame #19: __libc_start_main + 0xf3 (0x7f1cd8fc10b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #20: _start + 0x2e (0x5fa5ce in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)

Traceback (most recent call last):
  File "run_pretraining_multimae.py", line 585, in <module>
    main(opts)
  File "run_pretraining_multimae.py", line 414, in main
    train_stats = train_one_epoch(
  File "run_pretraining_multimae.py", line 501, in train_one_epoch
    preds, masks = model(
  File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/train-data-3-ssd/gargatik/inpaint_proj/MultiMAE/multimae/multimae.py", line 312, in forward
    input_task_tokens = {
  File "/mnt/train-data-3-ssd/gargatik/inpaint_proj/MultiMAE/multimae/multimae.py", line 313, in <dictcomp>
    domain: self.input_adapters[domain](tensor)
  File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/train-data-3-ssd/gargatik/inpaint_proj/MultiMAE/multimae/input_adapters.py", line 232, in forward
    x_patch = rearrange(self.proj(x), 'b d nh nw -> b (nh nw) d')
  File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([256, 64, 56, 56], dtype=torch.half, device='cuda', requires_grad=True).to(memory_format=torch.channels_last)
net = torch.nn.Conv2d(64, 768, kernel_size=[4, 4], padding=[0, 0], stride=[4, 4], dilation=[1, 1], groups=1)
net = net.cuda().half().to(memory_format=torch.channels_last)
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
    data_type = CUDNN_DATA_HALF
    padding = [0, 0, 0]
    stride = [4, 4, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0xc853ff10
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 256, 64, 56, 56,
    strideA = 200704, 1, 3584, 64,
output: TensorDescriptor 0xc8540270
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 256, 768, 14, 14,
    strideA = 150528, 1, 10752, 768,
weight: FilterDescriptor 0x819c34f0
    type = CUDNN_DATA_HALF
    tensor_format = CUDNN_TENSOR_NHWC
    nbDims = 4
    dimA = 768, 64, 4, 4,
Pointer addresses:
    input: 0x7f12aa000000
    output: 0x7f12ca000000
    weight: 0x7f13d9200c00

terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from createEvent at ../aten/src/ATen/cuda/CUDAEvent.h:166 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f147c2811ee in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0xf3c2d (0x7f14beb0fc2d in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: + 0xf6f6e (0x7f14beb12f6e in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: + 0x463418 (0x7f14cde74418 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f147c2687a5 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: + 0x35f2f5 (0x7f14cdd702f5 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x679288 (0x7f14ce08a288 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f14ce08a655 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x5ccad3]
frame #9: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x5d270c]
frame #10: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x5ec780]
frame #11: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x5441f8]
frame #12: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x54424a]
frame #13: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x54424a]
frame #14: PyDict_SetItemString + 0x536 (0x5d1686 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)
frame #15: PyImport_Cleanup + 0x79 (0x684619 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)
frame #16: Py_FinalizeEx + 0x7f (0x67f8af in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)
frame #17: Py_RunMain + 0x32d (0x6b70fd in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)
frame #18: Py_BytesMain + 0x2d (0x6b736d in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)
frame #19: __libc_start_main + 0xf3 (0x7f14ecd3f0b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #20: _start + 0x2e (0x5fa5ce in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182198 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182199 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182200 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182202 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182203 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182204 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182205 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 3 (pid: 182201) of binary: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python
Traceback (most recent call last):
  File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

run_pretraining_multimae.py FAILED

Failures:

-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2022-07-09_16:11:15
  host : Norwalk
  rank : 3 (local_rank: 3)
  exitcode : -6 (pid: 182201)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 182201
_______________________________________________________________________________________________________
____________________________________________END__________________________________________________________

Thanks for the help
dmizr commented 1 year ago

Hi @hellfire504,

Unfortunately, this error message is not very informative, and we never attempted to pre-train MultiMAE on the Celeb-HQ dataset, so it is difficult to pinpoint the problem from here. Running with the CUDA_LAUNCH_BLOCKING=1 flag, or running the same code on CPU, should produce a more interpretable error message. If you are still stuck after that, inspect the tensors returned by the dataloader before they are passed to the model and make sure they have the expected shapes and dtypes. Also, if you used a different model to generate the semantic segmentation maps, don't forget to change the number of classes in DOMAIN_CONF accordingly.
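
For example, a quick inspection along those lines could look like the sketch below. The batch structure and the class count are assumptions on my side; use the data loader your run actually builds and the num_classes value from the semseg entry of DOMAIN_CONF.

from torch.utils.data import DataLoader

def inspect_first_batch(data_loader: DataLoader, num_semseg_classes: int) -> None:
    # Print shape / dtype / value range of every modality in one batch and check
    # that the segmentation labels fit the semseg adapter's embedding table.
    for tasks_dict, _ in data_loader:  # assumed structure: (dict of task tensors, target)
        for domain, tensor in tasks_dict.items():
            print(domain, tuple(tensor.shape), tensor.dtype,
                  float(tensor.min()), float(tensor.max()))
        # The semseg input adapter embeds integer class indices, so any index that is
        # negative or >= num_semseg_classes triggers exactly the kind of
        # "srcIndex < srcSelectDimSize" device-side assert shown in your log.
        semseg = tasks_dict["semseg"]
        assert semseg.min() >= 0 and semseg.max() < num_semseg_classes, \
            "segmentation labels outside [0, num_classes): remap them or raise num_classes"
        break

# Hypothetical usage: `data_loader_train` is the loader created in the pretraining script,
# and 133 is a placeholder for your DOMAIN_CONF semseg num_classes value.
# inspect_first_batch(data_loader_train, num_semseg_classes=133)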

Best, David