MCG-NJU / MeMOTR

[ICCV 2023] MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking
https://arxiv.org/abs/2307.15700
MIT License

Distributed operation #7

Closed Tobelakers closed 8 months ago

Tobelakers commented 9 months ago

Hello, thank you for your excellent work. When reproducing the code, I used the following command:

python -m torch.distributed.run --nproc_per_node=8 main.py --mode submit --config-path /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/train/config.yaml --submit-dir /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/ --submit-model dab_deformable_detr.pth --use-distributed --data-root /home/sunzhaojie/MeMOTR/dataset/MOT17

The following error occurred while running the code:

Traceback (most recent call last):
  File "/home/sunzhaojie/MeMOTR/main.py", line 120, in <module>
    main(config=merged_config)
  File "/home/sunzhaojie/MeMOTR/main.py", line 97, in main
    torch.cuda.set_device(distributed_rank())
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

(The same traceback is printed by each of the other failed worker processes.)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3949687 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3949688 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 3949689) of binary: /home/sunzhaojie/.conda/envs/mot13/bin/python
Traceback (most recent call last):
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
[1]: time: 2023-12-15_17:11:52  host: ubuntu-Precision-7920-Tower  rank: 3 (local_rank: 3)  exitcode: 1 (pid: 3949690)  error_file: <N/A>  traceback: To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]: time: 2023-12-15_17:11:52  host: ubuntu-Precision-7920-Tower  rank: 4 (local_rank: 4)  exitcode: 1 (pid: 3949691)  error_file: <N/A>  traceback: To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]: time: 2023-12-15_17:11:52  host: ubuntu-Precision-7920-Tower  rank: 5 (local_rank: 5)  exitcode: 1 (pid: 3949692)  error_file: <N/A>  traceback: To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]: time: 2023-12-15_17:11:52  host: ubuntu-Precision-7920-Tower  rank: 6 (local_rank: 6)  exitcode: 1 (pid: 3949693)  error_file: <N/A>  traceback: To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]: time: 2023-12-15_17:11:52  host: ubuntu-Precision-7920-Tower  rank: 7 (local_rank: 7)  exitcode: 1 (pid: 3949694)  error_file: <N/A>  traceback: To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure): [0]: time : 2023-12-15_17:11:52 host : ubuntu-Precision-7920-Tower rank : 2 (local_rank: 2) exitcode : 1 (pid: 3949689) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

How can I solve this problem? Looking forward to your reply!
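For reference, "invalid device ordinal" means a worker called torch.cuda.set_device() with an index larger than the number of visible GPUs, i.e. --nproc_per_node=8 was used on a machine with fewer GPUs. A quick, generic check (not part of MeMOTR) before choosing --nproc_per_node:

import torch

# --nproc_per_node must not exceed the number of visible devices,
# because each rank calls torch.cuda.set_device(rank) at startup.
print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))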

Tobelakers commented 9 months ago

Hello, I have modified some of the code and now run the command:

python main.py --mode submit --config-path /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/train/config.yaml --submit-dir /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/ --submit-model memotr_mot17.pth --use-distributed --data-root /home/sunzhaojie/MeMOTR/dataset/

to run the code on a specified CUDA device without the distributed launcher. As a result, the following error occurred:

Traceback (most recent call last):
  File "/home/sunzhaojie/MeMOTR/main.py", line 121, in <module>
    main(config=merged_config)
  File "/home/sunzhaojie/MeMOTR/main.py", line 106, in main
    submit(config=config)
  File "/home/sunzhaojie/MeMOTR/submit_engine.py", line 224, in submit
    submitter.run()
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/sunzhaojie/MeMOTR/submit_engine.py", line 73, in run
    res = self.model(frame=frame, tracks=tracks)
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sunzhaojie/MeMOTR/models/memotr.py", line 133, in forward
    outputs, init_reference, inter_references, inter_queries = self.transformer(
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sunzhaojie/MeMOTR/models/deformable_transformer.py", line 225, in forward
    memory = checkpoint(self.encoder, src_flatten, spatial_shapes, level_start_index,
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 251, in checkpoint
    return _checkpoint_without_reentrant(
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 420, in _checkpoint_without_reentrant
    output = function(*args, **kwargs)
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sunzhaojie/MeMOTR/models/deformable_encoder.py", line 59, in forward
    output = layer(output, pos, reference_points, spatial_shapes, level_start_index, padding_mask)
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sunzhaojie/MeMOTR/models/deformable_encoder.py", line 124, in forward
    src2 = self.self_attn(self.with_pos_embed(src, pos),
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sunzhaojie/MeMOTR/models/ops/modules/ms_deform_attn.py", line 129, in forward
    output = self.output_proj(output)
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())

How can I solve this problem? Please reply, thank you!
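A generic debugging step for asynchronous CUDA errors like this one (PyTorch itself suggests it in the earlier traceback, and it is not specific to MeMOTR) is to rerun the same command with CUDA_LAUNCH_BLOCKING=1 so the failing kernel is reported at its exact call site:

CUDA_LAUNCH_BLOCKING=1 python main.py --mode submit --config-path /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/train/config.yaml --submit-dir /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/ --submit-model memotr_mot17.pth --use-distributed --data-root /home/sunzhaojie/MeMOTR/dataset/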

HELLORPG commented 9 months ago

I have not seen this error before. However, based on the error message you provided, it seems like there might be an issue with the CUDA memory or the driver version on your system. May I ask for some details of your environment (such as the NVIDIA driver version, CUDA version, and PyTorch version)?
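For example, these can be printed with a short generic snippet (nothing MeMOTR-specific):

import torch

print("torch:", torch.__version__)                  # PyTorch version
print("built with CUDA:", torch.version.cuda)       # CUDA version PyTorch was compiled against
print("cuDNN:", torch.backends.cudnn.version())     # cuDNN version
print("GPU 0:", torch.cuda.get_device_name(0))      # GPU model

Running python -m torch.utils.collect_env additionally reports the NVIDIA driver version and other system details.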

And one more thing: I have reviewed my code and found an error in the configuration related to model training. I have already fixed it on the latest commit.

Tobelakers commented 9 months ago

Thank you for your reply. Sorry, I haven't worked with distributed programs before. My CUDA version is 11.7 and my torch version is 1.13.1, so there should be no problem there. I use two 3090 graphics cards, so it should be nproc_per_node=2. I modified the run command to:

python -m torch.distributed.run --nproc_per_node=2 main.py --mode submit --config-path /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/train/config.yaml --submit-dir /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/ --submit-model memotr_mot17.pth --use-distributed --data-root /home/sunzhaojie/MeMOTR/dataset/

and a new error occurred:

Traceback (most recent call last):
  File "/home/sunzhaojie/MeMOTR/main.py", line 121, in <module>
    main(config=merged_config)
  File "/home/sunzhaojie/MeMOTR/main.py", line 106, in main
    submit(config=config)
  File "/home/sunzhaojie/MeMOTR/submit_engine.py", line 196, in submit
    model = DDP(module=model, device_ids=[distributed_rank()], find_unused_parameters=False)
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525541990/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error: Duplicate GPU detected : rank 0 and rank 1 both on CUDA device b3000

(The second rank prints the same traceback, with "rank 1 and rank 0 both on CUDA device b3000" as the last error.)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 4019226) of binary: /home/sunzhaojie/.conda/envs/mot13/bin/python
Traceback (most recent call last):
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures: [1]: time : 2023-12-15_19:59:43 host : ubuntu-Precision-7920-Tower rank : 1 (local_rank: 1) exitcode : 1 (pid: 4019227) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure): [0]: time : 2023-12-15_19:59:43 host : ubuntu-Precision-7920-Tower rank : 0 (local_rank: 0) exitcode : 1 (pid: 4019226) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Could you please tell me how to solve this? Thank you.

HELLORPG commented 9 months ago

Can you try this script and see if any error comes out?

python main.py --mode submit --config-path /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/train/config.yaml --submit-dir /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/ --submit-model memotr_mot17.pth --data-root /home/sunzhaojie/MeMOTR/dataset/

Different from your previous script, it also removes --use-distributed, so inference runs on a single GPU without DDP.
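If you later want to run submission through the launcher on your two 3090s: the "Duplicate GPU detected" error means both ranks were bound to the same CUDA device (device_ids=[distributed_rank()] received the same index in both processes). Below is a minimal sketch of the usual torchrun/DDP binding pattern, as a generic example rather than the exact code in this repo:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torch.distributed.run / torchrun sets LOCAL_RANK for every worker it spawns.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)               # each rank must claim a different GPU
    dist.init_process_group(backend="nccl")         # reads RANK/WORLD_SIZE/MASTER_* from the env

    model = torch.nn.Linear(8, 8).cuda(local_rank)  # stand-in for the real model
    model = DDP(model, device_ids=[local_rank])     # device_ids must differ across ranks

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched with python -m torch.distributed.run --nproc_per_node=2 <script>.py, rank 0 uses cuda:0 and rank 1 uses cuda:1; if both ranks end up with the same device index, NCCL raises exactly the duplicate-GPU error shown above.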

Tobelakers commented 9 months ago

Thank you very much for the command. It is now able to run, but the following error occurred during execution:

(mot13) sunzhaojie@ubuntu-Precision-7920-Tower:~/MeMOTR$ python main.py --mode submit --config-path /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/train/config.yaml --submit-dir /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/ --submit-model memotr_mot17.pth --data-root /home/sunzhaojie/MeMOTR/dataset/

Configs: {'ACCUMULATION_STEPS': 1, 'ACTIVATION': 'ReLU', 'AUX_LOSS': True, 'AUX_LOSS_WEIGHT': [1.0, 1.0, 1.0, 1.0, 1.0], 'AVAILABLE_GPUS': '0,1', 'BACKBONE': 'resnet50', 'BATCH_SIZE': 1, 'CHECKPOINT_LEVEL': 2, 'CLIP_MAX_NORM': 0.1, 'COCO_SIZE': True, 'CONFIG_PATH': '/home/sunzhaojie/MeMOTR/outputs/memotr_mot17/train/config.yaml', 'DATASET': 'MOT17', 'DATA_PATH': None, 'DATA_ROOT': '/home/sunzhaojie/MeMOTR/dataset/', 'DET_SCORE_THRESH': 0.5, 'DEVICE': 'cuda:1', 'DROPOUT': 0.0, 'EPOCHS': 130, 'EVAL_DATA_SPLIT': 'val', 'EVAL_DIR': None, 'EVAL_MODE': 'specific', 'EVAL_MODEL': None, 'EVAL_PORT': None, 'EVAL_THREADS': 1, 'EXTRA_TRACK_ATTN': False, 'FFN_DIM': 2048, 'FP_INSERT_RATE': 0.0, 'GIT_VERSION': None, 'HIDDEN_DIM': 256, 'LONG_MEMORY_LAMBDA': 0.01, 'LOSS_WEIGHT_FOCAL': 2, 'LOSS_WEIGHT_GIOU': 2, 'LOSS_WEIGHT_L1': 5, 'LR': 0.0002, 'LR_BACKBONE': 2e-05, 'LR_DROP_MILESTONES': [120], 'LR_DROP_RATE': 0.1, 'LR_POINTS': 2e-05, 'LR_SCHEDULER': 'MultiStep', 'MATCH_COST_BBOX': 5, 'MATCH_COST_CLASS': 2, 'MATCH_COST_GIOU': 2, 'MERGE_DET_TRACK_LAYER': 1, 'MISS_TOLERANCE': 15, 'MODE': 'submit', 'MOTION_LAMBDA': 0.5, 'MOTION_MAX_LENGTH': 5, 'MOTION_MIN_LENGTH': 3, 'MOTSYNTH_RATE': None, 'MULTI_CHECKPOINT': False, 'NUM_DEC_LAYERS': 6, 'NUM_DEC_POINTS': 4, 'NUM_DET_QUERIES': 300, 'NUM_ENC_LAYERS': 6, 'NUM_ENC_POINTS': 4, 'NUM_FEATURE_LEVELS': 4, 'NUM_HEADS': 8, 'NUM_WORKERS': 4, 'ONLY_TRAIN_QUERY_UPDATER_AFTER': 130, 'OUTPUTS_DIR': '/home/sunzhaojie/MeMOTR/outputs/MOT17', 'OVERFLOW_BBOX': True, 'PRETRAINED_MODEL': 'dab_deformable_detr.pth', 'RESULT_SCORE_THRESH': 0.5, 'RESUME': None, 'RESUME_SCHEDULER': True, 'RETURN_INTER_DEC': True, 'REVERSE_CLIP': 0.0, 'SAMPLE_INTERVALS': [10], 'SAMPLE_LENGTHS': [2, 3, 4, 5], 'SAMPLE_MODES': ['random_interval'], 'SAMPLE_MOT17_JOIN': 0, 'SAMPLE_STEPS': [60, 100], 'SEED': 42, 'SUBMIT_DATA_SPLIT': 'test', 'SUBMIT_DIR': '/home/sunzhaojie/MeMOTR/outputs/memotr_mot17/', 'SUBMIT_MODEL': 'memotr_mot17.pth', 'TP_DROP_RATE': 0.0, 'TRACK_SCORE_THRESH': 0.5, 'UPDATE_THRESH': 0.5, 'USE_CHECKPOINT': False, 'USE_CROWDHUMAN': None, 'USE_DAB': True, 'USE_DISTRIBUTED': False, 'USE_MOTION': False, 'USE_MOTSYNTH': None, 'VISUALIZE': False, 'WEIGHT_DECAY': 0.0001} Submit seq: MOT17-12-SDP: 0%| | 0/900 [00:00<?, ?it/s]/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1670525541990/work/aten/src/ATen/native/TensorShape.cpp:3190.) 
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]

Submit seq: MOT17-12-SDP:   9%|█████▎    | 85/900 [00:16<02:37,  5.16it/s]

Traceback (most recent call last):
  File "/home/sunzhaojie/MeMOTR/main.py", line 121, in <module>
    main(config=merged_config)
  File "/home/sunzhaojie/MeMOTR/main.py", line 106, in main
    submit(config=config)
  File "/home/sunzhaojie/MeMOTR/submit_engine.py", line 223, in submit
    submitter.run()
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/sunzhaojie/MeMOTR/submit_engine.py", line 73, in run
    res = self.model(frame=frame, tracks=tracks)
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sunzhaojie/MeMOTR/models/memotr.py", line 133, in forward
    outputs, init_reference, inter_references, inter_queries = self.transformer(
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sunzhaojie/MeMOTR/models/deformable_transformer.py", line 225, in forward
    memory = checkpoint(self.encoder, src_flatten, spatial_shapes, level_start_index,
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 251, in checkpoint
    return _checkpoint_without_reentrant(
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 420, in _checkpoint_without_reentrant
    output = function(*args, **kwargs)
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sunzhaojie/MeMOTR/models/deformable_encoder.py", line 59, in forward
    output = layer(output, pos, reference_points, spatial_shapes, level_start_index, padding_mask)
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sunzhaojie/MeMOTR/models/deformable_encoder.py", line 124, in forward
    src2 = self.self_attn(self.with_pos_embed(src, pos),
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sunzhaojie/MeMOTR/models/ops/modules/ms_deform_attn.py", line 129, in forward
    output = self.output_proj(output)
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())

Is this error caused by my graphics card running out of memory? Or do I need to modify something? Thank you for taking the time to answer my questions!

HELLORPG commented 9 months ago

For inference, our model usually needs about 2 GB of CUDA memory. Is there any other program running on the same GPU? You can also use a tool such as gpustat -i to monitor CUDA memory usage.

BTW, do you have a memory (RAM) overflow problem (you can use htop to monitor your RAM usage)?
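For example, a quick generic check from Python (not tied to MeMOTR) that prints the free/total memory on the GPU used for submission and then runs a single small matmul there, to see whether even a plain cuBLAS call fails on that device:

import torch

device = torch.device("cuda:1")  # the DEVICE shown in your config printout above
free_b, total_b = torch.cuda.mem_get_info(device)
print(f"free: {free_b / 1024**3:.1f} GiB / total: {total_b / 1024**3:.1f} GiB")

# A tiny matmul exercises cuBLAS on the same device; if this already fails,
# the problem is in the driver/CUDA setup rather than in MeMOTR itself.
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
print((a @ b).sum().item())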

HELLORPG commented 8 months ago

As I haven't received your reply for a long time, I am closing this issue temporarily. You can re-open this issue if you need~