KimMeen / Time-LLM

[ICLR 2024] Official implementation of "🦙 Time-LLM: Time Series Forecasting by Reprogramming Large Language Models"
https://arxiv.org/abs/2310.01728
Apache License 2.0

Error in `outputs, batch_y = accelerator.gather_for_metrics((outputs, batch_y))` #86

Closed (well0203 closed 2 weeks ago)

well0203 commented 1 month ago

Hi, has anyone encountered this error? Training runs fine, but during validation, in the call

`vali_loss, vali_mae_loss = vali(args, accelerator, model, vali_data, vali_loader, criterion, mae_metric)`

the line `outputs, batch_y = accelerator.gather_for_metrics((outputs, batch_y))` raises the error below. (I already tried increasing batch_size to 64 and 128.)

File ".local/lib/python3.11/site-packages/accelerate/accelerator.py", line 2242, in gather_for_metrics data = self.gather(input_data) ^^^^^^^^^^^^^^^^^^^^^^^ File ".local/lib/python3.11/site-packages/accelerate/accelerator.py", line 2205, in gather return gather(tensor) ^^^^^^^^^^^^^^ File ".local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 378, in wrapper return function(*args, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File ".local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 439, in gather return _gpu_gather(tensor) ^^^^^^^^^^^^^^^^^^^ File ".local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 358, in _gpu_gather return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File ".local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 107, in recursively_apply return honor_type( ^^^^^^^^^^^ File ".local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 81, in honor_type return type(obj)(generator) ^^^^^^^^^^^^^^^^^^^^ File ".local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 110, in recursively_apply( File ".local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 126, in recursively_apply return func(data, *args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File ".local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 355, in _gpu_gather_one torch.distributed.all_gather(output_tensors, tensor) File ".local/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper return func(args, kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/.local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2615, in all_gather work = default_pg.allgather([tensor_list], [tensor]) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3 ncclUnhandledCudaError: Call to CUDA function failed. Last error: Failed to CUDA calloc async 24 bytes

kwuking commented 1 month ago

Hi, this error is usually caused by a misconfigured CUDA device environment or an inconsistent number of devices. Our default script requires 8 A100 GPUs to run. Do you have that many CUDA devices in your environment?
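For a quick sanity check, a minimal sketch (illustrative, not part of the repo; `EXPECTED_GPUS` reflects the 8×A100 default mentioned above) that compares the visible devices against the expected launch configuration:

```python
import torch

# Assumption: the default scripts launch one process per GPU on 8 x A100.
# If fewer devices are visible, --num_processes passed to `accelerate launch`
# should be lowered to match the actual device count.
EXPECTED_GPUS = 8
visible = torch.cuda.device_count()

if visible < EXPECTED_GPUS:
    print(
        f"Only {visible} GPU(s) visible, but the default scripts assume "
        f"{EXPECTED_GPUS}; try passing --num_processes {visible} to accelerate launch."
    )
```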