Sense-X / UniFormer

[ICLR2022] official implementation of UniFormer
Apache License 2.0

RuntimeError [video classification] #61

Closed Auroralyxa closed 2 years ago

Auroralyxa commented 2 years ago

Using 5 P100 (16G) GPUs to train the model, I get a partial error as below:

`[06/13 08:50:33][INFO] logging.py:  99: json_stats: {"_type": "train_iter", "dt": 3.81263, "dt_data": 0.00393, "dt_net": 3.80870, "epoch": "1/100", "eta": "105 days, 15:00:37", "gpu_mem": "2.14G", "iter": "10770/24044", "loss": 5.19564, "lr": 0.00002, "top1_err": 87.50000, "top5_err": 72.50000}
INFO:slowfast.utils.logging:json_stats: {"_type": "train_iter", "dt": 4.11339, "dt_data": 0.00198, "dt_net": 4.11140, "epoch": "1/100", "eta": "113 days, 22:58:08", "gpu_mem": "2.13G", "iter": "10780/24044", "loss": 4.69862, "lr": 0.00002, "top1_err": 85.00000, "top5_err": 80.00000}
INFO:slowfast.utils.logging:json_stats: {"_type": "train_iter", "dt": 4.11334, "dt_data": 0.00396, "dt_net": 4.10938, "epoch": "1/100", "eta": "113 days, 22:56:03", "gpu_mem": "2.13G", "iter": "10780/24044", "loss": 4.69862, "lr": 0.00002, "top1_err": 85.00000, "top5_err": 80.00000}
INFO:slowfast.utils.logging:json_stats: {"_type": "train_iter", "dt": 4.11339, "dt_data": 0.00394, "dt_net": 4.10944, "epoch": "1/100", "eta": "113 days, 22:58:06", "gpu_mem": "2.13G", "iter": "10780/24044", "loss": 4.69862, "lr": 0.00002, "top1_err": 85.00000, "top5_err": 80.00000}
INFO:slowfast.utils.logging:json_stats: {"_type": "train_iter", "dt": 4.11337, "dt_data": 0.00377, "dt_net": 4.10960, "epoch": "1/100", "eta": "113 days, 22:57:26", "gpu_mem": "2.13G", "iter": "10780/24044", "loss": 4.69862, "lr": 0.00002, "top1_err": 85.00000, "top5_err": 80.00000}
[06/13 08:51:12][INFO] logging.py:  99: json_stats: {"_type": "train_iter", "dt": 4.11339, "dt_data": 0.00390, "dt_net": 4.10949, "epoch": "1/100", "eta": "113 days, 22:58:22", "gpu_mem": "2.14G", "iter": "10780/24044", "loss": 4.69862, "lr": 0.00002, "top1_err": 85.00000, "top5_err": 80.00000}
Traceback (most recent call last):
  File "tools/run_net.py", line 31, in <module>
    main()
  File "tools/run_net.py", line 23, in main
    launch_job(cfg=cfg, init_method=args.init_method, func=train)
  File "G:\lyx\UniFormer-main\video_classification\tools\slowfast\utils\misc.py", line 296, in launch_job
    torch.multiprocessing.spawn(
  File "E:\lyx\anaconda\envs\uniformer\lib\site-packages\torch\multiprocessing\spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "E:\lyx\anaconda\envs\uniformer\lib\site-packages\torch\multiprocessing\spawn.py", line 188, in start_processes
    while not context.join():
  File "E:\lyx\anaconda\envs\uniformer\lib\site-packages\torch\multiprocessing\spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 4 terminated with the following error:
Traceback (most recent call last):
  File "E:\lyx\anaconda\envs\uniformer\lib\site-packages\torch\multiprocessing\spawn.py", line 59, in _wrap
    fn(i, *args)
  File "G:\lyx\UniFormer-main\video_classification\tools\slowfast\utils\multiprocessing.py", line 60, in run
    ret = func(cfg)
  File "G:\lyx\UniFormer-main\video_classification\tools\train_net.py", line 485, in train
    train_epoch(
  File "G:\lyx\UniFormer-main\video_classification\tools\train_net.py", line 104, in train_epoch
    loss_scaler(loss, optimizer, clip_grad=cfg.SOLVER.CLIP_GRADIENT, parameters=model.parameters(), create_graph=is_second_order)
  File "E:\lyx\anaconda\envs\uniformer\lib\site-packages\timm\utils\cuda.py", line 43, in __call__
    self._scaler.scale(loss).backward(create_graph=create_graph)
  File "E:\lyx\anaconda\envs\uniformer\lib\site-packages\torch\_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "E:\lyx\anaconda\envs\uniformer\lib\site-packages\torch\autograd\__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: [..\third_party\gloo\gloo\transport\uv\unbound_buffer.cc:67] Timed out waiting 1800000ms for recv operation to complete`

How can I solve this error? Do I need to adjust any parameters? Thanks in advance.

Andy1621 commented 2 years ago

What is your PyTorch version? Are you using PyTorch >= 1.9 as we recommended?

Auroralyxa commented 2 years ago

Yes, torch 1.10.0, CUDA 10.2.

Andy1621 commented 2 years ago

Maybe the P100 does not support some operations? I have found that some people also hit this problem when using P100s with other repos. I have only run the code on V100 and A100 GPUs, and I have never met this problem...
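One workaround sometimes suggested for this kind of Gloo timeout (an assumption on my side, not something verified in this thread) is to raise the process-group timeout when initializing distributed training, or to switch the backend to NCCL on CUDA GPUs. The `1800000ms` in the traceback matches `torch.distributed`'s default 30-minute collective timeout. A minimal single-process sketch:

```python
import datetime
import os

import torch.distributed as dist

# Hypothetical sketch (not the repo's own code): raise the collective
# timeout above the 30-minute default seen in the traceback.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(
    backend="gloo",   # on CUDA-capable GPUs, "nccl" is usually preferred
    world_size=1,     # single process, just for this sketch
    rank=0,
    timeout=datetime.timedelta(hours=2),  # default is 30 minutes
)
was_initialized = dist.is_initialized()
dist.destroy_process_group()
```

Whether a longer timeout actually helps depends on why one rank stalls; if a process has crashed rather than merely fallen behind, the timeout only delays the failure.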

Auroralyxa commented 2 years ago

OK, thanks for your reply.

Andy1621 commented 2 years ago

As there is no more activity, I am closing this issue. Don't hesitate to reopen it if necessary.