dinger-ai / dingervod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
http://horovod.ai

Custom Docker File Issues #1

Open minsub0922 opened 1 year ago

minsub0922 commented 1 year ago

Docker File Error

root@1dd007c03d48:/# horovodrun --gloo  -np 1 -H localhost:1 python horovod/examples/pytorch/pytorch_mnist.py
[0]<stderr>:/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:145: UserWarning: 
[0]<stderr>:NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
[0]<stderr>:The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
[0]<stderr>:If you want to use the NVIDIA GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
[0]<stderr>:
[0]<stderr>:  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO Failed to open libibverbs.so[.1]
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO Using network Socket
[0]<stdout>:NCCL version 2.12.12+cuda11.6
[0]<stdout>:
[0]<stdout>:1dd007c03d48:1140:1206 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:913 -> 1
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:950 -> 1
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:963 -> 1
[0]<stderr>:Traceback (most recent call last):
[0]<stdout>:
[0]<stderr>:  File "/usr/local/lib/python3.8/dist-packages/horovod/torch/mpi_ops.py", line 1285, in synchronize
[0]<stdout>:1dd007c03d48:1140:1206 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:913 -> 1
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:950 -> 1
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:963 -> 1
[0]<stdout>:
[0]<stdout>:1dd007c03d48:1140:1206 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:913 -> 1
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:950 -> 1
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:963 -> 1
[0]<stdout>:
[0]<stdout>:1dd007c03d48:1140:1206 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:913 -> 1
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:950 -> 1
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:963 -> 1
[0]<stderr>:    mpi_lib.horovod_torch_wait_and_clear(handle)
[0]<stderr>:RuntimeError: ncclCommInitRank failed: unhandled cuda error
[0]<stderr>:
[0]<stderr>:During handling of the above exception, another exception occurred:
[0]<stderr>:
[0]<stderr>:Traceback (most recent call last):
[0]<stderr>:  File "horovod/examples/pytorch/pytorch_mnist.py", line 263, in <module>
[0]<stdout>:
[0]<stdout>:1dd007c03d48:1140:1206 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:913 -> 1
[0]<stderr>:    main(args)
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:950 -> 1
[0]<stderr>:  File "horovod/examples/pytorch/pytorch_mnist.py", line 222, in main
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:963 -> 1
[0]<stderr>:    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
[0]<stderr>:  File "/usr/local/lib/python3.8/dist-packages/horovod/torch/functions.py", line 59, in broadcast_parameters
[0]<stdout>:
[0]<stdout>:1dd007c03d48:1140:1206 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
[0]<stderr>:    synchronize(handle)
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:913 -> 1
[0]<stderr>:  File "/usr/local/lib/python3.8/dist-packages/horovod/torch/mpi_ops.py", line 1290, in synchronize
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:950 -> 1
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:963 -> 1
[0]<stdout>:
[0]<stdout>:1dd007c03d48:1140:1206 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:913 -> 1
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:950 -> 1
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:963 -> 1
[0]<stdout>:
[0]<stderr>:    raise HorovodInternalError(e)
[0]<stdout>:1dd007c03d48:1140:1206 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
[0]<stderr>:horovod.common.exceptions.HorovodInternalError: ncclCommInitRank failed: unhandled cuda error
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:913 -> 1
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:950 -> 1
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:963 -> 1
[0]<stdout>:
[0]<stdout>:1dd007c03d48:1140:1206 [0] misc/argcheck.cc:30 NCCL WARN ncclGetAsyncError : comm argument is NULL
[0]<stdout>:1dd007c03d48:1140:1206 [0] NCCL INFO init.cc:1084 -> 4
Process 0 exit with status code 1.
Terminating remaining workers after failure of Process 0.
Traceback (most recent call last):
  File "/usr/local/bin/horovodrun", line 8, in <module>
    sys.exit(run_commandline())
  File "/usr/local/lib/python3.8/dist-packages/horovod/runner/launch.py", line 837, in run_commandline
    _run(args)
  File "/usr/local/lib/python3.8/dist-packages/horovod/runner/launch.py", line 827, in _run
    return _run_static(args)
  File "/usr/local/lib/python3.8/dist-packages/horovod/runner/launch.py", line 685, in _run_static
    _launch_job(args, settings, nics, command)
  File "/usr/local/lib/python3.8/dist-packages/horovod/runner/launch.py", line 800, in _launch_job
    run_controller(args.use_gloo, gloo_run_fn,
  File "/usr/local/lib/python3.8/dist-packages/horovod/runner/launch.py", line 754, in run_controller
    gloo_run()
  File "/usr/local/lib/python3.8/dist-packages/horovod/runner/launch.py", line 792, in gloo_run_fn
    gloo_run(settings, nics, env, driver_ip, command)
  File "/usr/local/lib/python3.8/dist-packages/horovod/runner/gloo_run.py", line 300, in gloo_run
    launch_gloo(command, exec_command, settings, nics, env, server_ip)
  File "/usr/local/lib/python3.8/dist-packages/horovod/runner/gloo_run.py", line 284, in launch_gloo
    raise RuntimeError('Horovod detected that one or more processes exited with non-zero '
RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: 0
Exit code: 1
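
The log above interleaves the worker's stdout (NCCL messages) with its stderr (the Python traceback), which makes the traceback hard to follow. A minimal sketch for separating the two streams and counting the distinct NCCL warnings, assuming only the `[rank]<stream>:` line prefix visible in this output; the names `split_streams` and `nccl_warnings` are hypothetical helpers, not part of Horovod:

```python
import re
from collections import Counter

# Prefix that horovodrun puts on each worker line in the logs of this issue,
# e.g. "[0]<stderr>:..." (Gloo run) or "[1,0]<stdout>:..." (MPI run).
PREFIX = re.compile(r"^\[(?P<rank>[\d,]+)\]<(?P<stream>stdout|stderr)>:(?P<text>.*)$")


def split_streams(lines):
    """Group worker output by stream; lines without the prefix are skipped."""
    streams = {"stdout": [], "stderr": []}
    for line in lines:
        match = PREFIX.match(line.rstrip("\n"))
        if match:
            streams[match.group("stream")].append(match.group("text"))
    return streams


def nccl_warnings(stdout_lines):
    """Count each distinct message that follows 'NCCL WARN '."""
    counts = Counter()
    for text in stdout_lines:
        _, marker, message = text.partition("NCCL WARN ")
        if marker:
            counts[message] += 1
    return counts


# A few lines copied from the logs in this issue.
sample = [
    "[0]<stdout>:1dd007c03d48:1140:1206 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'",
    "[0]<stderr>:Traceback (most recent call last):",
    "[0]<stderr>:RuntimeError: ncclCommInitRank failed: unhandled cuda error",
    "[1,0]<stdout>:1dd007c03d48:5078:5144 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'",
    "[1,0]<stdout>:1dd007c03d48:5078:5144 [0] misc/argcheck.cc:30 NCCL WARN ncclGetAsyncError : comm argument is NULL",
    "Process 0 exit with status code 1.",
]

streams = split_streams(sample)
print("\n".join(streams["stderr"]))
for message, count in nccl_warnings(streams["stdout"]).items():
    print(count, message)
```

To run it on a full capture, read the saved log with `open(path).readlines()` and pass the result to `split_streams`.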
minsub0922 commented 1 year ago

Docker File Error

root@1dd007c03d48:/horovod# horovodrun -np 1 -H localhost:1 python examples/pytorch/pytorch_mnist.py 
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO Failed to open libibverbs.so[.1]
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO Using network Socket
[1,0]<stdout>:NCCL version 2.12.12+cuda11.6
[1,0]<stdout>:
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:913 -> 1
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:950 -> 1
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:963 -> 1
[1,0]<stdout>:
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:913 -> 1
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:950 -> 1
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:963 -> 1
[1,0]<stdout>:
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:913 -> 1
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:950 -> 1
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:963 -> 1
[1,0]<stdout>:
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:913 -> 1
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:950 -> 1
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:963 -> 1
[1,0]<stdout>:
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:913 -> 1
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:950 -> 1
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:963 -> 1
[1,0]<stdout>:
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:913 -> 1
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:950 -> 1
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:963 -> 1
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "/usr/local/lib/python3.8/dist-packages/horovod/torch/mpi_ops.py", line 1285, in synchronize
[1,0]<stderr>:    mpi_lib.horovod_torch_wait_and_clear(handle)
[1,0]<stderr>:RuntimeError: ncclCommInitRank failed: unhandled cuda error
[1,0]<stderr>:
[1,0]<stderr>:During handling of the above exception, another exception occurred:
[1,0]<stderr>:
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "examples/pytorch/pytorch_mnist.py", line 263, in <module>
[1,0]<stderr>:    main(args)
[1,0]<stderr>:  File "examples/pytorch/pytorch_mnist.py", line 222, in main
[1,0]<stderr>:    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
[1,0]<stderr>:  File "/usr/local/lib/python3.8/dist-packages/horovod/torch/functions.py", line 59, in broadcast_parameters
[1,0]<stderr>:    synchronize(handle)
[1,0]<stderr>:  File "/usr/local/lib/python3.8/dist-packages/horovod/torch/mpi_ops.py", line 1290, in synchronize
[1,0]<stderr>:    raise HorovodInternalError(e)
[1,0]<stderr>:horovod.common.exceptions.HorovodInternalError: ncclCommInitRank failed: unhandled cuda error
[1,0]<stdout>:
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:913 -> 1
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:950 -> 1
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:963 -> 1
[1,0]<stdout>:
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:913 -> 1
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:950 -> 1
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:963 -> 1
[1,0]<stdout>:
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] misc/argcheck.cc:30 NCCL WARN ncclGetAsyncError : comm argument is NULL
[1,0]<stdout>:1dd007c03d48:5078:5144 [0] NCCL INFO init.cc:1084 -> 4
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[5094,1],0]
  Exit code:    1
--------------------------------------------------------------------------
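
Both runs fail with `Cuda failure 'CUDA driver is a stub library'`, which means the `libcuda.so` that got loaded was the link-time stub shipped with the CUDA toolkit rather than the real driver library. The Dockerfile is not shown in this issue, so the following is only a sketch of one thing worth checking, under the assumption that a `stubs` directory (for example `/usr/local/cuda/lib64/stubs`) was left on the library search path after the build. `stub_entries` is a hypothetical helper, and the example path string is illustrative, not taken from the container:

```python
from pathlib import PurePosixPath


def stub_entries(search_path):
    """Return entries of a colon-separated search path that contain a 'stubs' directory."""
    entries = [entry for entry in search_path.split(":") if entry]
    return [entry for entry in entries if "stubs" in PurePosixPath(entry).parts]


# Illustrative value only; the real value would come from the container.
example = "/usr/local/cuda/lib64/stubs:/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu"
print(stub_entries(example))
```

Inside the container it could be called as `stub_entries(os.environ.get("LD_LIBRARY_PATH", ""))` after `import os`. An empty result does not rule the stub out, because it may also have been registered in the dynamic linker cache; `ldconfig -p | grep libcuda` lists what the cache resolves.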

Versions

minsub0922 commented 3 weeks ago
(image attachment)