NVIDIA / FasterTransformer

Transformer related optimization, including BERT, GPT
Apache License 2.0

error with mpirun #449

Open lambda7xx opened 1 year ago

lambda7xx commented 1 year ago

Branch/Tag/Commit

9b6d718b52f10f08a810c0885e070789e462102b

Docker Image Version

nvcr.io/nvidia/pytorch:22.09-py3

GPU name

V100

CUDA Driver

Driver Version: 510.73.08

Reproduced Steps

1. I use the conversion script to convert my model (a sanity-check sketch follows the two commands):

python ../examples/pytorch/gpt/utils/huggingface_opt_convert.py \
      -i opt-6.7b/ \
      -o opt-6.7b/c-model/ \
      -i_g 4 \
      -processes  8 \
      -weight_data_type fp16

python ../examples/pytorch/gpt/utils/huggingface_opt_convert.py \
      -i opt-6.7b/ \
      -o opt-6.7b/c-model/ \
      -i_g 8 \
      -processes  8 \
      -weight_data_type fp16
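
For reference, a minimal sketch (my addition, not part of the original report) to confirm the conversion produced per-GPU weight shards; the "4-gpu" subdirectory name is assumed from the ckpt_path used in step 2:

import os

# Hypothetical path, mirroring the -o and -i_g 4 values above (assumption).
ckpt_dir = "opt-6.7b/c-model/4-gpu"

assert os.path.isdir(ckpt_dir), f"missing checkpoint directory: {ckpt_dir}"
# Count the converted weight files; an empty directory means the conversion failed.
bins = [f for f in os.listdir(ckpt_dir) if f.endswith(".bin")]
print(f"{len(bins)} weight shards found in {ckpt_dir}")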

2. Then I use this script to run the example (see the note after the command):


mpirun -n 4  --allow-run-as-root python3 ../examples/pytorch/gpt/multi_gpu_gpt_example.py \
  --tensor_para_size 2 \
  --pipeline_para_size 2 \
  --layer_num 32 \
  --input_len 32 \
  --head_num 32 \
  --size_per_head 128 \
  --weights_data_type "fp16" \
  --max_seq_len 2048 \
  --vocab_size 50272 \
  --vocab_file ../models/gpt2-vocab.json \
  --merges_file ../models/gpt2-merges.txt \
  --ckpt_path="/home/aiscuser/FasterTransformer/build/opt-6.7b/c-model/4-gpu"   >  4gpu.log  2>&1
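
As a side note (my addition, not from the issue): the example expects the number of MPI ranks to equal tensor_para_size * pipeline_para_size, which the command above satisfies (4 = 2 * 2). A minimal sketch of that consistency check:

# Minimal sketch of the relation the launch command is expected to satisfy.
tensor_para_size = 2
pipeline_para_size = 2
mpirun_ranks = 4  # value passed to `mpirun -n`

assert mpirun_ranks == tensor_para_size * pipeline_para_size, \
    "mpirun -n should equal tensor_para_size * pipeline_para_size"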

3. The error is:


=================================================

Initializing tensor and pipeline parallel...
Traceback (most recent call last):
  File "../examples/pytorch/gpt/multi_gpu_gpt_example.py", line 364, in <module>
    main()
  File "/home/aiscuser/.local/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "../examples/pytorch/gpt/multi_gpu_gpt_example.py", line 219, in main
    comm.initialize_model_parallel(args.tensor_para_size, args.pipeline_para_size)
  File "/home/aiscuser/FasterTransformer/examples/pytorch/gpt/../../../examples/pytorch/gpt/utils/comm.py", line 86, in initialize_model_parallel
    dist.init_process_group(backend=backend)
  File "/home/aiscuser/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 744, in init_process_group
    default_pg = _new_process_group_helper(
  File "/home/aiscuser/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 837, in _new_process_group_helper
    raise RuntimeError(
RuntimeError: Distributed package doesn't have MPI built in. MPI is only included if you build PyTorch from source on a host that has MPI installed.
(The same traceback is printed by each of the other three MPI ranks.)
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[10072,1],0]

I used the official image to build it.
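
One way to narrow this down (my sketch, not something stated in the thread): the traceback imports torch from ~/.local/lib/python3.8/site-packages rather than the container's build, so it is worth confirming which torch each rank actually loads and whether it was compiled with the MPI backend:

import torch
import torch.distributed as dist

# Which torch is actually imported, and which distributed backends it ships.
print("torch location:", torch.__file__)
print("torch version:", torch.__version__)
print("MPI backend available:", dist.is_mpi_available())
print("NCCL backend available:", dist.is_nccl_available())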

byshiue commented 1 year ago

Can you run any MPI program inside and outside the Docker container?
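
A minimal MPI smoke test for that (my sketch, assuming mpi4py is installed in the container), saved as e.g. mpi_check.py and launched with mpirun -n 2 --allow-run-as-root python3 mpi_check.py:

from mpi4py import MPI

# Each rank reports its identity; if this hangs or crashes, MPI itself is broken,
# independently of PyTorch.
comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()} of {comm.Get_size()} on {MPI.Get_processor_name()}")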

lambda7xx commented 1 year ago

Thanks, I fixed the problem.

alexngng commented 1 year ago

I encountered a similar problem. How did you fix it? RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
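
That error usually means dist.init_process_group() was never successfully called before the script touched the default process group. A minimal sketch (my assumption, not the confirmed fix from this thread) of initializing it explicitly with the NCCL backend when torch lacks MPI support; RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT must be exported for each process:

import os
import torch.distributed as dist

# Fall back to NCCL (or gloo) when torch was not built with MPI.
# RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are assumed to be set per process.
dist.init_process_group(
    backend="nccl",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)
print("initialized rank", dist.get_rank(), "of", dist.get_world_size())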

chaizhongming commented 1 year ago

@lambda7xx how did you fix it?