aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

[Optimum-neuron]T5 tensor parallel official example not working as expected #851

Open JingyaHuang opened 3 months ago

JingyaHuang commented 3 months ago

Hi team, I am trying to add tensor parallel (TP) support for T5-like models. However, when I ran the official T5 TP example, tracing failed. Could the team help me understand how to solve this, so that we can continue the work of adding TP support for T5 models?

Error log:

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.53it/s]
starting encoder parallel trace
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.55it/s]
starting encoder parallel trace
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/usr/lib/python3.8/runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/usr/lib/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/t5_tp/t5_inf.py", line 19, in <module>
    traced_encoder = t5_models.parallel_trace_encoder(model_name, max_length, num_beams, tp_degree)
  File "/home/ubuntu/t5_tp/t5_models.py", line 67, in parallel_trace_encoder
    traced_encoder = neuronx_distributed.trace.parallel_model_trace(get_encoder_callable, (
  File "/home/ubuntu/pyvenv/aws_neuron_venv_2.17/lib/python3.8/site-packages/neuronx_distributed/trace/trace.py", line 152, in parallel_model_trace
    manager = ctx.Manager()
  File "/usr/lib/python3.8/multiprocessing/context.py", line 57, in Manager
    m.start()
  File "/usr/lib/python3.8/multiprocessing/managers.py", line 579, in start
    self._process.start()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
Traceback (most recent call last):
  File "t5_inf.py", line 19, in <module>
    traced_encoder = t5_models.parallel_trace_encoder(model_name, max_length, num_beams, tp_degree)
  File "/home/ubuntu/t5_tp/t5_models.py", line 67, in parallel_trace_encoder
    traced_encoder = neuronx_distributed.trace.parallel_model_trace(get_encoder_callable, (
  File "/home/ubuntu/pyvenv/aws_neuron_venv_2.17/lib/python3.8/site-packages/neuronx_distributed/trace/trace.py", line 152, in parallel_model_trace
    manager = ctx.Manager()
  File "/usr/lib/python3.8/multiprocessing/context.py", line 57, in Manager
    m.start()
  File "/usr/lib/python3.8/multiprocessing/managers.py", line 583, in start
    self._address = reader.recv()
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

My environment setup:

Platform:

- Platform: Linux-5.15.0-1053-aws-x86_64-with-glibc2.29
- Python version: 3.8.10

Python packages:

- `optimum-neuron` version: 0.0.20.dev0
- `neuron-sdk` version: 2.17.0
- `optimum` version: 1.17.1
- `transformers` version: 4.36.2
- `huggingface_hub` version: 0.20.3
- `torch` version: 1.13.1+cu117
- `aws-neuronx-runtime-discovery` version: 2.9
- `libneuronxla` version: 0.5.809
- `neuronx-cc` version: 2.12.68.0+4480452af
- `neuronx-distributed` version: 0.6.0
- `neuronx-hwm` version: 2.12.0.0+422c9037c
- `torch-neuronx` version: 1.13.1.1.13.1
- `torch-xla` version: 1.13.1+torchneurond
- `transformers-neuronx` version: 0.9.474

Neuron Driver:

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

aws-neuronx-dkms/unknown,now 2.15.9.0 amd64 [installed]
aws-neuronx-runtime-lib/unknown,now 2.20.11.0-b7d33e68b amd64 [installed]
aws-neuronx-tools/unknown,now 2.17.0.0 amd64 [installed]

Thanks team!

aws-taylor commented 3 months ago

Hello @JingyaHuang,

One of our engineers is working to reproduce the issue. We'll update this issue once we've got more information.

-Taylor

chintanckg commented 2 months ago

@aws-taylor: A gentle reminder about the update on this issue.

jyang-aws commented 2 months ago

@akhil-aws took a look from our side; this could be an implementation issue.

My guess is that "if __name__ == '__main__':" was not used around the multiprocessing code.
The parallel_model_trace API spawns multiple processes to trace and compile the model, and each spawned process re-imports the main script, so unguarded top-level code runs again in every child.

So I recommend running the trace within an "if __name__ == '__main__':" block, or in a function that is only called from such a block.
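
For illustration, here is a minimal standard-library sketch of that idiom. It is not the actual example script: `trace_all` is a hypothetical stand-in for the call to parallel_model_trace, and it only mimics the `ctx.Manager()` step seen in the traceback, which launches a child process using the "spawn" start method.

```python
import multiprocessing as mp


def trace_all(values):
    # Hypothetical stand-in for the tracing entry point. Like the call in the
    # traceback, it starts a Manager, which launches a child process.
    ctx = mp.get_context("spawn")
    with ctx.Manager() as manager:
        shared = manager.list()
        for value in values:
            shared.append(value * 2)
        # Copy into a plain list before the manager shuts down.
        return list(shared)


if __name__ == "__main__":
    # With "spawn", a child re-imports this file under a different module
    # name, so it skips this block and never re-enters trace_all().
    print(trace_all([1, 2, 3]))
```

If the `trace_all(...)` call sat at module top level instead, the child would reach it again while still bootstrapping, which is the situation the RuntimeError above describes.
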

JingyaHuang commented 1 month ago

Hi @jyang-aws, thanks for the suggestion! However, when we relaunched the example following this guidance, the execution hung and timed out. Here is the log:

model-00001-of-00002.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 9.45G/9.45G [00:18<00:00, 521MB/s]
model-00002-of-00002.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 1.95G/1.95G [00:03<00:00, 490MB/s]
Downloading shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:22<00:00, 11.10s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.86it/s]
generation_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 147/147 [00:00<00:00, 63.9kB/s]
starting encoder parallel trace
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.54k/2.54k [00:00<00:00, 1.26MB/s]
spiece.model: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 792k/792k [00:00<00:00, 392MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.20k/2.20k [00:00<00:00, 3.57MB/s]
tokenizer.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.42M/2.42M [00:00<00:00, 44.4MB/s]
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING:root:Unsupported nprocs (8), ignoring...
2024-May-21 16:08:34.103141 58092:58092 ERROR  TDRV:tdrv_init_mla_phase1                    Could not open the nd1

2024-May-21 16:08:34.107083 58092:58092 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.110713 58092:58092 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.114306 58092:58092 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.117930 58092:58092 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.121657 58092:58092 ERROR  TDRV:tdrv_destroy                            TDRV not initialized
2024-May-21 16:08:34.124934 58092:58092 ERROR   NRT:nrt_init                                Failed to initialize devices, error:1
2024-May-21 16:08:34.128752 58092:58092 ERROR   NRT:nrt_infodump                            Neuron runtime information - please include in any support request:
2024-May-21 16:08:34.133451 58092:58092 ERROR   NRT:nrt_infodump                            ------------->8------------[ cut here ]------------>8-------------
2024-May-21 16:08:34.139484 58092:58092 ERROR   NRT:nrt_infodump                            NRT version: 2.20.22.0 (1b3ca64250aae3cc8631db08e8deb1b6b4ea2f88)
2024-May-21 16:08:34.146881 58092:58092 ERROR   NRT:nrt_infodump                            CCOM version: 2.20.22.0-c101c322e940b1 (compat 36)
2024-May-21 16:08:34.153827 58092:58092 ERROR   NRT:nrt_infodump                            Instance ID: i-0975663b229a10fd1
2024-May-21 16:08:34.160210 58092:58092 ERROR   NRT:nrt_infodump                            Cluster ID: N/A
2024-May-21 16:08:34.166120 58092:58092 ERROR   NRT:nrt_infodump                            Kernel: Linux 5.15.0-1058-aws #64~20.04.1-Ubuntu SMP Tue Apr 9 11:12:27 UTC 2024
2024-May-21 16:08:34.174007 58092:58092 ERROR   NRT:nrt_infodump                            Nodename: ip-172-31-33-90
2024-May-21 16:08:34.180223 58092:58092 ERROR   NRT:nrt_infodump                            Driver version: 2.16.7.0

2024-May-21 16:08:34.187890 58098:58098 ERROR  TDRV:tdrv_init_mla_phase1                    Could not open the nd1

2024-May-21 16:08:34.200773 58093:58093 ERROR  TDRV:tdrv_init_mla_phase1                    Could not open the nd2

2024-May-21 16:08:34.200799 58093:58093 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.200816 58093:58093 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.200823 58093:58093 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.200830 58093:58093 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.200927 58093:58093 ERROR  TDRV:tdrv_destroy                            TDRV not initialized
2024-May-21 16:08:34.200941 58093:58093 ERROR   NRT:nrt_init                                Failed to initialize devices, error:1
2024-May-21 16:08:34.200968 58093:58093 ERROR   NRT:nrt_infodump                            Neuron runtime information - please include in any support request:
2024-May-21 16:08:34.200978 58093:58093 ERROR   NRT:nrt_infodump                            ------------->8------------[ cut here ]------------>8-------------
2024-May-21 16:08:34.200988 58093:58093 ERROR   NRT:nrt_infodump                            NRT version: 2.20.22.0 (1b3ca64250aae3cc8631db08e8deb1b6b4ea2f88)
2024-May-21 16:08:34.201001 58093:58093 ERROR   NRT:nrt_infodump                            CCOM version: 2.20.22.0-c101c322e940b1 (compat 36)
2024-May-21 16:08:34.201010 58093:58093 ERROR   NRT:nrt_infodump                            Instance ID: i-0975663b229a10fd1
2024-May-21 16:08:34.201019 58093:58093 ERROR   NRT:nrt_infodump                            Cluster ID: N/A
2024-May-21 16:08:34.201025 58093:58093 ERROR   NRT:nrt_infodump                            Kernel: Linux 5.15.0-1058-aws #64~20.04.1-Ubuntu SMP Tue Apr 9 11:12:27 UTC 2024
2024-May-21 16:08:34.201035 58093:58093 ERROR   NRT:nrt_infodump                            Nodename: ip-172-31-33-90
2024-May-21 16:08:34.201046 58093:58093 ERROR   NRT:nrt_infodump                            Driver version: 2.16.7.0

2024-May-21 16:08:34.201055 58093:58093 ERROR   NRT:nrt_infodump                            Environment:
2024-May-21 16:08:34.201066 58093:58093 ERROR   NRT:nrt_infodump                                NEURON_LIBRARY_PATH=/home/ubuntu/pyvenv/aws_neuron_venv2.18_pt212/lib/python3.8/site-packages/libneuronxla/libneuronpjrt.so
2024-May-21 16:08:34.201080 58093:58093 ERROR   NRT:nrt_infodump                                NEURON_RT_ROOT_COMM_ID=localhost:62182
2024-May-21 16:08:34.201093 58093:58093 ERROR   NRT:nrt_infodump                                NEURON_PJRT_PROCESSES_NUM_DEVICES=1,1,1,1,1,1,1,1
2024-May-21 16:08:34.201101 58093:58093 ERROR   NRT:nrt_infodump                                NEURON_PJRT_PROCESS_INDEX=4
2024-May-21 16:08:34.201112 58093:58093 ERROR   NRT:nrt_infodump                                NEURON_RT_VISIBLE_CORES=4
2024-May-21 16:08:34.201121 58093:58093 ERROR   NRT:nrt_infodump                                NEURON_INTERNAL_PJRT_C_API_VERSION=0.23
2024-May-21 16:08:34.201133 58093:58093 ERROR   NRT:nrt_infodump                            -------------8<-----------[ cut to here ]-----------8<------------
2024-May-21 16:08:34.188282 58092:58092 ERROR   NRT:nrt_infodump                            Environment:
2024-May-21 16:08:34.196323 58098:58098 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.202140 58092:58092 ERROR   NRT:nrt_infodump                                NEURON_LIBRARY_PATH=/home/ubuntu/pyvenv/aws_neuron_venv2.18_pt212/lib/python3.8/site-packages/libneuronxla/libneuronpjrt.so
2024-May-21 16:08:34.221171 58096:58096 ERROR  TDRV:tdrv_init_mla_phase1                    Could not open the nd2

2024-May-21 16:08:34.221195 58096:58096 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.221206 58096:58096 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.221215 58096:58096 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.221225 58096:58096 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.221328 58096:58096 ERROR  TDRV:tdrv_destroy                            TDRV not initialized
2024-May-21 16:08:34.221343 58096:58096 ERROR   NRT:nrt_init                                Failed to initialize devices, error:1
2024-May-21 16:08:34.221368 58096:58096 ERROR   NRT:nrt_infodump                            Neuron runtime information - please include in any support request:
2024-May-21 16:08:34.221379 58096:58096 ERROR   NRT:nrt_infodump                            ------------->8------------[ cut here ]------------>8-------------
2024-May-21 16:08:34.221387 58096:58096 ERROR   NRT:nrt_infodump                            NRT version: 2.20.22.0 (1b3ca64250aae3cc8631db08e8deb1b6b4ea2f88)
2024-May-21 16:08:34.221399 58096:58096 ERROR   NRT:nrt_infodump                            CCOM version: 2.20.22.0-c101c322e940b1 (compat 36)
2024-May-21 16:08:34.221409 58096:58096 ERROR   NRT:nrt_infodump                            Instance ID: i-0975663b229a10fd1
2024-May-21 16:08:34.221418 58096:58096 ERROR   NRT:nrt_infodump                            Cluster ID: N/A
2024-May-21 16:08:34.221426 58096:58096 ERROR   NRT:nrt_infodump                            Kernel: Linux 5.15.0-1058-aws #64~20.04.1-Ubuntu SMP Tue Apr 9 11:12:27 UTC 2024
2024-May-21 16:08:34.221433 58096:58096 ERROR   NRT:nrt_infodump                            Nodename: ip-172-31-33-90
2024-May-21 16:08:34.221447 58096:58096 ERROR   NRT:nrt_infodump                            Driver version: 2.16.7.0

2024-May-21 16:08:34.221455 58096:58096 ERROR   NRT:nrt_infodump                            Environment:
2024-May-21 16:08:34.221464 58096:58096 ERROR   NRT:nrt_infodump                                NEURON_LIBRARY_PATH=/home/ubuntu/pyvenv/aws_neuron_venv2.18_pt212/lib/python3.8/site-packages/libneuronxla/libneuronpjrt.so
2024-May-21 16:08:34.221474 58096:58096 ERROR   NRT:nrt_infodump                                NEURON_RT_ROOT_COMM_ID=localhost:62182
2024-May-21 16:08:34.221483 58096:58096 ERROR   NRT:nrt_infodump                                NEURON_PJRT_PROCESSES_NUM_DEVICES=1,1,1,1,1,1,1,1
2024-May-21 16:08:34.221493 58096:58096 ERROR   NRT:nrt_infodump                                NEURON_PJRT_PROCESS_INDEX=5
2024-May-21 16:08:34.221499 58096:58096 ERROR   NRT:nrt_infodump                                NEURON_RT_VISIBLE_CORES=5
2024-May-21 16:08:34.221513 58096:58096 ERROR   NRT:nrt_infodump                                NEURON_INTERNAL_PJRT_C_API_VERSION=0.23
2024-May-21 16:08:34.221523 58096:58096 ERROR   NRT:nrt_infodump                            -------------8<-----------[ cut to here ]-----------8<------------
2024-May-21 16:08:34.208521 58098:58098 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.216521 58092:58092 ERROR   NRT:nrt_infodump                                NEURON_RT_ROOT_COMM_ID=localhost:62182
2024-May-21 16:08:34.222911 58098:58098 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.229295 58092:58092 ERROR   NRT:nrt_infodump                                NEURON_PJRT_PROCESSES_NUM_DEVICES=1,1,1,1,1,1,1,1
2024-May-21 16:08:34.244955 58100:58100 ERROR  TDRV:tdrv_init_mla_phase1                    Could not open the nd3

2024-May-21 16:08:34.244980 58100:58100 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.244993 58100:58100 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.245006 58100:58100 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.245016 58100:58100 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.245127 58100:58100 ERROR  TDRV:tdrv_destroy                            TDRV not initialized
2024-May-21 16:08:34.245142 58100:58100 ERROR   NRT:nrt_init                                Failed to initialize devices, error:1
2024-May-21 16:08:34.245168 58100:58100 ERROR   NRT:nrt_infodump                            Neuron runtime information - please include in any support request:
2024-May-21 16:08:34.245178 58100:58100 ERROR   NRT:nrt_infodump                            ------------->8------------[ cut here ]------------>8-------------
2024-May-21 16:08:34.245187 58100:58100 ERROR   NRT:nrt_infodump                            NRT version: 2.20.22.0 (1b3ca64250aae3cc8631db08e8deb1b6b4ea2f88)
2024-May-21 16:08:34.245199 58100:58100 ERROR   NRT:nrt_infodump                            CCOM version: 2.20.22.0-c101c322e940b1 (compat 36)
2024-May-21 16:08:34.245209 58100:58100 ERROR   NRT:nrt_infodump                            Instance ID: i-0975663b229a10fd1
2024-May-21 16:08:34.245216 58100:58100 ERROR   NRT:nrt_infodump                            Cluster ID: N/A
2024-May-21 16:08:34.245222 58100:58100 ERROR   NRT:nrt_infodump                            Kernel: Linux 5.15.0-1058-aws #64~20.04.1-Ubuntu SMP Tue Apr 9 11:12:27 UTC 2024
2024-May-21 16:08:34.245236 58100:58100 ERROR   NRT:nrt_infodump                            Nodename: ip-172-31-33-90
2024-May-21 16:08:34.245254 58100:58100 ERROR   NRT:nrt_infodump                            Driver version: 2.16.7.0

2024-May-21 16:08:34.245260 58100:58100 ERROR   NRT:nrt_infodump                            Environment:
2024-May-21 16:08:34.245269 58100:58100 ERROR   NRT:nrt_infodump                                NEURON_LIBRARY_PATH=/home/ubuntu/pyvenv/aws_neuron_venv2.18_pt212/lib/python3.8/site-packages/libneuronxla/libneuronpjrt.so
2024-May-21 16:08:34.245276 58100:58100 ERROR   NRT:nrt_infodump                                NEURON_RT_ROOT_COMM_ID=localhost:62182
2024-May-21 16:08:34.245290 58100:58100 ERROR   NRT:nrt_infodump                                NEURON_PJRT_PROCESSES_NUM_DEVICES=1,1,1,1,1,1,1,1
2024-May-21 16:08:34.245302 58100:58100 ERROR   NRT:nrt_infodump                                NEURON_PJRT_PROCESS_INDEX=6
2024-May-21 16:08:34.245313 58100:58100 ERROR   NRT:nrt_infodump                                NEURON_RT_VISIBLE_CORES=6
2024-May-21 16:08:34.245325 58100:58100 ERROR   NRT:nrt_infodump                                NEURON_INTERNAL_PJRT_C_API_VERSION=0.23
2024-May-21 16:08:34.245339 58100:58100 ERROR   NRT:nrt_infodump                            -------------8<-----------[ cut to here ]-----------8<------------
2024-May-21 16:08:34.235640 58098:58098 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.242049 58092:58092 ERROR   NRT:nrt_infodump                                NEURON_PJRT_PROCESS_INDEX=2
2024-May-21 16:08:34.248203 58098:58098 ERROR  TDRV:tdrv_destroy                            TDRV not initialized
2024-May-21 16:08:34.254714 58092:58092 ERROR   NRT:nrt_infodump                                NEURON_RT_VISIBLE_CORES=2
2024-May-21 16:08:34.262136 58098:58098 ERROR   NRT:nrt_init                                Failed to initialize devices, error:1
2024-May-21 16:08:34.269570 58092:58092 ERROR   NRT:nrt_infodump                                NEURON_INTERNAL_PJRT_C_API_VERSION=0.23
2024-May-21 16:08:34.276968 58098:58098 ERROR   NRT:nrt_infodump                            Neuron runtime information - please include in any support request:
2024-May-21 16:08:34.283870 58092:58092 ERROR   NRT:nrt_infodump                            -------------8<-----------[ cut to here ]-----------8<------------
2024-May-21 16:08:34.290279 58098:58098 ERROR   NRT:nrt_infodump                            ------------->8------------[ cut here ]------------>8-------------
2024-May-21 16:08:34.333970 58102:58102 ERROR  TDRV:tdrv_init_mla_phase1                    Could not open the nd3

2024-May-21 16:08:34.334546 58098:58098 ERROR   NRT:nrt_infodump                            NRT version: 2.20.22.0 (1b3ca64250aae3cc8631db08e8deb1b6b4ea2f88)
2024-May-21 16:08:34.341246 58102:58102 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.348286 58098:58098 ERROR   NRT:nrt_infodump                            CCOM version: 2.20.22.0-c101c322e940b1 (compat 36)
2024-May-21 16:08:34.354646 58102:58102 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.360930 58098:58098 ERROR   NRT:nrt_infodump                            Instance ID: i-0975663b229a10fd1
2024-May-21 16:08:34.367652 58102:58102 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.375127 58098:58098 ERROR   NRT:nrt_infodump                            Cluster ID: N/A
2024-May-21 16:08:34.385789 58102:58102 ERROR  TDRV:notification_destroy                    Notifications not initialized! 
2024-May-21 16:08:34.392177 58098:58098 ERROR   NRT:nrt_infodump                            Kernel: Linux 5.15.0-1058-aws #64~20.04.1-Ubuntu SMP Tue Apr 9 11:12:27 UTC 2024
2024-May-21 16:08:34.398998 58102:58102 ERROR  TDRV:tdrv_destroy                            TDRV not initialized
2024-May-21 16:08:34.406910 58098:58098 ERROR   NRT:nrt_infodump                            Nodename: ip-172-31-33-90
2024-May-21 16:08:34.413221 58102:58102 ERROR   NRT:nrt_init                                Failed to initialize devices, error:1
2024-May-21 16:08:34.419577 58098:58098 ERROR   NRT:nrt_infodump                            Driver version: 2.16.7.0

2024-May-21 16:08:34.425927 58102:58102 ERROR   NRT:nrt_infodump                            Neuron runtime information - please include in any support request:
2024-May-21 16:08:34.432307 58098:58098 ERROR   NRT:nrt_infodump                            Environment:
2024-May-21 16:08:34.438358 58102:58102 ERROR   NRT:nrt_infodump                            ------------->8------------[ cut here ]------------>8-------------
2024-May-21 16:08:34.444878 58098:58098 ERROR   NRT:nrt_infodump                                NEURON_LIBRARY_PATH=/home/ubuntu/pyvenv/aws_neuron_venv2.18_pt212/lib/python3.8/site-packages/libneuronxla/libneuronpjrt.so
2024-May-21 16:08:34.452365 58102:58102 ERROR   NRT:nrt_infodump                            NRT version: 2.20.22.0 (1b3ca64250aae3cc8631db08e8deb1b6b4ea2f88)
2024-May-21 16:08:34.459826 58098:58098 ERROR   NRT:nrt_infodump                                NEURON_RT_ROOT_COMM_ID=localhost:62182
2024-May-21 16:08:34.474258 58098:58098 ERROR   NRT:nrt_infodump                                NEURON_PJRT_PROCESSES_NUM_DEVICES=1,1,1,1,1,1,1,1
2024-May-21 16:08:34.467241 58102:58102 ERROR   NRT:nrt_infodump                            CCOM version: 2.20.22.0-c101c322e940b1 (compat 36)
2024-May-21 16:08:34.474270 58098:58098 ERROR   NRT:nrt_infodump                                NEURON_PJRT_PROCESS_INDEX=3
2024-May-21 16:08:34.480664 58102:58102 ERROR   NRT:nrt_infodump                            Instance ID: i-0975663b229a10fd1
2024-May-21 16:08:34.486484 58098:58098 ERROR   NRT:nrt_infodump                                NEURON_RT_VISIBLE_CORES=3
2024-May-21 16:08:34.494338 58102:58102 ERROR   NRT:nrt_infodump                            Cluster ID: N/A
2024-May-21 16:08:34.500504 58098:58098 ERROR   NRT:nrt_infodump                                NEURON_INTERNAL_PJRT_C_API_VERSION=0.23
2024-May-21 16:08:34.508521 58102:58102 ERROR   NRT:nrt_infodump                            Kernel: Linux 5.15.0-1058-aws #64~20.04.1-Ubuntu SMP Tue Apr 9 11:12:27 UTC 2024
2024-May-21 16:08:34.514324 58098:58098 ERROR   NRT:nrt_infodump                            -------------8<-----------[ cut to here ]-----------8<------------
2024-May-21 16:08:34.525023 58102:58102 ERROR   NRT:nrt_infodump                            Nodename: ip-172-31-33-90
2024-May-21 16:08:35.060021 58102:58102 ERROR   NRT:nrt_infodump                            Driver version: 2.16.7.0

2024-May-21 16:08:35.068105 58102:58102 ERROR   NRT:nrt_infodump                            Environment:
2024-May-21 16:08:35.073895 58102:58102 ERROR   NRT:nrt_infodump                                NEURON_LIBRARY_PATH=/home/ubuntu/pyvenv/aws_neuron_venv2.18_pt212/lib/python3.8/site-packages/libneuronxla/libneuronpjrt.so
2024-May-21 16:08:35.084639 58102:58102 ERROR   NRT:nrt_infodump                                NEURON_RT_ROOT_COMM_ID=localhost:62182
2024-May-21 16:08:35.091359 58102:58102 ERROR   NRT:nrt_infodump                                NEURON_PJRT_PROCESSES_NUM_DEVICES=1,1,1,1,1,1,1,1
2024-May-21 16:08:35.098387 58102:58102 ERROR   NRT:nrt_infodump                                NEURON_PJRT_PROCESS_INDEX=7
2024-May-21 16:08:35.104737 58102:58102 ERROR   NRT:nrt_infodump                                NEURON_RT_VISIBLE_CORES=7
2024-May-21 16:08:35.111043 58102:58102 ERROR   NRT:nrt_infodump                                NEURON_INTERNAL_PJRT_C_API_VERSION=0.23
2024-May-21 16:08:35.117818 58102:58102 ERROR   NRT:nrt_infodump                            -------------8<-----------[ cut to here ]-----------8<------------
2024-May-21 16:08:50.0671 58090:58703 [1] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2024-May-21 16:08:50.0671 58090:58703 [1] init.cc:137 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2024-May-21 16:11:50.0603 58090:58703 [1] include/socket.h:468 CCOM WARN Connect to 127.0.0.1<62182> failed : Connection refused - retrying [bootstrapInit, rank: 1/8/-1]
> initializing tensor model parallel with size 8
> initializing pipeline model parallel with size 1
> initializing data parallel with size 1
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.50it/s]
[2024-05-21 16:14:07.983: I neuronx_distributed/parallel_layers/checkpointing.py:148] `load` kwarg `model` is deprecated, please use `model_or_optimizer` instead as we are supporting to use `load` with optimizer as well
[2024-05-21 16:14:07.983: I neuronx_distributed/parallel_layers/checkpointing.py:161] loading checkpoint from flan-t5-xl.pt
2024-May-21 16:14:20.0020 58091:58701 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2024-May-21 16:14:20.0020 58091:58701 [0] init.cc:137 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2024-May-21 16:16:20.0121 58091:58701 [0] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 120 sec) - retrying, [bootstrapInit, rank: 0/8/-1]
2024-May-21 16:16:20.0229 58090:58703 [1] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 120 sec) - retrying, [bootstrapInit, rank: 1/8/-1]
2024-May-21 16:16:20.0229 58091:59050 [-1] bootstrap.cc:95 CCOM WARN Timeout waiting for incoming connection (waited 120 sec), [CommInitRankDev/-4]
2024-May-21 16:18:20.0221 58091:58701 [0] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 240 sec) - retrying, [bootstrapInit, rank: 0/8/-1]
2024-May-21 16:18:20.0329 58090:58703 [1] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 240 sec) - retrying, [bootstrapInit, rank: 1/8/-1]
2024-May-21 16:18:20.0329 58091:59050 [-1] bootstrap.cc:95 CCOM WARN Timeout waiting for incoming connection (waited 240 sec), [CommInitRankDev/-4]
2024-May-21 16:22:20.0336 58091:58701 [0] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 480 sec) - retrying, [bootstrapInit, rank: 0/8/-1]
2024-May-21 16:22:20.0529 58090:58703 [1] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 480 sec) - retrying, [bootstrapInit, rank: 1/8/-1]
2024-May-21 16:22:20.0529 58091:59050 [-1] bootstrap.cc:95 CCOM WARN Timeout waiting for incoming connection (waited 480 sec), [CommInitRankDev/-4]
2024-May-21 16:30:20.0736 58091:58701 [0] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 960 sec) - retrying, [bootstrapInit, rank: 0/8/-1]
2024-May-21 16:30:20.0930 58090:58703 [1] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 960 sec) - retrying, [bootstrapInit, rank: 1/8/-1]
2024-May-21 16:30:20.0930 58091:59050 [-1] bootstrap.cc:95 CCOM WARN Timeout waiting for incoming connection (waited 960 sec), [CommInitRankDev/-4]
2024-May-21 16:46:21.0460 58091:58701 [0] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 1920 sec) - retrying, [bootstrapInit, rank: 0/8/-1]
2024-May-21 16:46:21.0637 58090:58703 [1] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 1920 sec) - retrying, [bootstrapInit, rank: 1/8/-1]
2024-May-21 16:46:21.0652 58091:59050 [-1] bootstrap.cc:95 CCOM WARN Timeout waiting for incoming connection (waited 1920 sec), [CommInitRankDev/-4]
2024-May-21 17:18:22.0868 58090:58703 [1] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 3840 sec) - retrying, [bootstrapInit, rank: 1/8/-1]
2024-May-21 17:18:23.0052 58091:58701 [0] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 3840 sec) - retrying, [bootstrapInit, rank: 0/8/-1]
2024-May-21 17:18:23.0253 58091:59050 [-1] bootstrap.cc:95 CCOM WARN Timeout waiting for incoming connection (waited 3840 sec), [CommInitRankDev/-4]

Could you guide us on solving this? Thanks!

chintan-ushur commented 1 month ago

@jyang-aws -- A few of our Neuron pipelines are blocked on this. Sorry to push again, but it would be really helpful if you could prioritize this.