Open JingyaHuang opened 3 months ago
Hello @JingyaHuang,
One of our engineers is working to reproduce the issue. We'll update this issue once we've got more information.
-Taylor
@aws-taylor : A gentle reminder on the updates.
@akhil-aws took a look from our side; this could be an implementation issue.
My guess is that the trace was launched without an "if __name__ == '__main__':" guard while multiprocessing was in use.
The parallel_model_trace API spawns multiple processes to trace and compile the model.
So I recommend running the trace inside an "if __name__ == '__main__':" block or a function.
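To illustrate the suggestion above, here is a minimal sketch of the guard pattern using only the Python standard library. It is not the Neuron example itself: `trace_rank`, `TP_DEGREE`, and `main` are hypothetical stand-ins for the real per-rank trace/compile work, and the "spawn" start method is an assumption about how the worker processes are created.

```python
import multiprocessing as mp

# Hypothetical tensor-parallel degree, kept small for the sketch.
TP_DEGREE = 2


def trace_rank(rank: int) -> str:
    """Hypothetical stand-in for the per-rank trace/compile work."""
    return f"traced rank {rank}/{TP_DEGREE}"


def main() -> None:
    # With the "spawn" start method, each child re-imports this module.
    # Any process-launching code left at module level would run again in
    # every child, so it lives in a function called only from the guard.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=TP_DEGREE) as pool:
        results = pool.map(trace_rank, range(TP_DEGREE))
    for line in results:
        print(line)


if __name__ == "__main__":
    # Runs only in the parent process, never during a child's re-import.
    main()
```

In the real script, the call that launches the trace would sit where `main()` does the pool work, so that importing the module has no side effects.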
Hi @jyang-aws, thanks for the suggestion! However, after following that guidance and relaunching the example, we saw the execution hang with repeated timeouts. Here is the log:
model-00001-of-00002.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 9.45G/9.45G [00:18<00:00, 521MB/s]
model-00002-of-00002.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 1.95G/1.95G [00:03<00:00, 490MB/s]
Downloading shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:22<00:00, 11.10s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.86it/s]
generation_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 147/147 [00:00<00:00, 63.9kB/s]
starting encoder parallel trace
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.54k/2.54k [00:00<00:00, 1.26MB/s]
spiece.model: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 792k/792k [00:00<00:00, 392MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.20k/2.20k [00:00<00:00, 3.57MB/s]
tokenizer.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.42M/2.42M [00:00<00:00, 44.4MB/s]
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING:root:Unsupported nprocs (8), ignoring...
2024-May-21 16:08:34.103141 58092:58092 ERROR TDRV:tdrv_init_mla_phase1 Could not open the nd1
2024-May-21 16:08:34.107083 58092:58092 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.110713 58092:58092 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.114306 58092:58092 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.117930 58092:58092 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.121657 58092:58092 ERROR TDRV:tdrv_destroy TDRV not initialized
2024-May-21 16:08:34.124934 58092:58092 ERROR NRT:nrt_init Failed to initialize devices, error:1
2024-May-21 16:08:34.128752 58092:58092 ERROR NRT:nrt_infodump Neuron runtime information - please include in any support request:
2024-May-21 16:08:34.133451 58092:58092 ERROR NRT:nrt_infodump ------------->8------------[ cut here ]------------>8-------------
2024-May-21 16:08:34.139484 58092:58092 ERROR NRT:nrt_infodump NRT version: 2.20.22.0 (1b3ca64250aae3cc8631db08e8deb1b6b4ea2f88)
2024-May-21 16:08:34.146881 58092:58092 ERROR NRT:nrt_infodump CCOM version: 2.20.22.0-c101c322e940b1 (compat 36)
2024-May-21 16:08:34.153827 58092:58092 ERROR NRT:nrt_infodump Instance ID: i-0975663b229a10fd1
2024-May-21 16:08:34.160210 58092:58092 ERROR NRT:nrt_infodump Cluster ID: N/A
2024-May-21 16:08:34.166120 58092:58092 ERROR NRT:nrt_infodump Kernel: Linux 5.15.0-1058-aws #64~20.04.1-Ubuntu SMP Tue Apr 9 11:12:27 UTC 2024
2024-May-21 16:08:34.174007 58092:58092 ERROR NRT:nrt_infodump Nodename: ip-172-31-33-90
2024-May-21 16:08:34.180223 58092:58092 ERROR NRT:nrt_infodump Driver version: 2.16.7.0
2024-May-21 16:08:34.187890 58098:58098 ERROR TDRV:tdrv_init_mla_phase1 Could not open the nd1
2024-May-21 16:08:34.200773 58093:58093 ERROR TDRV:tdrv_init_mla_phase1 Could not open the nd2
2024-May-21 16:08:34.200799 58093:58093 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.200816 58093:58093 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.200823 58093:58093 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.200830 58093:58093 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.200927 58093:58093 ERROR TDRV:tdrv_destroy TDRV not initialized
2024-May-21 16:08:34.200941 58093:58093 ERROR NRT:nrt_init Failed to initialize devices, error:1
2024-May-21 16:08:34.200968 58093:58093 ERROR NRT:nrt_infodump Neuron runtime information - please include in any support request:
2024-May-21 16:08:34.200978 58093:58093 ERROR NRT:nrt_infodump ------------->8------------[ cut here ]------------>8-------------
2024-May-21 16:08:34.200988 58093:58093 ERROR NRT:nrt_infodump NRT version: 2.20.22.0 (1b3ca64250aae3cc8631db08e8deb1b6b4ea2f88)
2024-May-21 16:08:34.201001 58093:58093 ERROR NRT:nrt_infodump CCOM version: 2.20.22.0-c101c322e940b1 (compat 36)
2024-May-21 16:08:34.201010 58093:58093 ERROR NRT:nrt_infodump Instance ID: i-0975663b229a10fd1
2024-May-21 16:08:34.201019 58093:58093 ERROR NRT:nrt_infodump Cluster ID: N/A
2024-May-21 16:08:34.201025 58093:58093 ERROR NRT:nrt_infodump Kernel: Linux 5.15.0-1058-aws #64~20.04.1-Ubuntu SMP Tue Apr 9 11:12:27 UTC 2024
2024-May-21 16:08:34.201035 58093:58093 ERROR NRT:nrt_infodump Nodename: ip-172-31-33-90
2024-May-21 16:08:34.201046 58093:58093 ERROR NRT:nrt_infodump Driver version: 2.16.7.0
2024-May-21 16:08:34.201055 58093:58093 ERROR NRT:nrt_infodump Environment:
2024-May-21 16:08:34.201066 58093:58093 ERROR NRT:nrt_infodump NEURON_LIBRARY_PATH=/home/ubuntu/pyvenv/aws_neuron_venv2.18_pt212/lib/python3.8/site-packages/libneuronxla/libneuronpjrt.so
2024-May-21 16:08:34.201080 58093:58093 ERROR NRT:nrt_infodump NEURON_RT_ROOT_COMM_ID=localhost:62182
2024-May-21 16:08:34.201093 58093:58093 ERROR NRT:nrt_infodump NEURON_PJRT_PROCESSES_NUM_DEVICES=1,1,1,1,1,1,1,1
2024-May-21 16:08:34.201101 58093:58093 ERROR NRT:nrt_infodump NEURON_PJRT_PROCESS_INDEX=4
2024-May-21 16:08:34.201112 58093:58093 ERROR NRT:nrt_infodump NEURON_RT_VISIBLE_CORES=4
2024-May-21 16:08:34.201121 58093:58093 ERROR NRT:nrt_infodump NEURON_INTERNAL_PJRT_C_API_VERSION=0.23
2024-May-21 16:08:34.201133 58093:58093 ERROR NRT:nrt_infodump -------------8<-----------[ cut to here ]-----------8<------------
2024-May-21 16:08:34.188282 58092:58092 ERROR NRT:nrt_infodump Environment:
2024-May-21 16:08:34.196323 58098:58098 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.202140 58092:58092 ERROR NRT:nrt_infodump NEURON_LIBRARY_PATH=/home/ubuntu/pyvenv/aws_neuron_venv2.18_pt212/lib/python3.8/site-packages/libneuronxla/libneuronpjrt.so
2024-May-21 16:08:34.221171 58096:58096 ERROR TDRV:tdrv_init_mla_phase1 Could not open the nd2
2024-May-21 16:08:34.221195 58096:58096 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.221206 58096:58096 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.221215 58096:58096 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.221225 58096:58096 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.221328 58096:58096 ERROR TDRV:tdrv_destroy TDRV not initialized
2024-May-21 16:08:34.221343 58096:58096 ERROR NRT:nrt_init Failed to initialize devices, error:1
2024-May-21 16:08:34.221368 58096:58096 ERROR NRT:nrt_infodump Neuron runtime information - please include in any support request:
2024-May-21 16:08:34.221379 58096:58096 ERROR NRT:nrt_infodump ------------->8------------[ cut here ]------------>8-------------
2024-May-21 16:08:34.221387 58096:58096 ERROR NRT:nrt_infodump NRT version: 2.20.22.0 (1b3ca64250aae3cc8631db08e8deb1b6b4ea2f88)
2024-May-21 16:08:34.221399 58096:58096 ERROR NRT:nrt_infodump CCOM version: 2.20.22.0-c101c322e940b1 (compat 36)
2024-May-21 16:08:34.221409 58096:58096 ERROR NRT:nrt_infodump Instance ID: i-0975663b229a10fd1
2024-May-21 16:08:34.221418 58096:58096 ERROR NRT:nrt_infodump Cluster ID: N/A
2024-May-21 16:08:34.221426 58096:58096 ERROR NRT:nrt_infodump Kernel: Linux 5.15.0-1058-aws #64~20.04.1-Ubuntu SMP Tue Apr 9 11:12:27 UTC 2024
2024-May-21 16:08:34.221433 58096:58096 ERROR NRT:nrt_infodump Nodename: ip-172-31-33-90
2024-May-21 16:08:34.221447 58096:58096 ERROR NRT:nrt_infodump Driver version: 2.16.7.0
2024-May-21 16:08:34.221455 58096:58096 ERROR NRT:nrt_infodump Environment:
2024-May-21 16:08:34.221464 58096:58096 ERROR NRT:nrt_infodump NEURON_LIBRARY_PATH=/home/ubuntu/pyvenv/aws_neuron_venv2.18_pt212/lib/python3.8/site-packages/libneuronxla/libneuronpjrt.so
2024-May-21 16:08:34.221474 58096:58096 ERROR NRT:nrt_infodump NEURON_RT_ROOT_COMM_ID=localhost:62182
2024-May-21 16:08:34.221483 58096:58096 ERROR NRT:nrt_infodump NEURON_PJRT_PROCESSES_NUM_DEVICES=1,1,1,1,1,1,1,1
2024-May-21 16:08:34.221493 58096:58096 ERROR NRT:nrt_infodump NEURON_PJRT_PROCESS_INDEX=5
2024-May-21 16:08:34.221499 58096:58096 ERROR NRT:nrt_infodump NEURON_RT_VISIBLE_CORES=5
2024-May-21 16:08:34.221513 58096:58096 ERROR NRT:nrt_infodump NEURON_INTERNAL_PJRT_C_API_VERSION=0.23
2024-May-21 16:08:34.221523 58096:58096 ERROR NRT:nrt_infodump -------------8<-----------[ cut to here ]-----------8<------------
2024-May-21 16:08:34.208521 58098:58098 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.216521 58092:58092 ERROR NRT:nrt_infodump NEURON_RT_ROOT_COMM_ID=localhost:62182
2024-May-21 16:08:34.222911 58098:58098 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.229295 58092:58092 ERROR NRT:nrt_infodump NEURON_PJRT_PROCESSES_NUM_DEVICES=1,1,1,1,1,1,1,1
2024-May-21 16:08:34.244955 58100:58100 ERROR TDRV:tdrv_init_mla_phase1 Could not open the nd3
2024-May-21 16:08:34.244980 58100:58100 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.244993 58100:58100 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.245006 58100:58100 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.245016 58100:58100 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.245127 58100:58100 ERROR TDRV:tdrv_destroy TDRV not initialized
2024-May-21 16:08:34.245142 58100:58100 ERROR NRT:nrt_init Failed to initialize devices, error:1
2024-May-21 16:08:34.245168 58100:58100 ERROR NRT:nrt_infodump Neuron runtime information - please include in any support request:
2024-May-21 16:08:34.245178 58100:58100 ERROR NRT:nrt_infodump ------------->8------------[ cut here ]------------>8-------------
2024-May-21 16:08:34.245187 58100:58100 ERROR NRT:nrt_infodump NRT version: 2.20.22.0 (1b3ca64250aae3cc8631db08e8deb1b6b4ea2f88)
2024-May-21 16:08:34.245199 58100:58100 ERROR NRT:nrt_infodump CCOM version: 2.20.22.0-c101c322e940b1 (compat 36)
2024-May-21 16:08:34.245209 58100:58100 ERROR NRT:nrt_infodump Instance ID: i-0975663b229a10fd1
2024-May-21 16:08:34.245216 58100:58100 ERROR NRT:nrt_infodump Cluster ID: N/A
2024-May-21 16:08:34.245222 58100:58100 ERROR NRT:nrt_infodump Kernel: Linux 5.15.0-1058-aws #64~20.04.1-Ubuntu SMP Tue Apr 9 11:12:27 UTC 2024
2024-May-21 16:08:34.245236 58100:58100 ERROR NRT:nrt_infodump Nodename: ip-172-31-33-90
2024-May-21 16:08:34.245254 58100:58100 ERROR NRT:nrt_infodump Driver version: 2.16.7.0
2024-May-21 16:08:34.245260 58100:58100 ERROR NRT:nrt_infodump Environment:
2024-May-21 16:08:34.245269 58100:58100 ERROR NRT:nrt_infodump NEURON_LIBRARY_PATH=/home/ubuntu/pyvenv/aws_neuron_venv2.18_pt212/lib/python3.8/site-packages/libneuronxla/libneuronpjrt.so
2024-May-21 16:08:34.245276 58100:58100 ERROR NRT:nrt_infodump NEURON_RT_ROOT_COMM_ID=localhost:62182
2024-May-21 16:08:34.245290 58100:58100 ERROR NRT:nrt_infodump NEURON_PJRT_PROCESSES_NUM_DEVICES=1,1,1,1,1,1,1,1
2024-May-21 16:08:34.245302 58100:58100 ERROR NRT:nrt_infodump NEURON_PJRT_PROCESS_INDEX=6
2024-May-21 16:08:34.245313 58100:58100 ERROR NRT:nrt_infodump NEURON_RT_VISIBLE_CORES=6
2024-May-21 16:08:34.245325 58100:58100 ERROR NRT:nrt_infodump NEURON_INTERNAL_PJRT_C_API_VERSION=0.23
2024-May-21 16:08:34.245339 58100:58100 ERROR NRT:nrt_infodump -------------8<-----------[ cut to here ]-----------8<------------
2024-May-21 16:08:34.235640 58098:58098 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.242049 58092:58092 ERROR NRT:nrt_infodump NEURON_PJRT_PROCESS_INDEX=2
2024-May-21 16:08:34.248203 58098:58098 ERROR TDRV:tdrv_destroy TDRV not initialized
2024-May-21 16:08:34.254714 58092:58092 ERROR NRT:nrt_infodump NEURON_RT_VISIBLE_CORES=2
2024-May-21 16:08:34.262136 58098:58098 ERROR NRT:nrt_init Failed to initialize devices, error:1
2024-May-21 16:08:34.269570 58092:58092 ERROR NRT:nrt_infodump NEURON_INTERNAL_PJRT_C_API_VERSION=0.23
2024-May-21 16:08:34.276968 58098:58098 ERROR NRT:nrt_infodump Neuron runtime information - please include in any support request:
2024-May-21 16:08:34.283870 58092:58092 ERROR NRT:nrt_infodump -------------8<-----------[ cut to here ]-----------8<------------
2024-May-21 16:08:34.290279 58098:58098 ERROR NRT:nrt_infodump ------------->8------------[ cut here ]------------>8-------------
2024-May-21 16:08:34.333970 58102:58102 ERROR TDRV:tdrv_init_mla_phase1 Could not open the nd3
2024-May-21 16:08:34.334546 58098:58098 ERROR NRT:nrt_infodump NRT version: 2.20.22.0 (1b3ca64250aae3cc8631db08e8deb1b6b4ea2f88)
2024-May-21 16:08:34.341246 58102:58102 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.348286 58098:58098 ERROR NRT:nrt_infodump CCOM version: 2.20.22.0-c101c322e940b1 (compat 36)
2024-May-21 16:08:34.354646 58102:58102 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.360930 58098:58098 ERROR NRT:nrt_infodump Instance ID: i-0975663b229a10fd1
2024-May-21 16:08:34.367652 58102:58102 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.375127 58098:58098 ERROR NRT:nrt_infodump Cluster ID: N/A
2024-May-21 16:08:34.385789 58102:58102 ERROR TDRV:notification_destroy Notifications not initialized!
2024-May-21 16:08:34.392177 58098:58098 ERROR NRT:nrt_infodump Kernel: Linux 5.15.0-1058-aws #64~20.04.1-Ubuntu SMP Tue Apr 9 11:12:27 UTC 2024
2024-May-21 16:08:34.398998 58102:58102 ERROR TDRV:tdrv_destroy TDRV not initialized
2024-May-21 16:08:34.406910 58098:58098 ERROR NRT:nrt_infodump Nodename: ip-172-31-33-90
2024-May-21 16:08:34.413221 58102:58102 ERROR NRT:nrt_init Failed to initialize devices, error:1
2024-May-21 16:08:34.419577 58098:58098 ERROR NRT:nrt_infodump Driver version: 2.16.7.0
2024-May-21 16:08:34.425927 58102:58102 ERROR NRT:nrt_infodump Neuron runtime information - please include in any support request:
2024-May-21 16:08:34.432307 58098:58098 ERROR NRT:nrt_infodump Environment:
2024-May-21 16:08:34.438358 58102:58102 ERROR NRT:nrt_infodump ------------->8------------[ cut here ]------------>8-------------
2024-May-21 16:08:34.444878 58098:58098 ERROR NRT:nrt_infodump NEURON_LIBRARY_PATH=/home/ubuntu/pyvenv/aws_neuron_venv2.18_pt212/lib/python3.8/site-packages/libneuronxla/libneuronpjrt.so
2024-May-21 16:08:34.452365 58102:58102 ERROR NRT:nrt_infodump NRT version: 2.20.22.0 (1b3ca64250aae3cc8631db08e8deb1b6b4ea2f88)
2024-May-21 16:08:34.459826 58098:58098 ERROR NRT:nrt_infodump NEURON_RT_ROOT_COMM_ID=localhost:62182
2024-May-21 16:08:34.474258 58098:58098 ERROR NRT:nrt_infodump NEURON_PJRT_PROCESSES_NUM_DEVICES=1,1,1,1,1,1,1,1
2024-May-21 16:08:34.467241 58102:58102 ERROR NRT:nrt_infodump CCOM version: 2.20.22.0-c101c322e940b1 (compat 36)
2024-May-21 16:08:34.474270 58098:58098 ERROR NRT:nrt_infodump NEURON_PJRT_PROCESS_INDEX=3
2024-May-21 16:08:34.480664 58102:58102 ERROR NRT:nrt_infodump Instance ID: i-0975663b229a10fd1
2024-May-21 16:08:34.486484 58098:58098 ERROR NRT:nrt_infodump NEURON_RT_VISIBLE_CORES=3
2024-May-21 16:08:34.494338 58102:58102 ERROR NRT:nrt_infodump Cluster ID: N/A
2024-May-21 16:08:34.500504 58098:58098 ERROR NRT:nrt_infodump NEURON_INTERNAL_PJRT_C_API_VERSION=0.23
2024-May-21 16:08:34.508521 58102:58102 ERROR NRT:nrt_infodump Kernel: Linux 5.15.0-1058-aws #64~20.04.1-Ubuntu SMP Tue Apr 9 11:12:27 UTC 2024
2024-May-21 16:08:34.514324 58098:58098 ERROR NRT:nrt_infodump -------------8<-----------[ cut to here ]-----------8<------------
2024-May-21 16:08:34.525023 58102:58102 ERROR NRT:nrt_infodump Nodename: ip-172-31-33-90
2024-May-21 16:08:35.060021 58102:58102 ERROR NRT:nrt_infodump Driver version: 2.16.7.0
2024-May-21 16:08:35.068105 58102:58102 ERROR NRT:nrt_infodump Environment:
2024-May-21 16:08:35.073895 58102:58102 ERROR NRT:nrt_infodump NEURON_LIBRARY_PATH=/home/ubuntu/pyvenv/aws_neuron_venv2.18_pt212/lib/python3.8/site-packages/libneuronxla/libneuronpjrt.so
2024-May-21 16:08:35.084639 58102:58102 ERROR NRT:nrt_infodump NEURON_RT_ROOT_COMM_ID=localhost:62182
2024-May-21 16:08:35.091359 58102:58102 ERROR NRT:nrt_infodump NEURON_PJRT_PROCESSES_NUM_DEVICES=1,1,1,1,1,1,1,1
2024-May-21 16:08:35.098387 58102:58102 ERROR NRT:nrt_infodump NEURON_PJRT_PROCESS_INDEX=7
2024-May-21 16:08:35.104737 58102:58102 ERROR NRT:nrt_infodump NEURON_RT_VISIBLE_CORES=7
2024-May-21 16:08:35.111043 58102:58102 ERROR NRT:nrt_infodump NEURON_INTERNAL_PJRT_C_API_VERSION=0.23
2024-May-21 16:08:35.117818 58102:58102 ERROR NRT:nrt_infodump -------------8<-----------[ cut to here ]-----------8<------------
2024-May-21 16:08:50.0671 58090:58703 [1] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2024-May-21 16:08:50.0671 58090:58703 [1] init.cc:137 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2024-May-21 16:11:50.0603 58090:58703 [1] include/socket.h:468 CCOM WARN Connect to 127.0.0.1<62182> failed : Connection refused - retrying [bootstrapInit, rank: 1/8/-1]
> initializing tensor model parallel with size 8
> initializing pipeline model parallel with size 1
> initializing data parallel with size 1
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.50it/s]
[2024-05-21 16:14:07.983: I neuronx_distributed/parallel_layers/checkpointing.py:148] `load` kwarg `model` is deprecated, please use `model_or_optimizer` instead as we are supporting to use `load` with optimizer as well
[2024-05-21 16:14:07.983: I neuronx_distributed/parallel_layers/checkpointing.py:161] loading checkpoint from flan-t5-xl.pt
2024-May-21 16:14:20.0020 58091:58701 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2024-May-21 16:14:20.0020 58091:58701 [0] init.cc:137 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2024-May-21 16:16:20.0121 58091:58701 [0] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 120 sec) - retrying, [bootstrapInit, rank: 0/8/-1]
2024-May-21 16:16:20.0229 58090:58703 [1] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 120 sec) - retrying, [bootstrapInit, rank: 1/8/-1]
2024-May-21 16:16:20.0229 58091:59050 [-1] bootstrap.cc:95 CCOM WARN Timeout waiting for incoming connection (waited 120 sec), [CommInitRankDev/-4]
2024-May-21 16:18:20.0221 58091:58701 [0] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 240 sec) - retrying, [bootstrapInit, rank: 0/8/-1]
2024-May-21 16:18:20.0329 58090:58703 [1] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 240 sec) - retrying, [bootstrapInit, rank: 1/8/-1]
2024-May-21 16:18:20.0329 58091:59050 [-1] bootstrap.cc:95 CCOM WARN Timeout waiting for incoming connection (waited 240 sec), [CommInitRankDev/-4]
2024-May-21 16:22:20.0336 58091:58701 [0] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 480 sec) - retrying, [bootstrapInit, rank: 0/8/-1]
2024-May-21 16:22:20.0529 58090:58703 [1] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 480 sec) - retrying, [bootstrapInit, rank: 1/8/-1]
2024-May-21 16:22:20.0529 58091:59050 [-1] bootstrap.cc:95 CCOM WARN Timeout waiting for incoming connection (waited 480 sec), [CommInitRankDev/-4]
2024-May-21 16:30:20.0736 58091:58701 [0] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 960 sec) - retrying, [bootstrapInit, rank: 0/8/-1]
2024-May-21 16:30:20.0930 58090:58703 [1] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 960 sec) - retrying, [bootstrapInit, rank: 1/8/-1]
2024-May-21 16:30:20.0930 58091:59050 [-1] bootstrap.cc:95 CCOM WARN Timeout waiting for incoming connection (waited 960 sec), [CommInitRankDev/-4]
2024-May-21 16:46:21.0460 58091:58701 [0] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 1920 sec) - retrying, [bootstrapInit, rank: 0/8/-1]
2024-May-21 16:46:21.0637 58090:58703 [1] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 1920 sec) - retrying, [bootstrapInit, rank: 1/8/-1]
2024-May-21 16:46:21.0652 58091:59050 [-1] bootstrap.cc:95 CCOM WARN Timeout waiting for incoming connection (waited 1920 sec), [CommInitRankDev/-4]
2024-May-21 17:18:22.0868 58090:58703 [1] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 3840 sec) - retrying, [bootstrapInit, rank: 1/8/-1]
2024-May-21 17:18:23.0052 58091:58701 [0] include/socket.h:541 CCOM WARN Timeout waiting for RX (waited 3840 sec) - retrying, [bootstrapInit, rank: 0/8/-1]
2024-May-21 17:18:23.0253 58091:59050 [-1] bootstrap.cc:95 CCOM WARN Timeout waiting for incoming connection (waited 3840 sec), [CommInitRankDev/-4]
Could you guide us on solving this? Thanks!
@jyang-aws: A few of our Neuron pipelines are blocked on this. Sorry to push again, but it would be really helpful if you could prioritize it.
Hi team, I am trying to add tensor parallel (TP) support to T5-like models, but when I ran the official T5 TP example, the tracing failed. Could the team help me understand how to solve this, so that we can continue the work on supporting TP for T5 models?
Error log:
My environment setup:
Thanks team!