Closed: kamal-rahimi closed this issue 4 months ago.
Hello. The log snippet doesn't show the error. Can you please attach the full log file?
Also, can you include the run command you're using?
Take a look at these Docker files for PyTorch Distributed training:
Hi @rauteric, sure, here is the full log:
2024-07-23 15:39:28,736 INFO job_manager.py:530 -- Runtime env is setting up.
[2024-07-23 15:39:58,215] [WARNING] [real_accelerator.py:162:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2024-07-23 15:39:58,215] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)
df: /root/.triton/autotune: No such file or directory
Copying file from s3://llama-2-weights/models--meta-llama--Llama-2-7b-chat-hf/refs/main to /root/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/refs/main
Time Taken to Copy: 1.0632057189941406
Syncing files from s3://llama-2-weights/models--meta-llama--Llama-2-7b-chat-hf to /root/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf
Time Taken to Sync: 1.7107725143432617
Copying file from /tmp/ray_results/code.zip to s3://ray-training-output-ue1/demo_llama2_20240723-154138/code.zip
Time Taken to Copy: 0.650888204574585
2024-07-23 15:41:40,487 INFO worker.py:1460 -- Using address 10.67.122.45:6379 set in the environment variable RAY_ADDRESS
2024-07-23 15:41:40,487 INFO worker.py:1603 -- Connecting to existing Ray cluster at address: 10.67.122.45:6379...
2024-07-23 15:41:40,494 INFO worker.py:1779 -- Connected to Ray cluster. View the dashboard at 10.67.122.45:8265
View detailed results here: ray-training-output-ue1/demo_llama2_20240723-154138
To visualize your results with TensorBoard, run: `tensorboard --logdir /tmp/ray/session_2024-07-23_15-09-19_207150_1/artifacts/2024-07-23_15-41-41/demo_llama2_20240723-154138/driver_artifacts`
(autoscaler +1m49s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +1m49s) Adding 2 node(s) of type ray-worker-p5.48xlarge.
2024-07-23 15:42:41,678 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 113.0 CPUs and 16.0 GPUs, but the cluster only has 4.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
2024-07-23 15:43:41,731 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 113.0 CPUs and 16.0 GPUs, but the cluster only has 4.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
2024-07-23 15:44:41,777 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 113.0 CPUs and 16.0 GPUs, but the cluster only has 4.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
2024-07-23 15:45:41,824 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 113.0 CPUs and 16.0 GPUs, but the cluster only has 4.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
2024-07-23 15:46:41,871 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 113.0 CPUs and 16.0 GPUs, but the cluster only has 4.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
2024-07-23 15:47:41,904 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 113.0 CPUs and 16.0 GPUs, but the cluster only has 4.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
2024-07-23 15:48:41,954 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 113.0 CPUs and 16.0 GPUs, but the cluster only has 4.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
2024-07-23 15:49:42,003 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 113.0 CPUs and 16.0 GPUs, but the cluster only has 4.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
2024-07-23 15:50:42,049 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 113.0 CPUs and 16.0 GPUs, but the cluster only has 4.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
2024-07-23 15:51:42,092 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 113.0 CPUs and 16.0 GPUs, but the cluster only has 4.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
2024-07-23 15:52:42,126 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 113.0 CPUs and 16.0 GPUs, but the cluster only has 4.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
2024-07-23 15:53:42,162 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 113.0 CPUs and 16.0 GPUs, but the cluster only has 4.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
2024-07-23 15:54:42,198 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 113.0 CPUs and 16.0 GPUs, but the cluster only has 4.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
2024-07-23 15:55:42,240 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 113.0 CPUs and 16.0 GPUs, but the cluster only has 4.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
2024-07-23 15:56:42,283 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 113.0 CPUs and 16.0 GPUs, but the cluster only has 4.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
2024-07-23 15:57:42,315 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 113.0 CPUs and 16.0 GPUs, but the cluster only has 4.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
(autoscaler +18m25s) Resized to 384 CPUs, 16 GPUs.
(autoscaler +18m25s) Adding 2 node(s) of type ray-worker-p5.48xlarge.
2024-07-23 15:58:42,357 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 113.0 CPUs and 16.0 GPUs, but the cluster only has 4.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
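For context, the 113 CPU / 16 GPU figure in these warnings is the aggregate of the Ray Train scaling request. A minimal sketch of a `ScalingConfig` that would produce such a request is shown below; the per-worker CPU count is an assumption for illustration, not taken from this job:

```python
# Illustrative sketch only: a Ray Train ScalingConfig whose aggregate request matches
# the "113.0 CPUs and 16.0 GPUs" in the warnings above.
# Assumed split: 16 workers * 1 GPU, 16 * 7 = 112 worker CPUs, plus 1 CPU for the
# coordinating Trainable actor = 113 CPUs.
from ray.train import ScalingConfig

scaling_config = ScalingConfig(
    num_workers=16,                   # one training worker per GPU (2 x p5.48xlarge)
    use_gpu=True,                     # each worker reserves one GPU
    resources_per_worker={"CPU": 7},  # assumed per-worker CPU reservation
)
```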
[36m(TrainTrainable pid=790, ip=10.67.132.202)[0m [2024-07-23 15:58:47,004] [WARNING] [real_accelerator.py:162:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[36m(TrainTrainable pid=790, ip=10.67.132.202)[0m [2024-07-23 15:58:47,005] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)
[36m(TrainTrainable pid=790, ip=10.67.132.202)[0m df: /root/.triton/autotune: No such file or directory
Training started with configuration:
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Training config │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_loop_config/checkpointing_kwargs/mode min │
│ train_loop_config/checkpointing_kwargs/monitor val_loss │
│ train_loop_config/checkpointing_kwargs/save_top_k 1 │
│ train_loop_config/ckpt_path │
│ train_loop_config/comet_experiment_kwargs/experiment_key │
│ train_loop_config/comet_experiment_kwargs/log_code False │
│ train_loop_config/comet_experiment_kwargs/log_git_metadata False │
│ train_loop_config/comet_experiment_kwargs/log_git_patch False │
│ train_loop_config/comet_experiment_kwargs/project_name adsk-ailab-ray │
│ train_loop_config/comet_experiment_kwargs/workspace adsk-ailab-tests │
│ train_loop_config/datamodule_cls ...GSM8KDataModule'> │
│ train_loop_config/datamodule_kwargs/batch_size 16 │
│ train_loop_config/datamodule_kwargs/data_loader_num_workers 0 │
│ train_loop_config/datamodule_kwargs/local_data_dir /tmp/data │
│ train_loop_config/datamodule_kwargs/tokenizer ... special=True), } │
│ train_loop_config/exp_name ...2_20240723-154138 │
│ train_loop_config/local_data_dir /tmp/data │
│ train_loop_config/local_lightning_results_dir ...-154138/lightning │
│ train_loop_config/local_results_dir /tmp/ray_results │
│ train_loop_config/model_cls ...del.Llama2Model'> │
│ train_loop_config/model_kwargs/lr 5e-06 │
│ train_loop_config/model_kwargs/model_bucket_uri ...lama-2-7b-chat-hf │
│ train_loop_config/model_kwargs/model_checkpoint_path ...a5467ad31b3b84ff0 │
│ train_loop_config/model_kwargs/model_download_dir ...lama-2-7b-chat-hf │
│ train_loop_config/model_kwargs/model_id ...lama-2-7b-chat-hf │
│ train_loop_config/model_kwargs/no_grad_ckpt False │
│ train_loop_config/model_kwargs/num_training_steps 1000 │
│ train_loop_config/model_kwargs/strategy deepspeed │
│ train_loop_config/model_kwargs/vocab_size 32004 │
│ train_loop_config/ray_checkpointing_every_n_train_steps │
│ train_loop_config/s3_data_uri ...aphs-dataset/test │
│ train_loop_config/s3_lightning_results_uri ...-154138/lightning │
│ train_loop_config/s3_results_uri ...aining-output-ue1 │
│ train_loop_config/stage fit │
│ train_loop_config/strategy deepspeed │
│ train_loop_config/strategy_kwargs/config/bf16/enabled auto │
│ train_loop_config/strategy_kwargs/config/steps_per_print 10 │
│ train_loop_config/strategy_kwargs/config/wall_clock_breakdown False │
│ train_loop_config/strategy_kwargs/config/zero_optimization/contiguous_gradients True │
│ train_loop_config/strategy_kwargs/config/zero_optimization/offload_optimizer/device cpu │
│ train_loop_config/strategy_kwargs/config/zero_optimization/offload_optimizer/pin_memory True │
│ train_loop_config/strategy_kwargs/config/zero_optimization/offload_param/device cpu │
│ train_loop_config/strategy_kwargs/config/zero_optimization/offload_param/pin_memory True │
│ train_loop_config/strategy_kwargs/config/zero_optimization/overlap_comm True │
│ train_loop_config/strategy_kwargs/config/zero_optimization/reduce_bucket_size 500000000.0 │
│ train_loop_config/strategy_kwargs/config/zero_optimization/round_robin_gradients True │
│ train_loop_config/strategy_kwargs/config/zero_optimization/stage 3 │
│ train_loop_config/strategy_kwargs/config/zero_optimization/stage3_gather_16bit_weights_on_model_save True │
│ train_loop_config/strategy_kwargs/config/zero_optimization/stage3_max_live_parameters 1000000000.0 │
│ train_loop_config/strategy_kwargs/config/zero_optimization/stage3_max_reuse_distance 1000000000.0 │
│ train_loop_config/strategy_kwargs/config/zero_optimization/stage3_param_persistence_threshold 1000000.0 │
│ train_loop_config/strategy_kwargs/config/zero_optimization/stage3_prefetch_bucket_size 500000000.0 │
│ train_loop_config/strategy_kwargs/config/zero_optimization/sub_group_size 1000000000.0 │
│ train_loop_config/trainer_kwargs/accumulate_grad_batches 1 │
│ train_loop_config/trainer_kwargs/logger ... 0x7f5ec83ec6a0>] │
│ train_loop_config/trainer_kwargs/max_epochs 1 │
│ train_loop_config/trainer_kwargs/precision bf16 │
│ train_loop_config/use_gpu True │
│ train_loop_config/use_ray_data False │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
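For readability, the flattened `strategy_kwargs/config/*` rows in the table above correspond to a nested DeepSpeed ZeRO stage 3 configuration roughly like the dict below; the values are copied from the table, so this is the same config unflattened rather than anything new:

```python
# The DeepSpeed config implied by the strategy_kwargs/config/* rows above,
# reconstructed as a nested dict for readability (values copied from the table).
deepspeed_config = {
    "bf16": {"enabled": "auto"},
    "steps_per_print": 10,
    "wall_clock_breakdown": False,
    "zero_optimization": {
        "stage": 3,
        "contiguous_gradients": True,
        "overlap_comm": True,
        "round_robin_gradients": True,
        "reduce_bucket_size": 5e8,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "stage3_gather_16bit_weights_on_model_save": True,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_param_persistence_threshold": 1e6,
        "stage3_prefetch_bucket_size": 5e8,
        "sub_group_size": 1e9,
    },
}
```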
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m Setting up process group for: env:// [rank=0, world_size=16]
[36m(RayTrainWorker pid=952, ip=10.67.132.202)[0m [W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[36m(TorchTrainer pid=790, ip=10.67.132.202)[0m Started distributed worker processes:
[36m(TorchTrainer pid=790, ip=10.67.132.202)[0m - (ip=10.67.132.202, pid=950) world_rank=0, local_rank=0, node_rank=0
[36m(TorchTrainer pid=790, ip=10.67.132.202)[0m - (ip=10.67.132.202, pid=951) world_rank=1, local_rank=1, node_rank=0
[36m(TorchTrainer pid=790, ip=10.67.132.202)[0m - (ip=10.67.132.202, pid=952) world_rank=2, local_rank=2, node_rank=0
[36m(TorchTrainer pid=790, ip=10.67.132.202)[0m - (ip=10.67.132.202, pid=953) world_rank=3, local_rank=3, node_rank=0
[36m(TorchTrainer pid=790, ip=10.67.132.202)[0m - (ip=10.67.132.202, pid=955) world_rank=4, local_rank=4, node_rank=0
[36m(TorchTrainer pid=790, ip=10.67.132.202)[0m - (ip=10.67.132.202, pid=956) world_rank=5, local_rank=5, node_rank=0
[36m(TorchTrainer pid=790, ip=10.67.132.202)[0m - (ip=10.67.132.202, pid=954) world_rank=6, local_rank=6, node_rank=0
[36m(TorchTrainer pid=790, ip=10.67.132.202)[0m - (ip=10.67.132.202, pid=957) world_rank=7, local_rank=7, node_rank=0
[36m(TorchTrainer pid=790, ip=10.67.132.202)[0m - (ip=10.67.132.117, pid=944) world_rank=8, local_rank=0, node_rank=1
[36m(TorchTrainer pid=790, ip=10.67.132.202)[0m - (ip=10.67.132.117, pid=945) world_rank=9, local_rank=1, node_rank=1
[36m(TorchTrainer pid=790, ip=10.67.132.202)[0m - (ip=10.67.132.117, pid=946) world_rank=10, local_rank=2, node_rank=1
[36m(TorchTrainer pid=790, ip=10.67.132.202)[0m - (ip=10.67.132.117, pid=947) world_rank=11, local_rank=3, node_rank=1
[36m(TorchTrainer pid=790, ip=10.67.132.202)[0m - (ip=10.67.132.117, pid=948) world_rank=12, local_rank=4, node_rank=1
[36m(TorchTrainer pid=790, ip=10.67.132.202)[0m - (ip=10.67.132.117, pid=949) world_rank=13, local_rank=5, node_rank=1
[36m(TorchTrainer pid=790, ip=10.67.132.202)[0m - (ip=10.67.132.117, pid=950) world_rank=14, local_rank=6, node_rank=1
[36m(TorchTrainer pid=790, ip=10.67.132.202)[0m - (ip=10.67.132.117, pid=951) world_rank=15, local_rank=7, node_rank=1
[36m(RayTrainWorker pid=952, ip=10.67.132.202)[0m [2024-07-23 15:58:57,595] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[36m(RayTrainWorker pid=952, ip=10.67.132.202)[0m [93m [WARNING] [0m async_io requires the dev libaio .so object and headers but these were not found.
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m [93m [WARNING] [0m async_io: please install the libaio-dev package with apt
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m [93m [WARNING] [0m If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m [93m [WARNING] [0m Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m df: /root/.triton/autotune: No such file or directory
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m [93m [WARNING] [0m sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m [93m [WARNING] [0m using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m Syncing files from s3://sketch-graphs-dataset/test to /tmp/data
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m Time Taken to Sync: 0.9033362865447998
[36m(RayTrainWorker pid=949, ip=10.67.132.117)[0m [2024-07-23 15:58:58,060] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)[32m [repeated 15x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)[0m
[36m(RayTrainWorker pid=951, ip=10.67.132.117)[0m [93m [WARNING] [0m async_io requires the dev libaio .so object and headers but these were not found.[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=951, ip=10.67.132.117)[0m [93m [WARNING] [0m async_io: please install the libaio-dev package with apt[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=951, ip=10.67.132.117)[0m [93m [WARNING] [0m If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=951, ip=10.67.132.117)[0m [93m [WARNING] [0m Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m /opt/miniconda/lib/python3.10/site-packages/lightning_fabric/connector.py:571: `precision=bf16` is supported for historical reasons but its usage is discouraged. Please set your precision to bf16-mixed instead!
[36m(RayTrainWorker pid=949, ip=10.67.132.117)[0m [W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.117)[0m [93m [WARNING] [0m sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.117)[0m [93m [WARNING] [0m using untested triton version (2.3.1), only 1.0.0 is known to be compatible[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m Syncing files from s3://llama-2-weights/models--meta-llama--Llama-2-7b-chat-hf to /root/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m GPU available: True (cuda), used: True
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m TPU available: False, using: 0 TPU cores
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m HPU available: False, using: 0 HPUs
[36m(RayTrainWorker pid=945, ip=10.67.132.117)[0m df: /root/.triton/autotune: No such file or directory
[36m(RayTrainWorker pid=952, ip=10.67.132.202)[0m Syncing files from s3://sketch-graphs-dataset/test to /tmp/data[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.117)[0m Time Taken to Sync: 0.9122803211212158[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m Syncing files from s3://llama-2-weights/models--meta-llama--Llama-2-7b-chat-hf to /root/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m /opt/miniconda/lib/python3.10/site-packages/lightning_fabric/connector.py:571: `precision=bf16` is supported for historical reasons but its usage is discouraged. Please set your precision to bf16-mixed instead![32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=948, ip=10.67.132.117)[0m Syncing files from s3://sketch-graphs-dataset/test to /tmp/data[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m Time Taken to Sync: 0.9363911151885986[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=951, ip=10.67.132.117)[0m /opt/miniconda/lib/python3.10/site-packages/lightning_fabric/connector.py:571: `precision=bf16` is supported for historical reasons but its usage is discouraged. Please set your precision to bf16-mixed instead![32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=945, ip=10.67.132.117)[0m Syncing files from s3://sketch-graphs-dataset/test to /tmp/data[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m Time Taken to Sync: 0.9446144104003906[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m /opt/miniconda/lib/python3.10/site-packages/lightning_fabric/connector.py:571: `precision=bf16` is supported for historical reasons but its usage is discouraged. Please set your precision to bf16-mixed instead![32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m Syncing files from s3://sketch-graphs-dataset/test to /tmp/data[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m Time Taken to Sync: 0.9098861217498779[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
[36m(RayTrainWorker pid=948, ip=10.67.132.117)[0m Syncing files from s3://llama-2-weights/models--meta-llama--Llama-2-7b-chat-hf to /root/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m
Loading checkpoint shards: 50%|█████ | 1/2 [00:02<00:02, 2.37s/it]
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00, 1.55s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00, 1.67s/it]
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m /tmp/ray/session_2024-07-23_15-09-19_207150_1/runtime_resources/pip/65d1f8c1a8ed8c42bc7bbe1ceebdbe22d794eb29/virtualenv/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:567: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m warnings.warn(
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m /tmp/ray/session_2024-07-23_15-09-19_207150_1/runtime_resources/pip/65d1f8c1a8ed8c42bc7bbe1ceebdbe22d794eb29/virtualenv/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:572: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m warnings.warn(
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m /tmp/ray/session_2024-07-23_15-09-19_207150_1/runtime_resources/pip/65d1f8c1a8ed8c42bc7bbe1ceebdbe22d794eb29/virtualenv/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:567: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m warnings.warn(
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m /tmp/ray/session_2024-07-23_15-09-19_207150_1/runtime_resources/pip/65d1f8c1a8ed8c42bc7bbe1ceebdbe22d794eb29/virtualenv/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:572: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m warnings.warn(
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m Time Taken to Sync: 0.9018194675445557[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s][32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m Syncing files from s3://llama-2-weights/models--meta-llama--Llama-2-7b-chat-hf to /root/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m
Loading checkpoint shards: 50%|█████ | 1/2 [00:02<00:02, 2.17s/it][32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00, 1.45s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00, 1.56s/it][32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m /tmp/ray/session_2024-07-23_15-09-19_207150_1/runtime_resources/pip/65d1f8c1a8ed8c42bc7bbe1ceebdbe22d794eb29/virtualenv/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:567: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m warnings.warn([32m [repeated 20x across cluster][0m
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m /tmp/ray/session_2024-07-23_15-09-19_207150_1/runtime_resources/pip/65d1f8c1a8ed8c42bc7bbe1ceebdbe22d794eb29/virtualenv/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:572: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m /tmp/ray/session_2024-07-23_15-09-19_207150_1/runtime_resources/pip/65d1f8c1a8ed8c42bc7bbe1ceebdbe22d794eb29/virtualenv/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:567: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m /tmp/ray/session_2024-07-23_15-09-19_207150_1/runtime_resources/pip/65d1f8c1a8ed8c42bc7bbe1ceebdbe22d794eb29/virtualenv/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:572: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=951, ip=10.67.132.117)[0m Time Taken to Sync: 0.8856556415557861[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.117)[0m
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s][32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m Syncing files from s3://llama-2-weights/models--meta-llama--Llama-2-7b-chat-hf to /root/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m initializing deepspeed distributed: GLOBAL_RANK: 8, MEMBER: 9/16
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m Missing logger folder: logs/lightning_logs
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m /opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/configuration_validator.py:70: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.
[36m(RayTrainWorker pid=950, ip=10.67.132.117)[0m
Loading checkpoint shards: 50%|█████ | 1/2 [00:03<00:03, 3.21s/it][32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:2243 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker,lo
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:2243 [0] NCCL INFO Bootstrap : Using eth0:10.67.132.202<0>
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:2243 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:2243 [0] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v7)
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:2243 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:2243 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:2243 [0] NCCL INFO cudaDriverVersion 12030
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m NCCL version 2.20.5+cuda12.1
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.8.1-aws
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.8.1-aws
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI Using Libfabric version 1.19
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI Using CUDA driver version 12030
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI Configuring AWS-specific options
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI Setting NCCL_NET_FORCE_FLUSH=0 for Hopper GPUs
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI Running on p5.48xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/amazon/efa/share/aws-ofi-nccl/xml/p5.48xl-topo.xml
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI Internode latency set at 75.0 us
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI Selected Provider is efa (found 30 nics)
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI Using transport protocol RDMA
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 0 device #0 0000:52:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 0 device #1 0000:51:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 0 device #2 0000:50:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 0 device #3 0000:4f:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 1 device #0 0000:63:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 1 device #1 0000:62:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 1 device #2 0000:60:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 2 device #0 0000:74:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 2 device #1 0000:73:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 2 device #2 0000:72:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 3 device #0 0000:85:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 3 device #1 0000:84:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 3 device #2 0000:83:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 3 device #3 0000:82:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 4 device #0 0000:96:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 4 device #1 0000:95:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 4 device #2 0000:94:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 4 device #3 0000:93:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 5 device #0 0000:a7:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 5 device #1 0000:a6:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 5 device #2 0000:a5:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 5 device #3 0000:a4:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 6 device #0 0000:b8:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 6 device #1 0000:b7:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 6 device #2 0000:b6:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 6 device #3 0000:b5:00.0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-maleksk-
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m ini-worker-ray-worker-p
[36m(RayTrainWorker pid=950, ip=10.67.132.117)[0m
Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00, 1.85s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00, 2.05s/it][32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.117)[0m /tmp/ray/session_2024-07-23_15-09-19_207150_1/runtime_resources/pip/65d1f8c1a8ed8c42bc7bbe1ceebdbe22d794eb29/virtualenv/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:567: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.117)[0m warnings.warn([32m [repeated 16x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.117)[0m /tmp/ray/session_2024-07-23_15-09-19_207150_1/runtime_resources/pip/65d1f8c1a8ed8c42bc7bbe1ceebdbe22d794eb29/virtualenv/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:572: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.117)[0m /tmp/ray/session_2024-07-23_15-09-19_207150_1/runtime_resources/pip/65d1f8c1a8ed8c42bc7bbe1ceebdbe22d794eb29/virtualenv/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:567: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.117)[0m /tmp/ray/session_2024-07-23_15-09-19_207150_1/runtime_resources/pip/65d1f8c1a8ed8c42bc7bbe1ceebdbe22d794eb29/virtualenv/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:572: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s][32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m Time Taken to Sync: 0.947108268737793[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m initializing deepspeed distributed: GLOBAL_RANK: 4, MEMBER: 5/16[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m Missing logger folder: logs/lightning_logs[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m Syncing files from s3://llama-2-weights/models--meta-llama--Llama-2-7b-chat-hf to /root/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=945, ip=10.67.132.117)[0m
Loading checkpoint shards: 50%|█████ | 1/2 [00:02<00:02, 2.12s/it][32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:946:2239 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker,lo[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:946:2239 [2] NCCL INFO Bootstrap : Using eth0:10.67.132.117<0>[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:946:2239 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.[32m [repeated 12x across cluster][0m
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:946:2239 [2] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v7)[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:946:2239 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:946:2239 [2] NCCL INFO cudaDriverVersion 12030[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:946:3093 [2] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.8.1-aws[32m [repeated 12x across cluster][0m
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:946:3093 [2] NCCL INFO NET/OFI Using Libfabric version 1.19[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:946:3093 [2] NCCL INFO NET/OFI Using CUDA driver version 12030[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:946:3093 [2] NCCL INFO NET/OFI Configuring AWS-specific options[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:946:3093 [2] NCCL INFO NET/OFI Setting NCCL_NET_FORCE_FLUSH=0 for Hopper GPUs[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:946:3093 [2] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:946:3093 [2] NCCL INFO NET/OFI Running on p5.48xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/amazon/efa/share/aws-ofi-nccl/xml/p5.48xl-topo.xml[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:946:3093 [2] NCCL INFO NET/OFI Internode latency set at 75.0 us[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:946:3093 [2] NCCL INFO NET/OFI Selected Provider is efa (found 30 nics)[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:946:3093 [2] NCCL INFO NET/OFI Using transport protocol RDMA[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:946:3093 [2] NCCL INFO NET/OFI NIC group 4 device #1 0000:95:00.0[32m [repeated 108x across cluster][0m
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m ini-worker-ray-worker-p[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00, 1.43s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00, 1.54s/it][32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m /tmp/ray/session_2024-07-23_15-09-19_207150_1/runtime_resources/pip/65d1f8c1a8ed8c42bc7bbe1ceebdbe22d794eb29/virtualenv/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:567: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m warnings.warn([32m [repeated 24x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m /tmp/ray/session_2024-07-23_15-09-19_207150_1/runtime_resources/pip/65d1f8c1a8ed8c42bc7bbe1ceebdbe22d794eb29/virtualenv/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:572: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m /tmp/ray/session_2024-07-23_15-09-19_207150_1/runtime_resources/pip/65d1f8c1a8ed8c42bc7bbe1ceebdbe22d794eb29/virtualenv/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:567: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m /tmp/ray/session_2024-07-23_15-09-19_207150_1/runtime_resources/pip/65d1f8c1a8ed8c42bc7bbe1ceebdbe22d794eb29/virtualenv/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:572: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.117)[0m initializing deepspeed distributed: GLOBAL_RANK: 14, MEMBER: 15/16[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.117)[0m Missing logger folder: logs/lightning_logs[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m
Loading checkpoint shards: 50%|█████ | 1/2 [00:02<00:02, 2.14s/it]
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:2235 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker,lo[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:2235 [3] NCCL INFO Bootstrap : Using eth0:10.67.132.117<0>[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:2235 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.[32m [repeated 12x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:2235 [3] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v7)[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:2235 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:2235 [3] NCCL INFO cudaDriverVersion 12030[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:3124 [3] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.8.1-aws[32m [repeated 12x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:3124 [3] NCCL INFO NET/OFI Using Libfabric version 1.19[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:3124 [3] NCCL INFO NET/OFI Using CUDA driver version 12030[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:3124 [3] NCCL INFO NET/OFI Configuring AWS-specific options[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:3124 [3] NCCL INFO NET/OFI Setting NCCL_NET_FORCE_FLUSH=0 for Hopper GPUs[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:3124 [3] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:3124 [3] NCCL INFO NET/OFI Running on p5.48xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/amazon/efa/share/aws-ofi-nccl/xml/p5.48xl-topo.xml[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:3124 [3] NCCL INFO NET/OFI Internode latency set at 75.0 us[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:3124 [3] NCCL INFO NET/OFI Selected Provider is efa (found 30 nics)[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:3124 [3] NCCL INFO NET/OFI Using transport protocol RDMA[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:3124 [3] NCCL INFO NET/OFI NIC group 4 device #1 0000:95:00.0[32m [repeated 108x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=945, ip=10.67.132.117)[0m initializing deepspeed distributed: GLOBAL_RANK: 9, MEMBER: 10/16[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=945, ip=10.67.132.117)[0m Missing logger folder: logs/lightning_logs[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] nccl_net_ofi_rdma_init:5966 NCCL WARN NET/OFI Wrong number of NICs for device 1. Expected 4 but got 3
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] nccl_net_ofi_create_plugin:1018 NCCL WARN NET/OFI Failed to initialize rdma protocol
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] nccl_net_ofi_create_plugin:1067 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO net.cc:56 -> 2
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/IB : No device found.
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/Socket : Using [0]eth0:10.67.132.202<0>
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Using non-device net plugin version 0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Using network Socket
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO comm 0x7f25293fcdd0 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 53000 commId 0xb28cabf3135b26f2 - Init START
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/amazon/efa/share/aws-ofi-nccl/xml/p5.48xl-topo.xml
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Setting affinity for GPU 0 to ffff,ffffffff,00000000,0000ffff,ffffffff
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NVLS multicast support is available on dev 0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO comm 0x7f25293fcdd0 rank 0 nRanks 16 nNodes 2 localRanks 8 localRank 0 MNNVL 0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NVLS Head 0: 0 8
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 00/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 01/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 02/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 03/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 04/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 05/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 06/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 07/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 08/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 09/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 10/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 11/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-mal
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker,lo[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:956:2250 [5] NCCL INFO Bootstrap : Using eth0:10.67.132.202<0>[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:956:2250 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:956:2250 [5] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v7)[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:956:2250 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:956:2250 [5] NCCL INFO cudaDriverVersion 12030[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:956:3163 [5] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.8.1-aws[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:956:3163 [5] NCCL INFO NET/OFI Using Libfabric version 1.19[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:956:3163 [5] NCCL INFO NET/OFI Using CUDA driver version 12030[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:956:3163 [5] NCCL INFO NET/OFI Configuring AWS-specific options[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:956:3163 [5] NCCL INFO NET/OFI Setting NCCL_NET_FORCE_FLUSH=0 for Hopper GPUs[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:956:3163 [5] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:956:3163 [5] NCCL INFO NET/OFI Running on p5.48xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/amazon/efa/share/aws-ofi-nccl/xml/p5.48xl-topo.xml[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:956:3163 [5] NCCL INFO NET/OFI Internode latency set at 75.0 us[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:956:3163 [5] NCCL INFO NET/OFI Selected Provider is efa (found 30 nics)[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:956:3163 [5] NCCL INFO NET/OFI Using transport protocol RDMA[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NET/OFI NIC group 7 device #3 0000:c6:00.0[32m [repeated 58x across cluster][0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m ini-worker-ray-worker-p[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m eksk-bernini-worker-p5:950:3034 [0] NCCL INFO Channel 12/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 13/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 14/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 15/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 1/-1/-1->0->8 [2] 1/8/-1->0->-1 [3] 1/-1/-1->0->8 [4] 1/8/-1->0->-1 [5] 1/-1/-1->0->8 [6] 1/8/-1->0->-1 [7] 1/-1/-1->0->8 [8] 1/8/-1->0->-1 [9] 1/-1/-1->0->8 [10] 1/8/-1->0->-1 [11] 1/-1/-1->0->8 [12] 1/8/-1->0->-1 [13] 1/-1/-1->0->8 [14] 1/8/-1->0->-1 [15] 1/-1/-1->0->8
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NCCL_BUFFSIZE set by environment to 8388608.
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO NCCL_P2P_NET_CHUNKSIZE set by environment to 524288.
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO P2P Chunksize set to 524288
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 00/0 : 9[1] -> 0[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 01/0 : 9[1] -> 0[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 02/0 : 9[1] -> 0[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 03/0 : 9[1] -> 0[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 04/0 : 9[1] -> 0[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 05/0 : 9[1] -> 0[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 06/0 : 9[1] -> 0[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 07/0 : 9[1] -> 0[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 08/0 : 9[1] -> 0[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 09/0 : 9[1] -> 0[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 10/0 : 9[1] -> 0[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 11/0 : 9[1] -> 0[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 12/0 : 9[1] -> 0[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 13/0 : 9[1] -> 0[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 14/0 : 9[1] -> 0[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 15/0 : 9[1] -> 0[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 00/0 : 0[0] -> 7[7] via P2P/CUMEM
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 01/0 : 0[0] -> 7[7] via P2P/CUMEM
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 02/0 : 0[0] -> 7[7] via P2P/CUMEM
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 03/0 : 0[0] -> 7[7] via P2P/CUMEM
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 04/0 : 0[0] -> 7[7] via P2P/CUMEM
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 05/0 : 0[0] -> 7[7] via P2P/CUMEM
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 06/0 : 0[0] -> 7[7] via P2P/CUMEM
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 07/0 : 0[0] -> 7[7] via P2P/CUMEM
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 08/0 : 0[0] -> 7[7] via P2P/
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=952, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=952, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=952, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=952, ip=10.67.132.202)[0m ini-worker-ray-maleksk-b
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:946:3093 [2] NCCL INFO Channel
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:3124 [3]
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=948, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=948, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=948, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=948, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:
[36m(RayTrainWorker pid=950, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=950, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=950, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=950, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m ini-worker-
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3103 [1] NCCL INFO Channel 00/0 : 1[1] -> 8[0] [send] via NET/Socket/0
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3103 [1] NCCL INFO Channel 01/0 : 1[1] -> 8[0] [send] via NET/Socket/0
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-
[36m(RayTrainWorker pid=951, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=951, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=951, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=951, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=945, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=945, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=945, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=949, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=949, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=949, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=949, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m ] -> 8[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:944:3025 [0] NCCL INFO Connected all rings
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m CUMEM
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-maleks
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3103 [1] NCCL I
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m [0] -> 9[1] via P2P/CUMEM
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:944:3025 [0
[36m(RayTrainWorker pid=945, ip=10.67.132.117)[0m 02/0 : 9[1] -> 0[0] [send] via NET/Socket/0
[36m(RayTrainWorker pid=945, ip=10.67.132.117)[0m ini-wo
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Downloading readme: 0%| | 0.00/7.94k [00:00<?, ?B/s]
Downloading readme: 100%|██████████| 7.94k/7.94k [00:00<00:00, 50.8MB/s]
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m initializing deepspeed distributed: GLOBAL_RANK: 5, MEMBER: 6/16
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m Missing logger folder: logs/lightning_logs
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m
Downloading data: 0%| | 0.00/2.31M [00:00<?, ?B/s]
Downloading data: 100%|██████████| 2.31M/2.31M [00:00<00:00, 39.8MB/s]
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m
Downloading data: 0%| | 0.00/419k [00:00<?, ?B/s]
Downloading data: 100%|██████████| 419k/419k [00:00<00:00, 10.7MB/s]
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m
Generating train split: 0%| | 0/7473 [00:00<?, ? examples/s]
Generating train split: 100%|██████████| 7473/7473 [00:00<00:00, 574403.20 examples/s]
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m
Generating test split: 0%| | 0/1319 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 1319/1319 [00:00<00:00, 573472.27 examples/s]
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:954:3133 [6] nccl_net_ofi_rdma_init:5966 NCCL WARN NET/OFI Wrong number of NICs for device 1. Expected 4 but got 3[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:954:3133 [6] nccl_net_ofi_create_plugin:1018 NCCL WARN NET/OFI Failed to initialize rdma protocol[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:954:3133 [6] nccl_net_ofi_create_plugin:1067 NCCL WARN NET/OFI aws-ofi-nccl initialization failed[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:954:3133 [6] NCCL INFO net.cc:56 -> 2[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:954:3133 [6] NCCL INFO NET/IB : No device found.[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:954:3133 [6] NCCL INFO NET/Socket : Using [0]eth0:10.67.132.202<0>[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:954:3133 [6] NCCL INFO Using non-device net plugin version 0[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:954:3133 [6] NCCL INFO Using network Socket[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:954:3133 [6] NCCL INFO comm 0x7f3aed3f8c00 rank 6 nranks 16 cudaDev 6 nvmlDev 6 busId b9000 commId 0xb28cabf3135b26f2 - Init START[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:954:3133 [6] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/amazon/efa/share/aws-ofi-nccl/xml/p5.48xl-topo.xml[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:954:3133 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:954:3133 [6] NCCL INFO NVLS multicast support is available on dev 6[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:954:3133 [6] NCCL INFO comm 0x7f3aed3f8c00 rank 6 nRanks 16 nNodes 2 localRanks 8 localRank 6 MNNVL 0[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:954:3133 [6] NCCL INFO NVLS Head 0: 0 8[32m [repeated 7x across cluster][0m
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:944:3025 [0] NCCL INFO Channel 06/0 : 8[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:954:3133 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker,lo[32m [repeated 30x across cluster][0m
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:954:3133 [6] NCCL INFO NET/OFI NIC group 7 device #3 0000:c6:00.0[32m [repeated 180x across cluster][0m
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:954:3133 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:954:3133 [6] NCCL INFO NCCL_BUFFSIZE set by environment to 8388608.[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:954:3133 [6] NCCL INFO NCCL_P2P_NET_CHUNKSIZE set by environment to 524288.[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:954:3133 [6] NCCL INFO P2P Chunksize set to 524288[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:944:3025 [0] NCCL INFO Channel 15/0 : 0[0] -> 8[0] [receive] via NET/Socket/0[32m [repeated 44x across cluster][0m
[36m(RayTrainWorker pid=945, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:945:3154 [1] NCCL INFO Channel 06/0 : 9[1] -> 8[0] via P2P/CUMEM[32m [repeated 123x across cluster][0m
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-maleksk-b
[36m(RayTrainWorker pid=945, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:945:3154 [1] NCCL INFO Channel
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m ini-worker-[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=945, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:945:3154 [1] NCCL INFO Channel 15/0 : 9[1] -> 0[0] [send] via NET/Socket/0[32m [repeated 39x across cluster][0m
[36m(RayTrainWorker pid=945, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:945:3154 [1] NCCL INFO Connected all rings[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m
Downloading readme: 0%| | 0.00/7.94k [00:00<?, ?B/s]
Downloading readme: 100%|██████████| 7.94k/7.94k [00:00<00:00, 46.4MB/s]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Downloading data: 0%| | 0.00/419k [00:00<?, ?B/s]
Downloading data: 100%|██████████| 419k/419k [00:00<00:00, 9.88MB/s][32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Generating train split: 0%| | 0/7473 [00:00<?, ? examples/s]
Generating train split: 100%|██████████| 7473/7473 [00:00<00:00, 634237.83 examples/s]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Generating test split: 0%| | 0/1319 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 1319/1319 [00:00<00:00, 544730.90 examples/s]
[36m(RayTrainWorker pid=951, ip=10.67.132.117)[0m Creating extension directory /root/.cache/torch_extensions/py310_cu121/cpu_adam...
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m Building extension module cpu_adam...
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[36m(RayTrainWorker pid=951, ip=10.67.132.117)[0m [1/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/miniconda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /opt/miniconda/lib/python3.10/site-packages/torch/include -isystem /opt/miniconda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/miniconda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/miniconda/lib/python3.10/site-packages/torch/include/THC -isystem /opt/miniconda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /opt/miniconda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[36m(RayTrainWorker pid=949, ip=10.67.132.117)[0m Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m [2/3] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/miniconda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /opt/miniconda/lib/python3.10/site-packages/torch/include -isystem /opt/miniconda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/miniconda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/miniconda/lib/python3.10/site-packages/torch/include/THC -isystem /opt/miniconda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /opt/miniconda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m Time to load cpu_adam op: 25.030778646469116 seconds
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m Loading extension module cpu_adam...
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7][32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=949, ip=10.67.132.117)[0m Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m Creating extension directory /root/.cache/torch_extensions/py310_cu121/cpu_adam...
[36m(RayTrainWorker pid=951, ip=10.67.132.117)[0m Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
[36m(RayTrainWorker pid=951, ip=10.67.132.117)[0m Building extension module cpu_adam...
[36m(RayTrainWorker pid=951, ip=10.67.132.117)[0m Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
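The ninja message above mentions MAX_JOBS; if the JIT build of the cpu_adam extension ever needs an explicit parallelism cap, it can be set before DeepSpeed is imported. A minimal sketch, with an arbitrary illustrative value:

```python
import os

# Cap the number of parallel compile jobs ninja uses when DeepSpeed
# JIT-builds the cpu_adam extension; 16 is purely an illustrative value.
os.environ["MAX_JOBS"] = "16"
```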
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m [3/3] c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/opt/miniconda/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ernini-worker-p5:953:3054 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Connected all rings
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Connected all trees
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO NVLS comm 0x7ed9a9400fc0 headRank -1 nHeads 1 buffSize 8388608 memSize 2097152 nvlsPerRankSize 335544320 nvlsTotalSize 335544320
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO Connected NVLS tree
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO 16 coll channels, 0 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3054 [3] NCCL INFO comm 0x7ed9a9400fc0 rank 3 nranks 16 cudaDev 3 nvmlDev 3 busId 86000 commId 0xb28cabf3135b26f2 - Init COMPLETE
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m [rank3]:[W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m UMEM
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m ] NCCL INFO Channel 10/0 : 8[0] -> 0[0] [send] via NET/Socket/0
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:944:3025 [0] NCCL INFO Channel 11/0 : 8[0] -> 0[0] [send] via NET/Socket/0
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:944:3025 [0] NCCL INFO Channel 12/0 : 8[0] -> 0[0] [send] via NET/Socket/0
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:944:3025 [0] NCCL INFO Channel 13/0 : 8[0] -> 0[0] [send] via NET/Socket/0
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:944:3025 [0] NCCL INFO Channel 14/0 : 8[0] -> 0[0] [send] via NET/Socket/0
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:944:3025 [0] NCCL INFO Channel 15/0 : 8[0] -> 0[0] [send] via NET/Socket/0
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m 02/0 : 10[2] -> 9[1] via P2P/CUMEM
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m k-bernini-worker-p5:950:3034 [0] NCCL INFO Channel 13/0 : 8[0] -> 0[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 14/0 : 8[0] -> 0[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 15/0 : 8[0] -> 0[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m NFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Using non-device net plugin version 0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Using network Socket
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO bootstrapSplit: comm 0x7f2529911320 parent 0x7f25293fcdd0 rank 0 nranks 16 color 1197013201 key 0 prev 15 next 1 - DONE
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO comm 0x7f2529911320 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 53000 commId 0xb2d16673d0abaf0a - Init START
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/amazon/efa/share/aws-ofi-nccl/xml/p5.48xl-topo.xml
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Setting affinity for GPU 0 to ffff,ffffffff,00000000,0000ffff,ffffffff
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO NVLS multicast support is available on dev 0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO comm 0x7f2529911320 rank 0 nRanks 16 nNodes 2 localRanks 8 localRank 0 MNNVL 0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO NVLS Head 0: 0 8
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 00/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 01/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 02/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 03/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 04/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 05/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 06/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 07/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 08/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 09/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 10/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 11/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 12/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 13/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 14/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 15/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 1/-1/-1->0->8 [2] 1/8/-1->0->-1 [3] 1/-1/-1->0->8 [4] 1/8/-1->0->-1 [5] 1/-1/-1->0->8 [6] 1/8/-1->0->-1 [7] 1/-1/-1->0->8 [8] 1/8/-1->0->-1 [9] 1/-1/-1->0->8 [10] 1/8/-1->0->-1 [11] 1/-1/-1->0->8 [12] 1/8/-1->0->-1 [13] 1/-1/-1->0->8 [14] 1/8/-1->0->-1 [15] 1/-1/-1->0->8
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO P2P Chunksize set to 524288
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 00/0 : 9[1] -> 0[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-maleksk-bernini
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m [1/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/miniconda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /opt/miniconda/lib/python3.10/site-packages/torch/include -isystem /opt/miniconda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/miniconda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/miniconda/lib/python3.10/site-packages/torch/include/THC -isystem /opt/miniconda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /opt/miniconda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[36m(RayTrainWorker pid=951, ip=10.67.132.117)[0m [2/3] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/miniconda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /opt/miniconda/lib/python3.10/site-packages/torch/include -isystem /opt/miniconda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/miniconda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/miniconda/lib/python3.10/site-packages/torch/include/THC -isystem /opt/miniconda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /opt/miniconda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o
[36m(RayTrainWorker pid=949, ip=10.67.132.117)[0m Time to load cpu_adam op: 25.199923515319824 seconds[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=951, ip=10.67.132.117)[0m [3/3] c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/opt/miniconda/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m ini-work
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:944:3818 [0] NCCL INFO Channel 05/0 : 8[0] -> 15[7] via P2P/CUMEM[32m [repeated 321x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:3124 [3] NCCL INFO Connected all rings[32m [repeated 11x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:3124 [3] NCCL INFO Connected all trees[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:3124 [3] NCCL INFO NVLS comm 0x7ee1f93fb0f0 headRank -1 nHeads 1 buffSize 8388608 memSize 2097152 nvlsPerRankSize 335544320 nvlsTotalSize 335544320[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:3124 [3] NCCL INFO Connected NVLS tree[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:3124 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:3124 [3] NCCL INFO 16 coll channels, 0 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:947:3124 [3] NCCL INFO comm 0x7ee1f93fb0f0 rank 11 nranks 16 cudaDev 3 nvmlDev 3 busId 86000 commId 0xb28cabf3135b26f2 - Init COMPLETE[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] NCCL INFO Channel 15/0 : 0[0] -> 8[0] [send] via NET/Socket/0[32m [repeated 16x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 15/0 : 9[1] -> 0[0] [receive] via NET/Socket/0[32m [repeated 31x across cluster][0m
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:944:3818 [0] NCCL INFO
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:946:3816 [2] NCCL INFO Channel 0
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m ini-worker-ray-worker-p5
[36m(RayTrainWorker pid=948, ip=10.67.132.117)[0m ini-worker-ray-maleksk-bern
[36m(RayTrainWorker pid=945, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:945:3824 [1] NCCL INF
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ra
[36m(RayTrainWorker pid=952, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:952:3837 [2]
[36m(RayTrainWorker pid=953, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:953:3835 [3]
[36m(RayTrainWorker pid=956, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:
[36m(RayTrainWorker pid=954, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:
[36m(RayTrainWorker pid=951, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:95
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m ini-worker-ray-maleksk-berni
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m Parameter Offload: Total persistent parameters: 266240 in 65 params
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:957:3839 [7] NCCL INFO Using non-device net plugin version 0[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:957:3839 [7] NCCL INFO Using network Socket[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:957:3839 [7] NCCL INFO bootstrapSplit: comm 0x7fb90bbb1f60 parent 0x7fb909403e40 rank 7 nranks 16 color 1197013201 key 7 prev 6 next 8 - DONE[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:957:3839 [7] NCCL INFO comm 0x7fb90bbb1f60 rank 7 nranks 16 cudaDev 7 nvmlDev 7 busId ca000 commId 0xb2d16673d0abaf0a - Init START[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:957:3839 [7] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/amazon/efa/share/aws-ofi-nccl/xml/p5.48xl-topo.xml[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:957:3839 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:957:3839 [7] NCCL INFO NVLS multicast support is available on dev 7[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:957:3839 [7] NCCL INFO comm 0x7fb90bbb1f60 rank 7 nRanks 16 nNodes 2 localRanks 8 localRank 7 MNNVL 0[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:957:3839 [7] NCCL INFO NVLS Head 0: 0 8[32m [repeated 7x across cluster][0m
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:957:3839 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:957:3839 [7] NCCL INFO P2P Chunksize set to 524288[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:957:3839 [7] NCCL INFO Channel 15/0 : 7[7] -> 6[6] via P2P/CUMEM[32m [repeated 320x across cluster][0m
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:957:3839 [7] NCCL INFO Connected all rings[32m [repeated 16x across cluster][0m
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:957:3839 [7] NCCL INFO Connected all trees[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:957:3839 [7] NCCL INFO NVLS comm 0x7fb90bbb1f60 headRank -1 nHeads 1 buffSize 8388608 memSize 2097152 nvlsPerRankSize 335544320 nvlsTotalSize 335544320[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:957:3839 [7] NCCL INFO Connected NVLS tree[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:957:3839 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:957:3839 [7] NCCL INFO 16 coll channels, 0 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 15/0 : 1[1] -> 8[0] [send] via NET/Socket/0[32m [repeated 41x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 15/0 : 8[0] -> 0[0] [receive] via NET/Socket/0[32m [repeated 26x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.117)[0m ini-worker-ray-maleksk-bern[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m | Name | Type | Params | Params per Device | Mode
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m -----------------------------------------------------------------------
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m 0 | model | LlamaForCausalLM | 6.7 B | 421 M | train
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m -----------------------------------------------------------------------
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m 6.7 B Trainable params
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m 0 Non-trainable params
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m 6.7 B Total params
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m 26,953.794 Total estimated model params size (MB)
[36m(RayTrainWorker pid=949, ip=10.67.132.117)[0m Loading extension module cpu_adam...[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=947, ip=10.67.132.117)[0m [rank11]:[W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)[32m [repeated 15x across cluster][0m
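The [W Utils.hpp] line above is only a naming deprecation. If the old variable is being set by this job's own runtime environment (an assumption; Ray Train may set it internally), switching to the newer name would silence it. A minimal sketch:

```python
import os

# Newer PyTorch releases prefer the TORCH_-prefixed name for async NCCL
# error handling; drop the deprecated variable and set the replacement.
os.environ.pop("NCCL_ASYNC_ERROR_HANDLING", None)
os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")
```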
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Training: | | 0/? [00:00<?, ?it/s]
Training: 0%| | 0/24 [00:00<?, ?it/s]
Epoch 0: 0%| | 0/24 [00:00<?, ?it/s]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m /opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=23` in the `DataLoader` to improve performance.
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m /opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py:298: The number of training batches (24) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
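Both messages above are tuning hints rather than errors. A minimal sketch of how they would typically be addressed; the placeholder dataset, batch size, and worker count below are illustrative and not taken from the actual training script:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

# Placeholder standing in for the tokenized dataset used in this run.
train_dataset = TensorDataset(torch.zeros(192, 8, dtype=torch.long))

# More dataloader workers so input preparation is less likely to bottleneck
# the 8 GPUs per node (23 matches the hint above for this machine).
train_loader = DataLoader(train_dataset, batch_size=8, num_workers=23, pin_memory=True)

# Log more often than the default of 50 steps, since one epoch here is only 24 batches.
trainer = pl.Trainer(max_epochs=1, log_every_n_steps=10)
```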
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 4%|▍ | 1/24 [00:44<16:53, 0.02it/s]
Epoch 0: 4%|▍ | 1/24 [00:44<16:53, 0.02it/s, v_num=0_1, train_loss_step=2.310]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 8%|▊ | 2/24 [01:14<13:38, 0.03it/s, v_num=0_1, train_loss_step=2.310]
Epoch 0: 8%|▊ | 2/24 [01:14<13:38, 0.03it/s, v_num=0_1, train_loss_step=2.290]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 12%|█▎ | 3/24 [01:44<12:13, 0.03it/s, v_num=0_1, train_loss_step=2.290]
Epoch 0: 12%|█▎ | 3/24 [01:44<12:13, 0.03it/s, v_num=0_1, train_loss_step=2.290]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 17%|█▋ | 4/24 [02:15<11:17, 0.03it/s, v_num=0_1, train_loss_step=2.290]
Epoch 0: 17%|█▋ | 4/24 [02:15<11:17, 0.03it/s, v_num=0_1, train_loss_step=2.330]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 21%|██ | 5/24 [02:45<10:29, 0.03it/s, v_num=0_1, train_loss_step=2.330]
Epoch 0: 21%|██ | 5/24 [02:45<10:29, 0.03it/s, v_num=0_1, train_loss_step=2.350]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 25%|██▌ | 6/24 [03:16<09:49, 0.03it/s, v_num=0_1, train_loss_step=2.350]
Epoch 0: 25%|██▌ | 6/24 [03:16<09:49, 0.03it/s, v_num=0_1, train_loss_step=2.360]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 29%|██▉ | 7/24 [03:46<09:10, 0.03it/s, v_num=0_1, train_loss_step=2.360]
Epoch 0: 29%|██▉ | 7/24 [03:46<09:10, 0.03it/s, v_num=0_1, train_loss_step=2.270]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 33%|███▎ | 8/24 [04:17<08:34, 0.03it/s, v_num=0_1, train_loss_step=2.270]
Epoch 0: 33%|███▎ | 8/24 [04:17<08:34, 0.03it/s, v_num=0_1, train_loss_step=2.460]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 38%|███▊ | 9/24 [04:47<07:59, 0.03it/s, v_num=0_1, train_loss_step=2.460]
Epoch 0: 38%|███▊ | 9/24 [04:47<07:59, 0.03it/s, v_num=0_1, train_loss_step=2.630]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 42%|████▏ | 10/24 [05:17<07:24, 0.03it/s, v_num=0_1, train_loss_step=2.630]
Epoch 0: 42%|████▏ | 10/24 [05:17<07:24, 0.03it/s, v_num=0_1, train_loss_step=2.350]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 46%|████▌ | 11/24 [05:48<06:51, 0.03it/s, v_num=0_1, train_loss_step=2.350]
Epoch 0: 46%|████▌ | 11/24 [05:48<06:51, 0.03it/s, v_num=0_1, train_loss_step=2.230]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 50%|█████ | 12/24 [06:18<06:18, 0.03it/s, v_num=0_1, train_loss_step=2.230]
Epoch 0: 50%|█████ | 12/24 [06:18<06:18, 0.03it/s, v_num=0_1, train_loss_step=2.280]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 54%|█████▍ | 13/24 [06:49<05:46, 0.03it/s, v_num=0_1, train_loss_step=2.280]
Epoch 0: 54%|█████▍ | 13/24 [06:49<05:46, 0.03it/s, v_num=0_1, train_loss_step=2.200]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 58%|█████▊ | 14/24 [07:19<05:13, 0.03it/s, v_num=0_1, train_loss_step=2.200]
Epoch 0: 58%|█████▊ | 14/24 [07:19<05:13, 0.03it/s, v_num=0_1, train_loss_step=2.420]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 62%|██████▎ | 15/24 [07:49<04:41, 0.03it/s, v_num=0_1, train_loss_step=2.420]
Epoch 0: 62%|██████▎ | 15/24 [07:49<04:41, 0.03it/s, v_num=0_1, train_loss_step=2.380]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 67%|██████▋ | 16/24 [08:20<04:10, 0.03it/s, v_num=0_1, train_loss_step=2.380]
Epoch 0: 67%|██████▋ | 16/24 [08:20<04:10, 0.03it/s, v_num=0_1, train_loss_step=2.240]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 71%|███████ | 17/24 [08:50<03:38, 0.03it/s, v_num=0_1, train_loss_step=2.240]
Epoch 0: 71%|███████ | 17/24 [08:50<03:38, 0.03it/s, v_num=0_1, train_loss_step=2.210]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 75%|███████▌ | 18/24 [09:20<03:06, 0.03it/s, v_num=0_1, train_loss_step=2.210]
Epoch 0: 75%|███████▌ | 18/24 [09:20<03:06, 0.03it/s, v_num=0_1, train_loss_step=2.520]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 79%|███████▉ | 19/24 [09:51<02:35, 0.03it/s, v_num=0_1, train_loss_step=2.520]
Epoch 0: 79%|███████▉ | 19/24 [09:51<02:35, 0.03it/s, v_num=0_1, train_loss_step=2.410]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 83%|████████▎ | 20/24 [10:21<02:04, 0.03it/s, v_num=0_1, train_loss_step=2.410]
Epoch 0: 83%|████████▎ | 20/24 [10:21<02:04, 0.03it/s, v_num=0_1, train_loss_step=2.570]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 88%|████████▊ | 21/24 [10:51<01:33, 0.03it/s, v_num=0_1, train_loss_step=2.570]
Epoch 0: 88%|████████▊ | 21/24 [10:51<01:33, 0.03it/s, v_num=0_1, train_loss_step=2.320]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 92%|█████████▏| 22/24 [11:21<01:01, 0.03it/s, v_num=0_1, train_loss_step=2.320]
Epoch 0: 92%|█████████▏| 22/24 [11:21<01:01, 0.03it/s, v_num=0_1, train_loss_step=2.400]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 96%|█████████▌| 23/24 [11:52<00:30, 0.03it/s, v_num=0_1, train_loss_step=2.400]
Epoch 0: 96%|█████████▌| 23/24 [11:52<00:30, 0.03it/s, v_num=0_1, train_loss_step=2.410]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 100%|██████████| 24/24 [12:22<00:00, 0.03it/s, v_num=0_1, train_loss_step=2.410]
Epoch 0: 100%|██████████| 24/24 [12:22<00:00, 0.03it/s, v_num=0_1, train_loss_step=2.290]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m /opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('train_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
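The recommendation above concerns how epoch-level metrics are aggregated across the 16 ranks. A hedged sketch of the suggested change inside the LightningModule; the class and attribute names are placeholders, not the reporter's code:

```python
import pytorch_lightning as pl

class LitCausalLM(pl.LightningModule):
    """Sketch showing only the logging change; model/optimizer setup omitted."""

    def training_step(self, batch, batch_idx):
        loss = self.model(**batch).loss  # self.model: the wrapped LlamaForCausalLM
        # sync_dist=True all-reduces the epoch-level value across workers
        # instead of reporting only rank 0's local average.
        self.log("train_loss", loss, on_step=True, on_epoch=True, sync_dist=True)
        return loss
```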
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m Syncing files from /tmp/ray_results/demo_llama2_20240723-154138/lightning to s3://ray-training-output-ue1/demo_llama2_20240723-154138/lightning
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m The user-provided path /tmp/ray_results/demo_llama2_20240723-154138/lightning does not exist.
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m Time Taken to Sync: 0.581810712814331
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m /opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/strategies/deepspeed.py:634: When saving the DeepSpeed Stage 3 checkpoint, each worker will save a shard of the checkpoint within a directory. If a single file is required after training, see https://lightning.ai/docs/pytorch/stable/advanced/model_parallel.html#deepspeed-zero-stage-3-single-file for instructions.
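As that message notes, each worker writes its own ZeRO Stage 3 shard. If a single consolidated file is wanted afterwards, Lightning ships a conversion utility; a hedged sketch, with illustrative local paths pointing at a downloaded copy of the checkpoint directory shown below:

```python
from pytorch_lightning.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

# Merge the per-rank DeepSpeed ZeRO Stage 3 shards into one fp32 checkpoint file.
convert_zero_checkpoint_to_fp32_state_dict(
    checkpoint_dir="checkpoint_000000/checkpoint",  # sharded checkpoint directory (example path)
    output_file="llama2_consolidated.pt",           # single-file output (example path)
)
```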
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m Checkpoint successfully created at: Checkpoint(filesystem=py::fsspec+s3, path=ray-training-output-ue1/demo_llama2_20240723-154138/TorchTrainer_0e52d_00000_0_2024-07-23_15-41-41/checkpoint_000000)
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m The user-provided path /tmp/ray_results/demo_llama2_20240723-154138/lightning does not exist.[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=957, ip=10.67.132.202)[0m Checkpoint successfully created at: Checkpoint(filesystem=py::fsspec+s3, path=ray-training-output-ue1/demo_llama2_20240723-154138/TorchTrainer_0e52d_00000_0_2024-07-23_15-41-41/checkpoint_000000)[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=948, ip=10.67.132.117)[0m Checkpoint successfully created at: Checkpoint(filesystem=py::fsspec+s3, path=ray-training-output-ue1/demo_llama2_20240723-154138/TorchTrainer_0e52d_00000_0_2024-07-23_15-41-41/checkpoint_000000)[32m [repeated 10x across cluster][0m
Training finished iteration 1 at 2024-07-23 16:17:50. Total running time: 36min 9s
╭─────────────────────────────────────────╮
│ Training result │
├─────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000000 │
│ time_this_iter_s 1142.8118 │
│ time_total_s 1142.8118 │
│ training_iteration 1 │
│ epoch 0 │
│ step 24 │
│ train_loss 2.3574 │
│ train_loss_epoch 2.3574 │
│ train_loss_step 2.29294 │
╰─────────────────────────────────────────╯
Training saved a checkpoint for iteration 1 at: (py::fsspec+s3)ray-training-output-ue1/demo_llama2_20240723-154138/TorchTrainer_0e52d_00000_0_2024-07-23_15-41-41/checkpoint_000000
2024-07-23 16:17:51,007 WARNING experiment_state.py:206 -- Experiment state snapshotting has been triggered multiple times in the last 5.0 seconds and may become a bottleneck. A snapshot is forced if `CheckpointConfig(num_to_keep)` is set, and a trial has checkpointed >= `num_to_keep` times since the last snapshot.
You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.
You can suppress this warning by setting the environment variable TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a smaller value than the current threshold (5.0). Set it to 0 to completely suppress this warning.
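The warning lists its own two remedies. A hedged sketch of both in terms of the Ray Train run configuration; the keep-count and the decision to disable the warning are arbitrary examples, not values from this job:

```python
import os
from ray.train import CheckpointConfig, RunConfig

# Remedy 1: raise num_to_keep so a forced experiment-state snapshot is not
# triggered after nearly every checkpoint.
run_config = RunConfig(checkpoint_config=CheckpointConfig(num_to_keep=5))

# Remedy 2: suppress the warning entirely (0 disables it, per the message above).
os.environ["TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S"] = "0"
```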
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m Syncing files from /tmp/ray_results/demo_llama2_20240723-154138/lightning to s3://ray-training-output-ue1/demo_llama2_20240723-154138/lightning[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m Time Taken to Sync: 0.5941030979156494
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m The user-provided path /tmp/ray_results/demo_llama2_20240723-154138/lightning does not exist.
[36m(RayTrainWorker pid=949, ip=10.67.132.117)[0m Checkpoint successfully created at: Checkpoint(filesystem=py::fsspec+s3, path=ray-training-output-ue1/demo_llama2_20240723-154138/TorchTrainer_0e52d_00000_0_2024-07-23_15-41-41/checkpoint_000000)
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m Time Taken to Sync: 0.5892753601074219
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
Epoch 0: 100%|██████████| 24/24 [15:53<00:00, 0.03it/s, v_num=0_1, train_loss_step=2.290, train_loss_epoch=2.360]
Epoch 0: 100%|██████████| 24/24 [15:53<00:00, 0.03it/s, v_num=0_1, train_loss_step=2.290, train_loss_epoch=2.360]
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m /opt/miniconda/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:382: `ModelCheckpoint(monitor='val_loss')` could not find the monitored key in the returned metrics: ['train_loss', 'train_loss_step', 'train_loss_epoch', 'epoch', 'step']. HINT: Did you call `log('val_loss', value)` in the `LightningModule`?
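That callback warning means no `val_loss` was logged anywhere in the run. A hedged sketch of the two usual fixes; the metric and method names are placeholders rather than the reporter's code:

```python
from pytorch_lightning.callbacks import ModelCheckpoint

# Fix 1: monitor a metric that this run actually logs (see the list in the warning).
checkpoint_cb = ModelCheckpoint(monitor="train_loss_epoch", mode="min")

# Fix 2: log `val_loss` from a validation_step in the LightningModule, e.g.
#   def validation_step(self, batch, batch_idx):
#       loss = self.model(**batch).loss
#       self.log("val_loss", loss, sync_dist=True)
```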
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m `Trainer.fit` stopped: `max_epochs=1` reached.
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m
Training completed after 1 iterations at 2024-07-23 16:17:58. Total running time: 36min 16s
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m y-worker-p5:950:3833 [0] NCCL INFO Channel 09/0 : 0[0] -> 8[0] [send] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 10/0 : 0[0] -> 8[0] [send] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 11/0 : 0[0] -> 8[0] [send] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 12/0 : 0[0] -> 8[0] [send] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 13/0 : 0[0] -> 8[0] [send] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 14/0 : 0[0] -> 8[0] [send] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Channel 15/0 : 0[0] -> 8[0] [send] via NET/Socket/0
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Connected all trees
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO NVLS comm 0x7f2529911320 headRank 0 nHeads 1 buffSize 8388608 memSize 2097152 nvlsPerRankSize 335544320 nvlsTotalSize 335544320
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO Connected NVLS tree
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO 16 coll channels, 0 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3833 [0] NCCL INFO comm 0x7f2529911320 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 53000 commId 0xb2d16673d0abaf0a - Init COMPLETE
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/CUMEM
[36m(RayTrainWorker pid=951, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:951:3842 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/CUMEM
[36m(RayTrainWorker pid=945, ip=10.67.132.117)[0m O Channel 05/0 : 9[1] -> 10[2] via P2P/CUMEM
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m Channel 10/0 : 0[0] -> 8[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:944:3818 [0] NCCL INFO Channel 11/0 : 0[0] -> 8[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:944:3818 [0] NCCL INFO Channel 12/0 : 0[0] -> 8[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:944:3818 [0] NCCL INFO Channel 13/0 : 0[0] -> 8[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:944:3818 [0] NCCL INFO Channel 14/0 : 0[0] -> 8[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:944:3818 [0] NCCL INFO Channel 15/0 : 0[0] -> 8[0] [receive] via NET/Socket/0
[36m(RayTrainWorker pid=946, ip=10.67.132.117)[0m 6/0 : 10[2] -> 11[3] via P2P/CUMEM
2024-07-23 16:17:59,007 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to 'ray-training-output-ue1/demo_llama2_20240723-154138' in 0.6700s.
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m Syncing files from /tmp/ray_results/demo_llama2_20240723-154138/lightning to s3://ray-training-output-ue1/demo_llama2_20240723-154138/lightning
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m Time Taken to Sync: 0.5883631706237793
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m ini-worker-ray-worker-p5:944:3818 [0] NCCL INFO Channel 15/0 : 8[0] -> 0[0] [send] via NET/Socket/0[32m [repeated 16x across cluster][0m
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:955:3838 [4] NCCL INFO Connected all trees[32m [repeated 12x across cluster][0m
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:955:3838 [4] NCCL INFO NVLS comm 0x7f45efbb3850 headRank -1 nHeads 1 buffSize 8388608 memSize 2097152 nvlsPerRankSize 335544320 nvlsTotalSize 335544320[32m [repeated 12x across cluster][0m
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:955:3838 [4] NCCL INFO Connected NVLS tree[32m [repeated 12x across cluster][0m
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:955:3838 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512[32m [repeated 12x across cluster][0m
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:955:3838 [4] NCCL INFO 16 coll channels, 0 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer[32m [repeated 12x across cluster][0m
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:955:3838 [4] NCCL INFO comm 0x7f45efbb3850 rank 4 nranks 16 cudaDev 4 nvmlDev 4 busId 97000 commId 0xb2d16673d0abaf0a - Init COMPLETE[32m [repeated 14x across cluster][0m
[36m(RayTrainWorker pid=955, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:955:3838 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM[32m [repeated 115x across cluster][0m
[36m(RayTrainWorker pid=944, ip=10.67.132.117)[0m The user-provided path /tmp/ray_results/demo_llama2_20240723-154138/lightning does not exist.[32m [repeated 3x across cluster][0m
Hi @awsankur,
Thanks for the links. I believed the Docker images I created were similar; I will try again and update you soon.
In the full log above, I see
[36m(RayTrainWorker pid=950, ip=10.67.132.202)[0m ini-worker-ray-worker-p5:950:3034 [0] nccl_net_ofi_rdma_init:5966 NCCL WARN NET/OFI Wrong number of NICs for device 1. Expected 4 but got 3
This means that not all 32 EFA devices are available to the aws-ofi-nccl plugin. It would be helpful to know how the containers are being launched, and specifically whether all 32 EFA devices are being exposed to them.
To check whether all 32 EFA devices are available, describe the node; in the Allocatable section you should see:
vpc.amazonaws.com/efa: 32
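For example, with EFA exposed correctly, `kubectl get node <node-name> -o yaml` should report something like the following excerpt (the node name is a placeholder; extended resources are shown as quoted string quantities):

    # excerpt of the Node object status on a p5.48xlarge with all EFA interfaces exposed
    status:
      allocatable:
        nvidia.com/gpu: "8"
        vpc.amazonaws.com/efa: "32"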
Yes, that is correct. We are assigning 30 EFA devices; when we assign all 32 devices, the pods lose internet connection. We are launching the containers using KubeRay. Here is the resource request for the pods:
resources:
  limits: &id002
    cpu: 190
    memory: 1900Gi
    vpc.amazonaws.com/efa: 30
    hugepages-2Mi: 5120Mi
    nvidia.com/gpu: 8
  requests: *id002
Do we need to make all 32 devices available to the pod?
@kamal-rahimi Yes, the only supported configuration of the aws-ofi-nccl plugin on P5 is using all 32 EFAs.
Not sure why you lose Internet connection when using all 32 devices; that's not expected.
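For reference, here is a minimal sketch of the same container resources with all 32 EFA interfaces requested (every value other than the EFA count is copied from your snippet above; adjust to your own manifest):

    resources:
      limits: &id002
        cpu: 190
        memory: 1900Gi
        vpc.amazonaws.com/efa: 32   # request all 32 EFA interfaces on the p5.48xlarge
        hugepages-2Mi: 5120Mi
        nvidia.com/gpu: 8
      requests: *id002              # YAML anchor: requests mirror the limits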
Thank you for the quick response. It is working now
The problem is resolved. Thank you
We have enabled EFA on our Kubernetes cluster, and the nccl-tests in this repo pass without issue; we see bus bandwidth of up to 20 Gbps on a p5.48xlarge instance. However, when using PyTorch, NCCL fails to configure EFA with this error:
nccl_net_ofi_create_plugin:1067 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
We have been trying this Dockerfile: https://github.com/aws-samples/awsome-distributed-training/blob/main/2.ami_and_containers/containers/pytorch/0.nvcr-pytorch-aws.dockerfile
Also, when we run nccl-tests using the following Dockerfile, the tests pass: https://github.com/aws-samples/awsome-distributed-training/blob/main/micro-benchmarks/nccl-tests/nccl-tests.Dockerfile
But when we install PyTorch, we see the error:
NET/OFI aws-ofi-nccl initialization failed
Here is more info in the log: