OpenBMB / BMTrain

Efficient Training (including pre-training and fine-tuning) for Big Models
Apache License 2.0

stuck during synchronize #84

Closed Smu-Tan closed 1 year ago

Smu-Tan commented 1 year ago

Hi,

When using BMCook together with BMTrain, I ran into a bug where the second bmtrain.synchronize() always gets stuck. Do you have any idea what might cause this?

Below is the code:

import os
import json
import torch
import random
import time
import bmtrain as bmt
from data import MMapIndexedDataset, Dataset
from bmcook import CookTrainer
from bmcook.utils.config import ConfigParser
from bmcook.utils.arguments import parse_args
from pathlib import Path

bmt.init_distributed()
args = parse_args()
save_dir = Path(args.save_dir)
ckpt_dir = save_dir / 'checkpoints'
os.makedirs(ckpt_dir, exist_ok=True)
json.dump(vars(args), open(save_dir / 'train_args.json', 'w'), indent=2)
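# Note: config_map and model_map are assumed to be dictionaries defined elsewhere
# in the script, mapping the model name to its config/model classes.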
model_config = config_map[args.model].from_pretrained(args.model)
model = model_map[args.model].from_pretrained(args.model, config=model_config)
# teacher model has the same config as the student model
teacher = model_map[args.model].from_pretrained(args.model, config=model_config)
bmt.synchronize()  # this works

...

CookTrainer.set_compression(config, model, optimizer, teacher)  # this step calls another bmt.synchronize(), which is where it gets stuck
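To narrow down which rank enters but never leaves the second synchronize, a minimal per-rank tracing sketch can be used. This is only a sketch, not the reporter's code, and it assumes bmt.rank() is available as in the BMTrain examples:

import bmtrain as bmt

def traced_synchronize(tag):
    # Print before and after the collective, flushing immediately so the last
    # line from a hung rank is still visible in the log.
    print(f"[rank {bmt.rank()}] entering synchronize: {tag}", flush=True)
    bmt.synchronize()
    print(f"[rank {bmt.rank()}] left synchronize: {tag}", flush=True)

# Usage: call traced_synchronize("after model load"), traced_synchronize("set_compression"),
# etc. around the suspect steps and compare the per-rank output. Running with
# NCCL_DEBUG=INFO set in the environment also helps show which collective hangs.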
a710128 commented 1 year ago

I have no idea. Did you find out which rank was stuck?

Smu-Tan commented 1 year ago

> I have no idea. Did you find out which rank was stuck?

I used two GPUs and both of them got stuck. Below are some of the warnings:

master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:30123 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:30123 (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/synchronize.py:15: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  nccl.allReduce(barrier.storage(), barrier.storage(), 'sum', config['comm'])
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/synchronize.py:15: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  nccl.allReduce(barrier.storage(), barrier.storage(), 'sum', config['comm'])
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/nccl/__init__.py:109: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  assert src.is_cuda and dst.is_cuda
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/nccl/__init__.py:111: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  sendbuff = src.data_ptr()
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/nccl/__init__.py:112: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  recvbuff = dst.data_ptr()
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/nccl/__init__.py:113: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  count = src.size()
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/nccl/__init__.py:117: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  assert src.size() == dst.size(), "Buffer size not aligned"
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/nccl/__init__.py:109: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  assert src.is_cuda and dst.is_cuda
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/nccl/__init__.py:111: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  sendbuff = src.data_ptr()
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/nccl/__init__.py:112: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  recvbuff = dst.data_ptr()
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/nccl/__init__.py:113: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  count = src.size()
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/nccl/__init__.py:117: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  assert src.size() == dst.size(), "Buffer size not aligned"
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/parameter.py:46: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  cuda_storage = cuda_tensor.storage_type()(cuda_storage_size)
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/parameter.py:46: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  cuda_storage = cuda_tensor.storage_type()(cuda_storage_size)
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/torch/storage.py:959: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  if self.device.type not in ['cpu', 'cuda']:
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/torch/storage.py:962: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  module = torch if self.device.type == 'cpu' else torch.cuda
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/torch/storage.py:959: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  if self.device.type not in ['cpu', 'cuda']:
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/torch/storage.py:962: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  module = torch if self.device.type == 'cpu' else torch.cuda
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/block_layer.py:333: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage_type = storage_type_cuda(param.storage_type())
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/block_layer.py:333: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage_type = storage_type_cuda(param.storage_type())
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/block_layer.py:364: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage_param_buffer = storage_type(partition_size)
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/block_layer.py:367: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  device = storage_param_buffer.device
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/block_layer.py:364: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage_param_buffer = storage_type(partition_size)
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/block_layer.py:367: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  device = storage_param_buffer.device
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/parameter.py:95: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  partition_size = value.storage().size()
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/parameter.py:98: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage = value.storage_type()(global_size)
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/parameter.py:95: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  partition_size = value.storage().size()
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/parameter.py:101: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  value.storage(),
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/nccl/__init__.py:218: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  assert src.is_cuda and dst.is_cuda
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/parameter.py:98: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage = value.storage_type()(global_size)
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/nccl/__init__.py:220: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  sendbuff = src.data_ptr()
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/nccl/__init__.py:221: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  recvbuff = dst.data_ptr()
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/nccl/__init__.py:222: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  sendcount = src.size()
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/nccl/__init__.py:225: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  assert dst.size() % sendcount == 0, "Buffer size not aligned"
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/parameter.py:101: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  value.storage(),
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/nccl/__init__.py:218: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  assert src.is_cuda and dst.is_cuda
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/nccl/__init__.py:220: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  sendbuff = src.data_ptr()
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/nccl/__init__.py:221: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  recvbuff = dst.data_ptr()
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/nccl/__init__.py:222: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  sendcount = src.size()
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/nccl/__init__.py:152: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  recvbuff = dst.data_ptr()
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/nccl/__init__.py:153: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  count = src.size()
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/nccl/__init__.py:156: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  assert dst.size() == src.size(), "Buffer size not aligned"
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/store.py:88: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  byte_tensor.storage(),
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/store.py:89: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  byte_tensor.storage(),
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/store.py:104: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  byte_tensor.storage(),
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/store.py:105: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  byte_tensor.storage(),
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/store.py:133: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  tmp_shape.storage(),
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/store.py:134: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  tmp_shape.storage(),
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/store.py:133: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  tmp_shape.storage(),
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/store.py:134: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  tmp_shape.storage(),
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/store.py:160: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  output_param.storage(),
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/store.py:161: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  output_param.storage(),
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/store.py:153: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  input_param.storage(),
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/store.py:154: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  output_param.storage(),
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/block_layer.py:508: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  torch.tensor([], dtype=d_dtype, device=d_device).set_(contiguous_param.storage(), offset_st, (offset_end - offset_st,))[:]
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/block_layer.py:507: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  torch.tensor([], dtype=d_dtype, device=d_device).set_(self._storage_params[kw_name].storage(), to_offset_st, (to_offset_end - to_offset_st,))[:] = \
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/block_layer.py:508: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  torch.tensor([], dtype=d_dtype, device=d_device).set_(contiguous_param.storage(), offset_st, (offset_end - offset_st,))[:]
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/block_layer.py:507: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  torch.tensor([], dtype=d_dtype, device=d_device).set_(self._storage_params[kw_name].storage(), to_offset_st, (to_offset_end - to_offset_st,))[:] = \
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/block_layer.py:151: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage_type = local_param.storage_type()
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/block_layer.py:153: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  self._param_buffer[kw] = storage_type(val["partition_size"] * config["world_size"])
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/block_layer.py:151: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage_type = local_param.storage_type()
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/block_layer.py:153: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  self._param_buffer[kw] = storage_type(val["partition_size"] * config["world_size"])
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/block_layer.py:154: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  self._param_tensor[kw] = torch.tensor([], dtype=self._param_buffer[kw].dtype, device=self._param_buffer[kw].device).set_(self._param_buffer[kw])
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/block_layer.py:154: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  self._param_tensor[kw] = torch.tensor([], dtype=self._param_buffer[kw].dtype, device=self._param_buffer[kw].device).set_(self._param_buffer[kw])
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/block_layer.py:163: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  self.block._storage_params[kw].storage(),
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/block_layer.py:163: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  self.block._storage_params[kw].storage(),
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/block_layer.py:187: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  device = self._param_buffer[kw_name].device
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/block_layer.py:187: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  device = self._param_buffer[kw_name].device
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/block_layer.py:257: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  param["parameter"].data = torch.tensor([], dtype=dtype, device=device).set_(self.block._storage_params[kw_name].storage(), begin, end)
/home/stan1/anaconda3/envs/pruning/lib/python3.9/site-packages/bmtrain/block_layer.py:257: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  param["parameter"].data = torch.tensor([], dtype=dtype, device=device).set_(self.block._storage_params[kw_name].storage(), begin, end)
a710128 commented 1 year ago
> [W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:30123 (errno: 98 - Address already in use).
> [W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:30123 (errno: 98 - Address already in use).
> [E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.

It looks like torchrun didn't start successfully.
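Since both the IPv6 and IPv4 binds on port 30123 failed with errno 98, another process was likely still holding the rendezvous port from an earlier run. A small check (a hypothetical helper, not part of BMTrain or torchrun) can confirm whether the port is free before launching:

import socket

def port_is_free(port, host="0.0.0.0"):
    # Try to bind the port; an OSError with "Address already in use" (errno 98)
    # means another process still holds it.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

if __name__ == "__main__":
    print("port 30123 free:", port_is_free(30123))

If the port is taken, launching torchrun with a different, free master port avoids the bind failure.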

Smu-Tan commented 1 year ago
> [W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:30123 (errno: 98 - Address already in use).
> [W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:30123 (errno: 98 - Address already in use).
> [E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.

> It looks like torchrun didn't start successfully.

Could be. However, I can successfully run PyTorch DDP code based on dist.init_process_group(), even though it prints similar warnings. Here's the output from a toy example:

/var/spool/slurm/job564246/slurm_script: line 15: conda: command not found
[W socket.cpp:426] [c10d] The server socket cannot be initialized on [::]:3456 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [ilps-cn115.ivi_ilps.local]:3456 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [ilps-cn115.ivi_ilps.local]:3456 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [ilps-cn115-d.ivi_ilps.data]:3456 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [ilps-cn115-d.ivi_ilps.data]:3456 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [::ffff:192.168.70.215]:3456 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [::ffff:192.168.70.215]:3456 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [ilps-cn115.ivi_ilps.local]:3456 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [ilps-cn115-d.ivi_ilps.data]:3456 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [::ffff:192.168.70.215]:3456 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [ilps-cn115.ivi_ilps.local]:3456 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [ilps-cn115-d.ivi_ilps.data]:3456 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [::ffff:192.168.70.215]:3456 (errno: 97 - Address family not supported by protocol).
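For reference, a minimal init_process_group toy script of the kind mentioned above might look like the following. This is a hypothetical sketch, not the reporter's actual script; it only verifies that every rank can pass a barrier when launched with torchrun:

import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK; init_process_group picks
    # them up from the environment via the default env:// init method.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.barrier()
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} passed the barrier", flush=True)
    dist.destroy_process_group()

if __name__ == "__main__":
    main()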