RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License
3.29k stars 594 forks

[🐛BUG] Data loading fails after distributed training #2063

Open zw81929 opened 2 weeks ago

zw81929 commented 2 weeks ago

The error message is as follows:

Traceback (most recent call last):
  File "/data1/bert4rec/bert4rec-main/scripts/bole/loaddata_run_product.py", line 5, in <module>
    config, model, dataset, train_data, valid_data, test_data = load_data_and_model(
                                                                ^^^^^^^^^^^^^^^^^^^^
  File "/data1/bert4rec/bert4rec-main/scripts/bole/recbole/quick_start/quick_start.py", line 259, in load_data_and_model
    train_data, valid_data, test_data = data_preparation(config, dataset)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data1/bert4rec/bert4rec-main/scripts/bole/recbole/data/utils.py", line 174, in data_preparation
    train_data = get_dataloader(config, "train")(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data1/bert4rec/bert4rec-main/scripts/bole/recbole/data/dataloader/general_dataloader.py", line 45, in __init__
    super().__init__(config, dataset, sampler, shuffle=shuffle)
  File "/data1/bert4rec/bert4rec-main/scripts/bole/recbole/data/dataloader/abstract_dataloader.py", line 130, in __init__
    super().__init__(config, dataset, sampler, shuffle=shuffle)
  File "/data1/bert4rec/bert4rec-main/scripts/bole/recbole/data/dataloader/abstract_dataloader.py", line 60, in __init__
    index_sampler = torch.utils.data.distributed.DistributedSampler(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data1/bert4rec/venv/lib/python3.11/site-packages/torch/utils/data/distributed.py", line 68, in __init__
    num_replicas = dist.get_world_size()
                   ^^^^^^^^^^^^^^^^^^^^^
  File "/data1/bert4rec/venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1769, in get_world_size
    return _get_group_size(group)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/data1/bert4rec/venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 841, in _get_group_size
    default_pg = _get_default_group()
                 ^^^^^^^^^^^^^^^^^^^^
  File "/data1/bert4rec/venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1008, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.

This is the part that fails.
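One possible workaround for the traceback above, sketched under the assumption that the saved config has `single_spec=False` but the loading script runs as a single process: initialize a one-process `gloo` group before calling `load_data_and_model`, so that `DistributedSampler` can query the world size. `MASTER_ADDR`/`MASTER_PORT` are the standard `torch.distributed` environment variables; the values below are placeholders for a local run.

```python
import os
import torch.distributed as dist

# Hypothetical workaround, not RecBole API: give torch.distributed the
# rendezvous info it needs for the default "env://" init method.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Create a single-process default group so later calls to
# dist.get_world_size() (made inside DistributedSampler) succeed.
if not dist.is_initialized():
    dist.init_process_group(backend="gloo", rank=0, world_size=1)
```

With this in place, constructing a `DistributedSampler` no longer raises `ValueError: Default process group has not been initialized`, though each dataloader then shards as if `world_size=1`.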

zw81929 commented 2 weeks ago

The following code in abstract_dataloader.py may be the problem — `self.sample_size` is not initialized here:

    def __init__(self, config, dataset, sampler, shuffle=False):
        self.shuffle = shuffle
        self.config = config
        self._dataset = dataset
        self._sampler = sampler
        self._batch_size = self.step = self.model = None
        self._init_batch_size_and_step()
        index_sampler = None
        self.generator = torch.Generator()
        self.generator.manual_seed(config["seed"])
        self.transform = construct_transform(config)
        self.is_sequential = config["MODEL_TYPE"] == ModelType.SEQUENTIAL

        if not config["single_spec"]:
            index_sampler = torch.utils.data.distributed.DistributedSampler(
                list(range(self.sample_size)), shuffle=shuffle, drop_last=False
            )
            self.step = max(1, self.step // config["world_size"])
            shuffle = False
        super().__init__(
            dataset=list(range(self.sample_size)),
            batch_size=self.step,
            collate_fn=self.collate_fn,
            num_workers=config["worker"],
            shuffle=shuffle,
            sampler=index_sampler,
            generator=self.generator,
        )
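Beyond the `sample_size` question, the crash itself comes from building a `DistributedSampler` when no default process group exists. A defensive version of that branch could check `dist.is_initialized()` and fall back to the non-distributed path. The helper below is a hypothetical sketch of that guard, not RecBole's actual code:

```python
import torch
import torch.distributed as dist


def make_index_sampler(sample_size, shuffle, single_spec):
    # Hypothetical guard: only construct a DistributedSampler when we are
    # both configured for multi-process training (single_spec is False)
    # AND a default process group has actually been initialized.
    if not single_spec and dist.is_available() and dist.is_initialized():
        return torch.utils.data.distributed.DistributedSampler(
            list(range(sample_size)), shuffle=shuffle, drop_last=False
        )
    # Fall back to the single-process path: no sampler, let the
    # DataLoader handle shuffling itself.
    return None


# Without init_process_group, the guarded version returns None instead of
# raising "Default process group has not been initialized".
print(make_index_sampler(10, False, False))  # → None
```

A fix along these lines would let `load_data_and_model` work in a plain single-process script even when the checkpoint was produced by a distributed run, at the cost of silently ignoring `single_spec=False` when no group is set up.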