facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
https://mmf.sh/

AttributeError: module 'torch.distributed' has no attribute 'is_nccl_available' #353

Closed · adjgiulio closed 4 years ago

adjgiulio commented 4 years ago
(gpu) C:\Users\abc\Documents\DataScience\memes>mmf_run config=projects/hateful_memes/configs/unimodal/image.yaml model=unimodal_image dataset=hateful_memes
2020-06-23 17:15:20.787247: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
Namespace(config_override=None, local_rank=None, opts=['config=projects/hateful_memes/configs/unimodal/image.yaml', 'model=unimodal_image', 'dataset=hateful_memes'])
C:\Users\giuliano\anaconda3\envs\gpu\lib\site-packages\omegaconf\dictconfig.py:252: UserWarning: Keys with dot (../../../others/unimodal/configs/hateful_memes/image.yaml) are deprecated and will have different semantic meaning the next major version of OmegaConf (2.1)
See the compact keys issue for more details: https://github.com/omry/omegaconf/issues/152
You can disable this warning by setting the environment variable OC_DISABLE_DOT_ACCESS_WARNING=1
  warnings.warn(message=msg, category=UserWarning)
Overriding option config to projects/hateful_memes/configs/unimodal/image.yaml
Overriding option model to unimodal_image
Overriding option datasets to hateful_memes
Using seed 24744504
Traceback (most recent call last):
  File "C:\Users\giuliano\anaconda3\envs\gpu\Scripts\mmf_run-script.py", line 33, in <module>
    sys.exit(load_entry_point('mmf', 'console_scripts', 'mmf_run')())
  File "c:\users\giuliano\documents\datascience\memes\mmf\mmf_cli\run.py", line 111, in run
    main(configuration, predict=predict)
  File "c:\users\giuliano\documents\datascience\memes\mmf\mmf_cli\run.py", line 38, in main
    registry.register("writer", Logger(config, name="mmf.train"))
  File "c:\users\giuliano\documents\datascience\memes\mmf\mmf\utils\logger.py", line 19, in __init__
    self._is_master = is_master()
  File "c:\users\giuliano\documents\datascience\memes\mmf\mmf\utils\distributed.py", line 39, in is_master
    return get_rank() == 0
  File "c:\users\giuliano\documents\datascience\memes\mmf\mmf\utils\distributed.py", line 31, in get_rank
    if not dist.is_nccl_available():
AttributeError: module 'torch.distributed' has no attribute 'is_nccl_available'
vedanuj commented 4 years ago

This will be fixed once #352 lands.
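The underlying problem is that on Windows builds of PyTorch before 1.7, torch.distributed can be imported, but the distributed package is not compiled in, so everything beyond torch.distributed.is_available() (including is_nccl_available) is undefined. A minimal sketch of a defensive get_rank built on that one always-present guard (an illustration of the guard pattern, not the actual patch in #352):

import torch.distributed as dist

def get_rank():
    # Only is_available() is defined on every build of torch.distributed;
    # is_initialized() and get_rank() exist only when the package is built in.
    if not dist.is_available():
        return 0
    if not dist.is_initialized():
        return 0
    return dist.get_rank()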

adjgiulio commented 4 years ago

The latest master branch update fixes the "has no attribute 'is_nccl_available'" error.

Now running: mmf_run config=projects/hateful_memes/configs/mmbt/defaults.yaml model=mmbt dataset=hateful_memes

results in a new error: AttributeError: module 'torch.distributed' has no attribute 'is_initialized'

vedanuj commented 4 years ago

Can you provide a stack trace?

adjgiulio commented 4 years ago

Is this what you mean?

(gpu) C:\Users\giuliano>mmf_run config=projects/hateful_memes/configs/unimodal/image.yaml model=unimodal_image dataset=hateful_memes
2020-06-25 19:32:07.910907: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
Namespace(config_override=None, local_rank=None, opts=['config=projects/hateful_memes/configs/unimodal/image.yaml', 'model=unimodal_image', 'dataset=hateful_memes'])
C:\Users\giuliano\anaconda3\envs\gpu\lib\site-packages\omegaconf\dictconfig.py:252: UserWarning: Keys with dot (../../../others/unimodal/configs/hateful_memes/image.yaml) are deprecated and will have different semantic meaning the next major version of OmegaConf (2.1)
See the compact keys issue for more details: https://github.com/omry/omegaconf/issues/152
You can disable this warning by setting the environment variable OC_DISABLE_DOT_ACCESS_WARNING=1
  warnings.warn(message=msg, category=UserWarning)
Overriding option config to projects/hateful_memes/configs/unimodal/image.yaml
Overriding option model to unimodal_image
Overriding option datasets to hateful_memes
Using seed 11962886
Logging to: ./save\logs\train_2020_06_25T19_32_11.log
Traceback (most recent call last):
  File "C:\Users\giuliano\anaconda3\envs\gpu\Scripts\mmf_run-script.py", line 33, in <module>
    sys.exit(load_entry_point('mmf', 'console_scripts', 'mmf_run')())
  File "c:\users\giuliano\mmf\mmf_cli\run.py", line 111, in run
    main(configuration, predict=predict)
  File "c:\users\giuliano\mmf\mmf_cli\run.py", line 40, in main
    trainer.load()
  File "c:\users\giuliano\mmf\mmf\trainers\base_trainer.py", line 59, in load
    self.load_datasets()
  File "c:\users\giuliano\mmf\mmf\trainers\base_trainer.py", line 83, in load_datasets
    self.dataset_loader.load_datasets()
  File "c:\users\giuliano\mmf\mmf\common\dataset_loader.py", line 17, in load_datasets
    self.train_dataset.load(self.config)
  File "c:\users\giuliano\mmf\mmf\datasets\multi_dataset_loader.py", line 115, in load
    self.build_dataloaders()
  File "c:\users\giuliano\mmf\mmf\datasets\multi_dataset_loader.py", line 149, in build_dataloaders
    dataset_instance, self.config.training
  File "c:\users\giuliano\mmf\mmf\utils\build.py", line 133, in build_dataloader_and_sampler
    other_args = _add_extra_args_for_dataloader(dataset_instance, other_args)
  File "c:\users\giuliano\mmf\mmf\utils\build.py", line 169, in _add_extra_args_for_dataloader
    if torch.distributed.is_initialized():
AttributeError: module 'torch.distributed' has no attribute 'is_initialized'
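This is the same root cause: build.py calls torch.distributed.is_initialized() directly, and that symbol only exists when the distributed package is compiled into PyTorch. A hedged sketch of the guard such a call site needs (the helper name below is invented for illustration):

import torch

def distributed_world_size():
    # is_initialized() and get_world_size() are only defined when the
    # distributed package is built; is_available() exists on every build.
    if torch.distributed.is_available() and torch.distributed.is_initialized():
        return torch.distributed.get_world_size()
    return 1  # single-process fallback, e.g. Windows with PyTorch < 1.7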
chongduan commented 4 years ago

I still get the same error. Is this fixed?

apsdehal commented 4 years ago

PyTorch on Windows doesn't have the distributed backend available before version 1.7. We haven't tested MMF with 1.7 on Windows; you can try installing 1.7 and see if it works for you.
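A quick way to check what a local build supports (a sketch; printed values vary by platform and PyTorch version):

import torch
import torch.distributed as dist

print(torch.__version__)
print(dist.is_available())  # False when the distributed package is not compiled in
if dist.is_available():
    # These helpers are only defined when is_available() is True.
    print(dist.is_nccl_available())  # NCCL is generally Linux-only
    print(dist.is_gloo_available())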

JoshuaPlacidi commented 3 years ago

I tried installing PyTorch 1.7 on Windows, but it leads to version errors with mmf. I found a workaround: wrap lines 32 and 33 of ./mmf/mmf_cli/run.py in an if statement:

if torch.distributed.is_available():
    if init_distributed:
        distributed_init(config)

and do the same for lines 52-60 of ./mmf/mmf/trainers/core/device.py:

if torch.distributed.is_available():
    if "cuda" in str(self.device) and self.distributed:
        registry.register("distributed", True)
        self.model = torch.nn.parallel.DistributedDataParallel(
            self.model,
            device_ids=[self.local_rank],
            output_device=self.local_rank,
            check_reduction=True,
            find_unused_parameters=self.config.training.find_unused_parameters,
        )
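
Note that this workaround skips distributed_init and the DistributedDataParallel wrapper whenever torch.distributed is unavailable, so training falls back to a single process on one device; true multi-GPU training on Windows still requires a PyTorch build that includes the distributed package.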