Closed: adjgiulio closed this issue 4 years ago
This will be fixed once #352 lands.
The latest master branch update fixes the error `has no attribute 'is_nccl_available'`. Now running:

```
mmf_run config=projects/hateful_memes/configs/mmbt/defaults.yaml model=mmbt dataset=hateful_memes
```

results in the error: `AttributeError: module 'torch.distributed' has no attribute 'is_initialized'`
Can you provide a stack trace?
Is this what you mean?
```
(gpu) C:\Users\giuliano>mmf_run config=projects/hateful_memes/configs/unimodal/image.yaml model=unimodal_image dataset=hateful_memes
2020-06-25 19:32:07.910907: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
Namespace(config_override=None, local_rank=None, opts=['config=projects/hateful_memes/configs/unimodal/image.yaml', 'model=unimodal_image', 'dataset=hateful_memes'])
C:\Users\giuliano\anaconda3\envs\gpu\lib\site-packages\omegaconf\dictconfig.py:252: UserWarning: Keys with dot (../../../others/unimodal/configs/hateful_memes/image.yaml) are deprecated and will have different semantic meaning the next major version of OmegaConf (2.1)
See the compact keys issue for more details: https://github.com/omry/omegaconf/issues/152
You can disable this warning by setting the environment variable OC_DISABLE_DOT_ACCESS_WARNING=1
  warnings.warn(message=msg, category=UserWarning)
Overriding option config to projects/hateful_memes/configs/unimodal/image.yaml
Overriding option model to unimodal_image
Overriding option datasets to hateful_memes
Using seed 11962886
Logging to: ./save\logs\train_2020_06_25T19_32_11.log
Traceback (most recent call last):
  File "C:\Users\giuliano\anaconda3\envs\gpu\Scripts\mmf_run-script.py", line 33, in <module>
    sys.exit(load_entry_point('mmf', 'console_scripts', 'mmf_run')())
  File "c:\users\giuliano\mmf\mmf_cli\run.py", line 111, in run
    main(configuration, predict=predict)
  File "c:\users\giuliano\mmf\mmf_cli\run.py", line 40, in main
    trainer.load()
  File "c:\users\giuliano\mmf\mmf\trainers\base_trainer.py", line 59, in load
    self.load_datasets()
  File "c:\users\giuliano\mmf\mmf\trainers\base_trainer.py", line 83, in load_datasets
    self.dataset_loader.load_datasets()
  File "c:\users\giuliano\mmf\mmf\common\dataset_loader.py", line 17, in load_datasets
    self.train_dataset.load(self.config)
  File "c:\users\giuliano\mmf\mmf\datasets\multi_dataset_loader.py", line 115, in load
    self.build_dataloaders()
  File "c:\users\giuliano\mmf\mmf\datasets\multi_dataset_loader.py", line 149, in build_dataloaders
    dataset_instance, self.config.training
  File "c:\users\giuliano\mmf\mmf\utils\build.py", line 133, in build_dataloader_and_sampler
    other_args = _add_extra_args_for_dataloader(dataset_instance, other_args)
  File "c:\users\giuliano\mmf\mmf\utils\build.py", line 169, in _add_extra_args_for_dataloader
    if torch.distributed.is_initialized():
AttributeError: module 'torch.distributed' has no attribute 'is_initialized'
```
I still get the same error. Is this fixed?
PyTorch on Windows doesn't have the distributed backend available before version 1.7. We haven't tested MMF with 1.7 on Windows; you can try installing 1.7 and see if it works for you.
I tried installing PyTorch 1.7 on Windows, but it leads to version errors with mmf. I found a workaround by wrapping lines 32 and 33 of ./mmf/mmf_cli/run.py in an if statement:

```python
if torch.distributed.is_available():
    if init_distributed:
        distributed_init(config)
```
and doing the same for lines 52-60 in ./mmf/mmf/trainers/core/device.py:

```python
if torch.distributed.is_available():
    if "cuda" in str(self.device) and self.distributed:
        registry.register("distributed", True)
        self.model = torch.nn.parallel.DistributedDataParallel(
            self.model,
            device_ids=[self.local_rank],
            output_device=self.local_rank,
            check_reduction=True,
            find_unused_parameters=self.config.training.find_unused_parameters,
        )
```
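A more general way to express this guard is a small helper that probes the module defensively, so the same check also works on Windows builds where `torch.distributed` is a stub missing some attributes. This is a hypothetical sketch, not part of mmf; the `distributed_is_initialized` name and the pattern of passing the module in as an argument are my own:

```python
def distributed_is_initialized(dist_module) -> bool:
    """Return True only when the distributed backend is available,
    exposes is_initialized(), and a process group has been set up.

    `dist_module` is meant to be `torch.distributed`; taking it as a
    parameter keeps the check testable and safe against stub modules
    on Windows builds of PyTorch older than 1.7.
    """
    # On old Windows wheels, torch.distributed may lack is_available entirely.
    is_available = getattr(dist_module, "is_available", None)
    if not (is_available and is_available()):
        return False
    # Even when the backend is available, is_initialized may be absent,
    # or may return False until init_process_group() has been called.
    is_initialized = getattr(dist_module, "is_initialized", None)
    return bool(is_initialized and is_initialized())
```

With this helper, both patched sites could call `distributed_is_initialized(torch.distributed)` instead of nesting two `if` statements, and the code degrades gracefully to single-process behavior on platforms without a distributed backend.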