TRI-ML / packnet-sfm

TRI-ML Monocular Depth Estimation Repository
https://tri-ml.github.io/packnet-sfm/
MIT License

Horovod broadcast #172

Closed yaruzz closed 3 years ago

yaruzz commented 3 years ago

Thanks for the impressive work! I have a question about Horovod: how do you ensure that all ranks are initialized with the same weights? I can't find a step equivalent to calling `hvd.BroadcastGlobalVariablesHook`. It would be very kind of you if you could help me here.

The following is the code for Horovod in ./packnet_sfm/trainers/horovod_trainer.py:

```python
def fit(self, module):

    # Prepare module for training
    module.trainer = self
    # Update and print module configuration
    prep_logger_and_checkpoint(module)
    print_config(module.config)

    # Send module to GPU
    module = module.to('cuda')
    # Configure optimizer and scheduler
    module.configure_optimizers()

    # Create distributed optimizer
    compression = hvd.Compression.none
    optimizer = hvd.DistributedOptimizer(module.optimizer,
        named_parameters=module.named_parameters(), compression=compression)
    scheduler = module.scheduler

    # Get train and val dataloaders
    train_dataloader = module.train_dataloader()
    val_dataloaders = module.val_dataloader()

    # Validate before training if requested
    if self.validate_first:
        validation_output = self.validate(val_dataloaders, module)
        self.check_and_save(module, validation_output)
```
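For reference, the explicit broadcast I was expecting to see would look roughly like the sketch below. It uses `hvd.broadcast_parameters` and `hvd.broadcast_optimizer_state` from Horovod's PyTorch API. The helper name `broadcast_initial_state` and the choice to pass `hvd` in as an argument are my own assumptions for illustration and are not part of the repository. The snippet above does not show whether the trainer already guarantees identical weights some other way (for example, the same random seed on every rank).

```python
def broadcast_initial_state(hvd, module, optimizer, root_rank=0):
    """Copy root_rank's weights and optimizer state to every other rank.

    `hvd` is the imported `horovod.torch` module. It is passed in explicitly
    (a hypothetical design choice) so the helper has no hard import dependency.
    """
    # Overwrite every rank's parameters and buffers with root_rank's values
    hvd.broadcast_parameters(module.state_dict(), root_rank=root_rank)
    # Synchronise optimizer state as well (e.g. momentum buffers after a resume)
    hvd.broadcast_optimizer_state(optimizer, root_rank=root_rank)
```

In `fit`, this would presumably be called once, right after `hvd.DistributedOptimizer(...)` is created and before the first validation or training step, as `broadcast_initial_state(hvd, module, optimizer)`.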