NVlabs / LSGM

The Official PyTorch Implementation of "LSGM: Score-based Generative Modeling in Latent Space" (NeurIPS 2021)

running evaluate_vada.py, RuntimeError: Address already in use #16

Open fikry102 opened 8 months ago

fikry102 commented 8 months ago

--master_address \${NGC_MASTER_ADDR}

How should I set ${NGC_MASTER_ADDR}?

envs/lsgm/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
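
For context, "Address already in use" is what the operating system reports when something tries to bind a TCP port that another socket already holds. The sketch below is only an illustration using the standard library; `port_in_use` is a hypothetical helper, not part of LSGM or PyTorch, and it assumes the store binds on the local machine.

```python
import errno
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if binding (host, port) fails with EADDRINUSE.

    Hypothetical diagnostic helper, not part of LSGM or PyTorch.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # any other failure is unrelated to the port being taken
    return False
```

Calling `port_in_use(6020)` before launching the evaluation would show whether the port was already taken, for example by an earlier run that has not exited yet.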
fikry102 commented 8 months ago

I changed the MASTER_PORT value on line 687 of util/utils.py, and that fixed the error.

    # os.environ['MASTER_PORT'] = '6020'
    os.environ['MASTER_PORT'] = '6030'
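
Instead of switching from one hard-coded port to another, one option is to ask the OS for a free port. This is a minimal sketch under stated assumptions, not the repository's code: `find_free_port` and `set_master_env` are hypothetical names, and the sketch assumes a single launcher process picks the port once, before any workers start, so that every worker sees the same MASTER_PORT.

```python
import os
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a TCP port on `host` that is unused right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS choose a free port
        return sock.getsockname()[1]


def set_master_env(host: str = "127.0.0.1") -> int:
    """Set MASTER_ADDR / MASTER_PORT, keeping any values already present.

    Hypothetical helper: call it once in the launcher, not in each worker,
    otherwise the workers would each pick a different port.
    """
    os.environ.setdefault("MASTER_ADDR", host)
    if "MASTER_PORT" not in os.environ:
        os.environ["MASTER_PORT"] = str(find_free_port(host))
    return int(os.environ["MASTER_PORT"])
```

The port is released before it is used, so another process could in principle claim it in between. For a single-machine evaluation run that window is small, but it is not a guarantee.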
10/24 08:09:47 PM (Elapsed: 00:00:00) loading the model at:
10/24 08:09:47 PM (Elapsed: 00:00:00) checkpoints/cifar10_vae2/lsgm2/checkpoint_fid.pt
10/24 08:09:48 PM (Elapsed: 00:00:01) loaded the model at epoch 1625
10/24 08:09:48 PM (Elapsed: 00:00:01) args = Namespace(apply_sqrt2_res=True, arch_instance='res_bnswish', arch_instance_dae='res_ho_attn', autocast_eval=True, autocast_train=True, batch_size=16, beta_end=20.0, beta_start=0.1, channel_mult=[1, 2], cont_kl_anneal=False, cont_training=True, dae_arch='ncsnpp', data='/workspace/kk-vaeflow/datasets/cifar10', dataset='cifar10', decoder_dist='dml', denoising_stddevs='beta', diffusion_steps=1000, discard_dae_weights=False, discard_vae_weights=False, disjoint_training=False, distributed=True, drop_inactive_var=False, dropout=0.2, ema_decay=0.9999, embedding_dim=128, embedding_scale=1000.0, embedding_type='positional', epochs=2500, eval_ode_eps=1e-06, eval_ode_solver_tol=1e-05, fid_dir='/workspace/kk-vaeflow/fid-stats', fir=True, global_rank=0, grad_clip_max_norm=0.0, iw_sample_p='drop_all_iw', iw_sample_q='reweight_p_samples', iw_subvp_like_vp_sde=True, jac_kin_reg_drop_weights=True, jac_reg_coeff=0.0, jac_reg_freq=3, jac_reg_samples=1, kin_reg_coeff=0.0, kl_anneal_portion_vada=0.1, kl_balance_vada=False, kl_const_coeff_vada=0.7, kl_const_portion_vada=0.0, kl_max_coeff_vada=1.0, latent_grad_cutoff=0.0, learning_rate_dae=0.0001, learning_rate_min_dae=0.0001, learning_rate_min_vae=1e-05, learning_rate_vae=0.0001, local_rank=0, log_sig_q_scale=5.0, master_address='localhost', mixed_prediction=True, mixing_logit_init=-3, model_selection_criterion='fid', model_type='vada', no_autograd_jvp=False, node_rank=0, num_cell_per_cond_dec=2, num_cell_per_cond_enc=2, num_cell_per_scale_dae=8, num_channels_dae=512, num_channels_dec=128, num_channels_enc=128, num_groups_per_scale=20, num_latent_per_group=9, num_latent_scales=1, num_nf=0, num_postprocess_blocks=1, num_postprocess_cells=2, num_preprocess_blocks=1, num_preprocess_cells=2, num_proc_node=2, num_process_per_node=8, num_scales_dae=3, num_total_iter=487500, num_x_bits=8, progressive='none', progressive_combine='sum', progressive_input='none', progressive_input_vae='none', 
progressive_output_vae='none', root='/workspace/kk-vaeflow/nvae-diff/', save='/workspace/kk-vaeflow/nvae-diff//vada/RUNS_11_big/vada_big_ncsnpp_trainvae2_subvpsde_20gr_07KL_dlr-1e-4_teps-0.01_iwp-drop_all_iw_iwq-reweight_p_samples', sde_type='vpsde', seed=2, sigma2_0=0.0, sigma2_max=0.999, sigma2_min=3e-05, skip_final_eval=False, time_eps=0.01, train_ode_eps=1e-06, train_ode_solver_tol=1e-05, train_vae=True, update_q_ema=False, use_adamax=False, use_se=True, vada_checkpoint='', vae_checkpoint='/workspace/kk-vaeflow/nvae-diff/vae/RUNS_4/vae_2encdeccells_20gr_klmaxc-0.7_lgc-0.0/checkpoint.pt', vpsde_power=2, warmup_epochs=5, weight_decay=0.0003, weight_decay_norm_dae=0.01, weight_decay_norm_vae=0.01)
10/24 08:09:48 PM (Elapsed: 00:00:01) evalargs = Namespace(batch_size=32, checkpoint='checkpoints/cifar10_vae2/lsgm2/checkpoint_fid.pt', data='data/cifar10', diffusion_steps=0, distributed=True, elbo_eval=False, eval_mode='evaluate', eval_on_train=False, fid_dir='fid_stats_dir', fid_disc_eval=False, fid_ode_eval=True, global_rank=0, local_rank=0, master_address='127.0.0.1', nfe_eval=False, nll_ode_eval=True, node_rank=0, num_fid_samples=50000, num_iw_inner_samples=1, num_iw_samples=1, num_proc_node=1, num_process_per_node=2, ode_eps=1e-06, ode_sampling=False, ode_solver_tol=1e-05, readjust_bn=False, root='checkpoints', save='checkpoints/cifar10_vae2/eval', seed=1, temp=1.0, vae_temp=1.0, vae_train_mode=False)
10/24 08:09:52 PM (Elapsed: 00:00:05) VAE: param size = 100.867740M