Mikubill / naifu

Train generative models with PyTorch Lightning

Distributed training stuck at: Downloading parameters from peer #11

Closed: the-beee closed this issue 7 months ago

the-beee commented 1 year ago

I started the training on one machine (it works perfectly fine), but when I start training on the other machine, it gets stuck while downloading parameters. Here's the output I get:

/home/ai/.virtualenvs/DMshareGenesis/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py:151: FutureWarning: The configuration file of the unet has set the default `sample_size` to smaller than 64 which seems highly unlikely. If your checkpoint is a fine-tuned version of any of the following: 
- CompVis/stable-diffusion-v1-4 
- CompVis/stable-diffusion-v1-3 
- CompVis/stable-diffusion-v1-2 
- CompVis/stable-diffusion-v1-1 
- runwayml/stable-diffusion-v1-5 
- runwayml/stable-diffusion-inpainting 
 you should change 'sample_size' to 64 in the configuration file. Please make sure to update the config accordingly as leaving `sample_size=32` in the config might lead to incorrect results in future versions. If you have downloaded this checkpoint from the Hugging Face Hub, it would be very nice if you could open a Pull request for the `unet/config.json` file
  deprecate("sample_size<64", "1.0.0", deprecation_message, standard_warn=False)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Downloading: "https://pub-2fdef7a2969f43289c42ac5ae3412fd4.r2.dev/mmk.tgz" to /tmp/dataset-0

100%|██████████| 52.7M/52.7M [00:10<00:00, 5.39MB/s]
You are using a CUDA device ('NVIDIA GeForce RTX 3090 Ti') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Loading captions: 34it [00:00, 11746.82it/s]
Loading resolutions: 0it [00:00, ?it/s]
BucketManager initialized with base_res = [512, 512], max_size = [768, 512]
Loading resolutions: 34it [00:00, 1118.48it/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using scaled LR: 5e-06

  | Name         | Type                 | Params
------------------------------------------------------
0 | text_encoder | CLIPTextModel        | 123 M 
1 | vae          | AutoencoderKL        | 83.7 M
2 | unet         | UNet2DConditionModel | 859 M 
------------------------------------------------------
859 M     Trainable params
206 M     Non-trainable params
1.1 B     Total params
4,264.941 Total estimated model params size (MB)
Epoch 0:   0%|          | 0/34 [00:00<?, ?it/s] Found per machine batch size automatically from the batch: 1
Jan 05 07:38:13.862 [INFO] Found no active peers: None
Jan 05 07:38:14.991 [INFO] Initializing optimizer manually since it has no tensors in state dict. To override this, provide initialize_optimizer=False
Jan 05 07:38:16.819 [INFO] Downloading parameters from peer QmVSfugP26MS1qUBWxXG1RKhZpGkodQuFoM4HFhqTc4mwj
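
Two of the warnings above have straightforward fixes, though neither explains the hang. The Tensor Cores notice can be addressed by calling `torch.set_float32_matmul_precision("high")` before training starts, as the linked PyTorch docs suggest. For the diffusers deprecation warning, here's a minimal sketch of patching the checkpoint config as the warning asks (the path is hypothetical; point it at your own checkpoint's `unet/config.json`):

```python
import json

# Hypothetical location of the checkpoint's UNet config; adjust to your setup.
config_path = "my-checkpoint/unet/config.json"

with open(config_path) as f:
    config = json.load(f)

# Per the warning, fine-tunes of SD v1.x should use sample_size=64.
config["sample_size"] = 64

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```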

This is potentially a Hivemind issue, but I wanted to check whether anyone has encountered this before.
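
In case it helps anyone debugging the same thing: here's a minimal connectivity check I'd try, assuming the hang is network-related. The IP and port in the multiaddr below are hypothetical; use the address the first machine printed on startup. If this hangs, or no peers become visible, the two machines likely can't reach each other (NAT/firewall), which would also stall the parameter download.

```python
import hivemind

# Hypothetical multiaddr of the first machine's DHT node; replace with the
# address the first peer printed when it started.
initial_peers = [
    "/ip4/192.168.1.10/tcp/31337/p2p/QmVSfugP26MS1qUBWxXG1RKhZpGkodQuFoM4HFhqTc4mwj"
]

# Join the same DHT the trainer uses and list the addresses we can see.
dht = hivemind.DHT(initial_peers=initial_peers, start=True)
print("Visible multiaddrs:", dht.get_visible_maddrs())
dht.shutdown()
```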