ShivamShrirao / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch
https://huggingface.co/docs/diffusers
Apache License 2.0

I deleted the checkpoints and the new training stopped working. #170

Open Maximus-show opened 1 year ago

Maximus-show commented 1 year ago

Describe the bug

Hello. Why does a new training run fail after I delete the checkpoints from the previous training run?

Reproduction

The following values were not passed to accelerate launch and had defaults used instead:
        --num_cpu_threads_per_process was set to 8 to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
[!] Not using xformers memory efficient attention.
Fetching 15 files: 100%|██████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 2118.76it/s]
Traceback (most recent call last):
  File "/home/maximus_show/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/modeling_utils.py", line 90, in load_state_dict
    return torch.load(checkpoint_file, map_location="cpu")
  File "/home/maximus_show/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/serialization.py", line 777, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/home/maximus_show/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/serialization.py", line 282, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/maximus_show/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/modeling_utils.py", line 94, in load_state_dict
    if f.read().startswith("version"):
  File "/home/maximus_show/anaconda3/envs/diffusers/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/maximus_show/github/diffusers/examples/dreambooth/train_dreambooth.py", line 822, in <module>
    main(args)
  File "/home/maximus_show/github/diffusers/examples/dreambooth/train_dreambooth.py", line 448, in main
    pipeline = StableDiffusionPipeline.from_pretrained(
  File "/home/maximus_show/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/pipeline_utils.py", line 659, in from_pretrained
    loaded_sub_model = load_method(os.path.join(cached_folder, name), **loading_kwargs)
  File "/home/maximus_show/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/modeling_utils.py", line 470, in from_pretrained
    state_dict = load_state_dict(model_file)
  File "/home/maximus_show/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/modeling_utils.py", line 106, in load_state_dict
    raise OSError(
OSError: Unable to load weights from pytorch checkpoint file for '/home/maximus_show/.cache/huggingface/diffusers/models--runwayml--stable-diffusion-v1-5/snapshots/ded79e214aa69e42c24d3f5ac14b76d568679cc2/unet/diffusion_pytorch_model.bin' at '/home/maximus_show/.cache/huggingface/diffusers/models--runwayml--stable-diffusion-v1-5/snapshots/ded79e214aa69e42c24d3f5ac14b76d568679cc2/unet/diffusion_pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
Traceback (most recent call last):
  File "/home/maximus_show/anaconda3/envs/diffusers/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/maximus_show/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/maximus_show/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/home/maximus_show/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/maximus_show/anaconda3/envs/diffusers/bin/python', 'train_dreambooth.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--pretrained_vae_name_or_path=stabilityai/sd-vae-ft-mse', '--output_dir=../../../models/alvan_shivam', '--revision=fp16', '--with_prior_preservation', '--prior_loss_weight=1.0', '--seed=3434554', '--resolution=512', '--train_batch_size=1', '--train_text_encoder', '--mixed_precision=fp16', '--use_8bit_adam', '--gradient_accumulation_steps=1', '--learning_rate=1e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--num_class_images=50', '--sample_batch_size=4', '--max_train_steps=800', '--save_interval=400', '--save_sample_prompt=maximus', '--concepts_list=concepts_list.json']' returned non-zero exit status 1.
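
The first exception above ("failed finding central directory") is what PyTorch reports when a zip-format checkpoint has no readable end-of-archive record, which usually points to a truncated or incomplete file. The sketch below is one way to check the cached file named in the OSError using only the Python standard library. It assumes the checkpoint was saved in PyTorch's zip-based format, and the helper name is hypothetical, not part of diffusers or torch.

```python
import zipfile
from pathlib import Path


def looks_like_torch_zip_checkpoint(path):
    """Return True if `path` is an existing file with a readable zip central directory.

    Hypothetical helper for diagnosis only. A False result means the file is
    missing (for example a broken symlink in the cache), truncated, or not a
    zip-format checkpoint at all.
    """
    p = Path(path)
    # is_file() follows symlinks, so a dangling cache symlink yields False.
    # is_zipfile() looks for the end-of-central-directory record, which a
    # truncated download or a partially deleted file will not have.
    return p.is_file() and zipfile.is_zipfile(p)
```

Calling it on the unet/diffusion_pytorch_model.bin path from the OSError above would return False for a missing or truncated file. Whether that is the actual cause here is not confirmed in this report.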

Logs

No response

System Info

Ubuntu 20.04, NVIDIA 3060