Hello. Why does a new training run fail after the checkpoints from the previous training run have been deleted?
Reproduction
The following values were not passed to accelerate launch and had defaults used instead:
--num_cpu_threads_per_process was set to 8 to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
[!] Not using xformers memory efficient attention.
Fetching 15 files: 100%|██████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 2118.76it/s]
Traceback (most recent call last):
File "/home/maximus_show/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/modeling_utils.py", line 90, in load_state_dict
return torch.load(checkpoint_file, map_location="cpu")
File "/home/maximus_show/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/serialization.py", line 777, in load
with _open_zipfile_reader(opened_file) as opened_zipfile:
File "/home/maximus_show/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/serialization.py", line 282, in __init__
super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/maximus_show/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/modeling_utils.py", line 94, in load_state_dict
if f.read().startswith("version"):
File "/home/maximus_show/anaconda3/envs/diffusers/lib/python3.9/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/maximus_show/github/diffusers/examples/dreambooth/train_dreambooth.py", line 822, in <module>
main(args)
File "/home/maximus_show/github/diffusers/examples/dreambooth/train_dreambooth.py", line 448, in main
pipeline = StableDiffusionPipeline.from_pretrained(
File "/home/maximus_show/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/pipeline_utils.py", line 659, in from_pretrained
loaded_sub_model = load_method(os.path.join(cached_folder, name), **loading_kwargs)
File "/home/maximus_show/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/modeling_utils.py", line 470, in from_pretrained
state_dict = load_state_dict(model_file)
File "/home/maximus_show/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/modeling_utils.py", line 106, in load_state_dict
raise OSError(
OSError: Unable to load weights from pytorch checkpoint file for '/home/maximus_show/.cache/huggingface/diffusers/models--runwayml--stable-diffusion-v1-5/snapshots/ded79e214aa69e42c24d3f5ac14b76d568679cc2/unet/diffusion_pytorch_model.bin' at '/home/maximus_show/.cache/huggingface/diffusers/models--runwayml--stable-diffusion-v1-5/snapshots/ded79e214aa69e42c24d3f5ac14b76d568679cc2/unet/diffusion_pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
Traceback (most recent call last):
File "/home/maximus_show/anaconda3/envs/diffusers/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/maximus_show/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
args.func(args)
File "/home/maximus_show/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 837, in launch_command
simple_launcher(args)
File "/home/maximus_show/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 354, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/maximus_show/anaconda3/envs/diffusers/bin/python', 'train_dreambooth.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--pretrained_vae_name_or_path=stabilityai/sd-vae-ft-mse', '--output_dir=../../../models/alvan_shivam', '--revision=fp16', '--with_prior_preservation', '--prior_loss_weight=1.0', '--seed=3434554', '--resolution=512', '--train_batch_size=1', '--train_text_encoder', '--mixed_precision=fp16', '--use_8bit_adam', '--gradient_accumulation_steps=1', '--learning_rate=1e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--num_class_images=50', '--sample_batch_size=4', '--max_train_steps=800', '--save_interval=400', '--save_sample_prompt=maximus', '--concepts_list=concepts_list.json']' returned non-zero exit status 1.
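A possible way to narrow this down (a sketch, not a confirmed diagnosis): "failed finding central directory" means torch.load treated the file as a zip archive but could not find the archive's index, which is what a truncated or partially deleted cache file would look like. The helper below uses only the Python standard library to list the .bin files under a snapshot folder that are not readable zip archives. The function name find_unreadable_checkpoints is made up for this illustration, and the check assumes the weights were saved in the zip-based format; a checkpoint saved in the older non-zip format would also be flagged even if it is intact.

```python
import zipfile
from pathlib import Path


def find_unreadable_checkpoints(snapshot_dir):
    """Return the *.bin files under snapshot_dir that are not readable zip archives.

    Hypothetical helper for illustration only. zipfile.is_zipfile() returns
    False for truncated archives and for paths that cannot be opened
    (for example a dangling symlink in the cache).
    """
    unreadable = []
    for path in sorted(Path(snapshot_dir).rglob("*.bin")):
        if not zipfile.is_zipfile(path):
            unreadable.append(path)
    return unreadable
```

Calling it on the snapshot folder named in the OSError above (the directory that contains unet/) and printing the result would show whether unet/diffusion_pytorch_model.bin is among the unreadable files. If it is, removing the models--runwayml--stable-diffusion-v1-5 folder from the cache so that it is downloaded again may be worth trying.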
Logs
No response
System Info
Ubuntu 20.04, NVIDIA 3060