ShivamShrirao / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch
https://huggingface.co/docs/diffusers
Apache License 2.0

No such file or directory-pretrained_model_name_or_path=CompVis/stable-diffusion-v1 #90

Open Dreamweaveress opened 1 year ago

Dreamweaveress commented 1 year ago

Hello there,

I followed the guide from https://www.youtube.com/watch?v=w6PTviOCYQY step by step. At the end I get the error shown at the bottom of this post.

I have the Hugging Face token and ran almost all commands again, but I don't know where the error is. Here is the whole output after starting the training:

(base) master@DESKTOP-MB8HN81:~/github/diffusers/examples/dreambooth$ ./my_training.sh
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to `0`
        `--num_cpu_threads_per_process` was set to `12` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Traceback (most recent call last):
  File "/home/master/anaconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/master/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/master/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/master/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/master/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/master/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 215, in launch_agent
    spec = WorkerSpec(
  File "<string>", line 15, in __init__
  File "/home/master/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 87, in __post_init__
    assert self.local_world_size > 0
AssertionError
Traceback (most recent call last):
  File "/home/master/anaconda3/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/master/anaconda3/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/master/anaconda3/lib/python3.9/site-packages/accelerate/commands/launch.py", line 831, in launch_command
    multi_gpu_launcher(args)
  File "/home/master/anaconda3/lib/python3.9/site-packages/accelerate/commands/launch.py", line 450, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['torchrun', '--nproc_per_node', '0', 'train_dreambooth.py', '\r']' returned non-zero exit status 1.
: No such file or directory-pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4
./my_training.sh: line 7: $'--instance_data_dir=training\r': command not found
./my_training.sh: line 8: $'--class_data_dir=classes\r': command not found
./my_training.sh: line 9: $'--output_dir=output\r': command not found
./my_training.sh: line 10: --with_prior_preservation: command not found
./my_training.sh: line 11: --instance_prompt=a photo of pat22: command not found
./my_training.sh: line 12: --class_prompt=a photo of pat22: command not found
./my_training.sh: line 13: --resolution=512: command not found
./my_training.sh: line 14: --train_batch_size=1: command not found
./my_training.sh: line 15: --gradient_accumulation_steps=2: command not found
./my_training.sh: line 16: --use_8bit_adam: command not found
./my_training.sh: line 17: --learning_rate=5e-6: command not found
./my_training.sh: line 18: --lr_scheduler=constant: command not found
./my_training.sh: line 19: --lr_warmup_steps=0: command not found
./my_training.sh: line 20: --num_class_images=200: command not found
./my_training.sh: line 21: --max_train_steps=800: command not found

subprocess.CalledProcessError: Command '['torchrun', '--nproc_per_node', '0', 'train_dreambooth.py', '\r']' returned non-zero exit status 1. : No such file or directory-pretrained_model_name_or_path=CompVis/stable-diffusion-v1

InB4DevOps commented 1 year ago

Please post your my_training.sh file

InB4DevOps commented 1 year ago

Did you save the file on a Windows PC? Make sure the line endings are LF only, not CRLF (Notepad++ has a function for that, google it).
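
If you would rather check and fix this from the Ubuntu shell than in an editor, a minimal sketch could look like this. The helper names `has_cr` and `strip_cr` are made up for illustration, and it assumes a POSIX shell with `grep`, `tr` and `mktemp` available:

```bash
# Hypothetical helper: succeeds (exit 0) if the file contains a carriage return.
has_cr() {
  grep -q "$(printf '\r')" "$1"
}

# Hypothetical helper: remove every carriage return from the file.
# Writing back with "cat >" keeps the file's permissions (e.g. the executable bit).
strip_cr() {
  tmp_file="$(mktemp)"
  tr -d '\r' < "$1" > "$tmp_file" && cat "$tmp_file" > "$1"
  rm -f "$tmp_file"
}

# Demo on a throwaway file with CRLF line endings
demo_file="$(mktemp)"
printf 'export MODEL_NAME="x"\r\necho ok\r\n' > "$demo_file"
if has_cr "$demo_file"; then echo "CRLF found"; fi
strip_cr "$demo_file"
if has_cr "$demo_file"; then echo "still CRLF"; else echo "LF only"; fi
rm -f "$demo_file"
```

On the real script this would be `has_cr my_training.sh` followed by `strip_cr my_training.sh`.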

Dreamweaveress commented 1 year ago

Hey there, thank you for helping me.

I changed the CRLF to LF with Notepad++, saved the file, pushed it to Ubuntu and made it executable again.

Here is the my_training.sh:

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="training"
export CLASS_DIR="classes"
export OUTPUT_DIR="output"
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="a photo of pat22" \
  --class_prompt="a photo of pat22" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=2 --gradient_checkpointing \
  --use_8bit_adam \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=200 \
  --max_train_steps=800

I ran my_training.sh again but got almost the same error:

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to `0`
        `--num_cpu_threads_per_process` was set to `12` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Traceback (most recent call last):
  File "/home/master/anaconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/master/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/master/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/master/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/master/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/master/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 215, in launch_agent
    spec = WorkerSpec(
  File "<string>", line 15, in __init__
  File "/home/master/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 87, in __post_init__
    assert self.local_world_size > 0
AssertionError
Traceback (most recent call last):
  File "/home/master/anaconda3/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/master/anaconda3/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/master/anaconda3/lib/python3.9/site-packages/accelerate/commands/launch.py", line 831, in launch_command
    multi_gpu_launcher(args)
  File "/home/master/anaconda3/lib/python3.9/site-packages/accelerate/commands/launch.py", line 450, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['torchrun', '--nproc_per_node', '0', 'train_dreambooth.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--instance_data_dir=training', '--class_data_dir=classes', '--output_dir=output', '--with_prior_preservation', '--prior_loss_weight=1.0', '--instance_prompt=a photo of pat22', '--class_prompt=a photo of pat22', '--resolution=512', '--train_batch_size=1', '--gradient_accumulation_steps=2', '--gradient_checkpointing', '--use_8bit_adam', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--num_class_images=200', '--max_train_steps=800']' returned non-zero exit status 1.
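
One detail in this output: `--num_processes` was set to `0`, so torchrun is started with `--nproc_per_node 0`, which is what trips `assert self.local_world_size > 0`. A possible reason (an assumption, not confirmed anywhere in this thread) is that no GPU is visible inside this environment. A rough sketch for checking that; `count_gpus` is a hypothetical helper, and it assumes `nvidia-smi` is how GPUs are listed on the machine:

```bash
# Hypothetical helper: print the number of GPUs that nvidia-smi can list.
# Prints 0 when nvidia-smi is missing or fails.
count_gpus() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    # "nvidia-smi -L" prints one line per GPU, each starting with "GPU ".
    # grep -c exits non-zero on a count of 0, hence the "|| true".
    nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true
  else
    echo 0
  fi
}

gpus="$(count_gpus)"
echo "visible GPUs: $gpus"

# If at least one GPU is visible, the process count can be passed explicitly
# instead of relying on the default, for example:
#   accelerate launch --num_processes=1 train_dreambooth.py ...
```

The warning at the top of the output also suggests running `accelerate config` once, which stores these values so they do not have to be passed on every launch.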

petekay commented 1 year ago

I get the same error:

Entry Not Found for url: https://huggingface.co/CompVis/stable-diffusion-v1-4/resolve/main/config.json.

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="blub/training"
export CLASS_DIR="blub/classes"
export OUTPUT_DIR="blub/model"

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="photo of smnb person" \
  --class_prompt="photo of person" \
  --seed=1337 \
  --resolution=512 \
  --train_batch_size=1 \
  --use_8bit_adam \
  --gradient_checkpointing \
  --learning_rate=2e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=200 \
  --max_train_steps=800

The line endings are all LF (checked with Notepad++), including the final line.

It worked yesterday. I just pulled the newest code and had to remove --use_auth_token. This line worked yesterday (pre-pull):

--pretrained_model_name_or_path=$MODEL_NAME --use_auth_token \

If I remove --use_auth_token, I get the "Entry Not Found" error above. I am also logged in to the Hugging Face CLI with my token, like always.

InB4DevOps commented 1 year ago

Try deleting the folders in /home/username/.cache/huggingface/diffusers/. This will redownload the selected model. Then start another training run. For this it is best to set the max training steps to 10 or so, so you don't have to wait thousands of steps to see it fail. Sometimes it downloads files even after a training run has failed, and then the next training works. I also had this once, and this is what helped.
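
As a sketch of that cleanup step, assuming the cache really lives at the path mentioned above (`HF_DIFFUSERS_CACHE` and `clear_diffusers_cache` are names invented for this example, not something the library defines):

```bash
# Assumed cache location, taken from the path mentioned above.
# HF_DIFFUSERS_CACHE is a made-up variable for this sketch.
HF_DIFFUSERS_CACHE="${HF_DIFFUSERS_CACHE:-$HOME/.cache/huggingface/diffusers}"

# Hypothetical helper: delete the given cache directory if it exists.
clear_diffusers_cache() {
  cache_dir="$1"
  if [ -d "$cache_dir" ]; then
    rm -rf -- "$cache_dir"
    echo "removed $cache_dir"
  else
    echo "nothing to remove at $cache_dir"
  fi
}

# Left commented out so that pasting this block does not delete anything:
# clear_diffusers_cache "$HF_DIFFUSERS_CACHE"
```

After clearing, a short run with `--max_train_steps=10` in the training script is enough to see whether the model files download again.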

BTW @Dreamweaveress, better change this: '--instance_prompt=a photo of pat22', '--class_prompt=a photo of pat22'

This is not the cause of your problems, but I'd change it to e.g. --instance_prompt="a photo of a pat22 thing" --class_prompt="a photo of a thing"

No idea what pat22 is, but assuming it's a man, change "thing" to "man". Then, when generating an image, use "a photo of a pat22 man [...]".

Dreamweaveress commented 1 year ago

Hey there, found out that the folder is empty

Could it be that the folders are missing the files they need? I followed every step from this pastebin for the YouTube video I posted before: https://pastebin.com/uE1WcSxD

When I run them again, I get the answer that the files are already downloaded.

@petekay - thx, I tried with the auth token and without, but the result is the same.

InB4DevOps commented 1 year ago

That's the wrong folder. Go to /home/master/.cache/huggingface and delete the diffusers folder.

InB4DevOps commented 1 year ago

This might also help, as there seems to be a problem with this fork downloading model files:

https://github.com/ShivamShrirao/diffusers/issues/50#issuecomment-1294854643

petekay commented 1 year ago

Thank you, I deleted the cache, but I get a slightly different error now:

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_cpu_threads_per_process` was set to `8` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Traceback (most recent call last):
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/huggingface_hub/utils/_errors.py", line 213, in hf_raise_for_status
    response.raise_for_status()
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/CompVis/stable-diffusion-v1-4/resolve/main/diffusion_pytorch_model.bin

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/modeling_utils.py", line 327, in from_pretrained
    model_file = hf_hub_download(
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1053, in hf_hub_download
    metadata = get_hf_file_metadata(
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1359, in get_hf_file_metadata
    hf_raise_for_status(r)
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/huggingface_hub/utils/_errors.py", line 231, in hf_raise_for_status
    raise EntryNotFoundError(message, response) from e
huggingface_hub.utils._errors.EntryNotFoundError: 404 Client Error. (Request ID: Tx....)

Entry Not Found for url: https://huggingface.co/CompVis/stable-diffusion-v1-4/resolve/main/diffusion_pytorch_model.bin.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/username/github/diffusers/examples/dreambooth/train_dreambooth.py", line 765, in <module>
    main()
  File "/home/username/github/diffusers/examples/dreambooth/train_dreambooth.py", line 431, in main
    vae=AutoencoderKL.from_pretrained(args.pretrained_vae_name_or_path or args.pretrained_model_name_or_path),
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/modeling_utils.py", line 355, in from_pretrained
    raise EnvironmentError(
OSError: CompVis/stable-diffusion-v1-4 does not appear to have a file named diffusion_pytorch_model.bin.
Traceback (most recent call last):
  File "/home/username/anaconda3/envs/diffusers/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/username/anaconda3/envs/diffusers/bin/python', 'train_dreambooth.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--instance_data_dir=myoutput/training', '--class_data_dir=myoutput/classes', '--output_dir=myoutput/model', '--with_prior_preservation', '--prior_loss_weight=1.0', '--instance_prompt=photo of smnb person', '--class_prompt=photo of person', '--seed=1337', '--resolution=512', '--train_batch_size=1', '--sample_batch_size=1', '--gradient_accumulation_steps=1', '--gradient_checkpointing', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--use_8bit_adam', '--num_class_images=50', '--max_train_steps=800', '--mixed_precision=fp16']' returned non-zero exit status 1.

You can check the link yourself: https://huggingface.co/CompVis/stable-diffusion-v1-4/resolve/main/diffusion_pytorch_model.bin --> Entry not found

The correct link should be this one: https://huggingface.co/CompVis/stable-diffusion-v1-4/blob/main/vae/diffusion_pytorch_model.bin

Or, more specifically, this one: https://huggingface.co/CompVis/stable-diffusion-v1-4/resolve/main/vae/diffusion_pytorch_model.bin (the download starts if you click it)

As you can see, /vae/ is a sub-folder, and it is missing from the URL in the error above.
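
To make the difference explicit, here is a small sketch that builds both URLs from the repo name, an optional sub-folder and the file name. `hf_resolve_url` is a hypothetical helper written for this comment; it only mirrors the URL pattern visible in the links above and is not part of any library:

```bash
# Hypothetical helper: build a "resolve/main" URL like the ones quoted above.
# Arguments: repo id, sub-folder (may be empty), file name.
hf_resolve_url() {
  repo="$1"
  subfolder="$2"
  filename="$3"
  if [ -n "$subfolder" ]; then
    printf 'https://huggingface.co/%s/resolve/main/%s/%s\n' "$repo" "$subfolder" "$filename"
  else
    printf 'https://huggingface.co/%s/resolve/main/%s\n' "$repo" "$filename"
  fi
}

# URL from the traceback (no sub-folder): reported as "Entry Not Found"
hf_resolve_url "CompVis/stable-diffusion-v1-4" "" "diffusion_pytorch_model.bin"

# URL with the vae sub-folder: the one that starts a download
hf_resolve_url "CompVis/stable-diffusion-v1-4" "vae" "diffusion_pytorch_model.bin"
```

So the training script seems to request the VAE weights from the repo root, while the file only exists under vae/.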

InB4DevOps commented 1 year ago

Please try this #50