bghira / SimpleTuner

A general fine-tuning kit geared toward diffusion models.
GNU Affero General Public License v3.0

caching text embeds for validation prompts fails when they do not yet exist on disk #273

Closed janzd closed 10 months ago

janzd commented 10 months ago

I have a question about the dataloader configuration file. What exactly does the "default" parameter do, and when should I set it to true versus false? I can't get training to run with either value, though.

When I use "true", I get this error.

  File "/data/src/SimpleTuner/train_sdxl.py", line 1486, in <module>
    main()
  File "/data/src/SimpleTuner/train_sdxl.py", line 458, in main
    ) = prepare_validation_prompt_list(
  File "/data/src/SimpleTuner/helpers/legacy/validation.py", line 73, in prepare_validation_prompt_list
    ) = embed_cache.compute_embeddings_for_sdxl_prompts(
  File "/data/src/SimpleTuner/helpers/caching/sdxl_embeds.py", line 301, in compute_embeddings_for_sdxl_prompts
    prompt_embeds, add_text_embeds = self.load_from_cache(filename)
  File "/data/src/SimpleTuner/helpers/caching/sdxl_embeds.py", line 114, in load_from_cache
    result = self.data_backend.torch_load(filename)
  File "/data/src/SimpleTuner/helpers/data_backend/local.py", line 156, in torch_load
    raise FileNotFoundError(f"{filename} not found.")
FileNotFoundError: ./<PATH>/textembed_cache/5135a082f9c5bd92150b6270b1d326a6-sdxl.pt not found.

I have a validation prompt set in the training script, as shown in the example, using export VALIDATION_PROMPT="some validation prompt text".

When I use "false", I get this.

  File "/data/src/SimpleTuner/train_sdxl.py", line 1486, in <module>
    main()
  File "/data/src/SimpleTuner/train_sdxl.py", line 458, in main
    ) = prepare_validation_prompt_list(
  File "/data/src/SimpleTuner/helpers/legacy/validation.py", line 32, in prepare_validation_prompt_list
    raise ValueError(
ValueError: Embed cache engine did not contain a model_type. Cannot continue.

I have a directory with training data: images and their corresponding caption text files (a sample layout is shown after the config below). This is my dataloader configuration file.

[
    {
        "id": "<NAME>",
        "type": "local",
        "instance_data_dir": "<PATH>",
        "crop": false,
        "crop_style": "center",
        "crop_aspect": "preserve",
        "resolution": 1.0,
        "resolution_type": "pixel",
        "minimum_image_size": 1.0,
        "prepend_instance_prompt": false,
        "instance_prompt": "",
        "only_instance_prompt": false,
        "caption_strategy": "textfile",
        "cache_dir_vae": "<PATH>/vaecache",
        "vae_cache_clear_each_epoch": false,
        "probability": 1.0,
        "text_embeds": "<TEXT_EMBEDS>"
    },
    {
        "id": "<TEXT_EMBEDS>",
        "dataset_type": "text_embeds",
        "default": false,
        "type": "local",
        "cache_dir": "<PATH>/textembed_cache"
    }
]
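
For reference, the data directory is laid out roughly like this (file names are made up; as I understand caption_strategy "textfile", each image is paired with a .txt caption of the same base name):

<PATH>/
    image_001.png
    image_001.txt
    image_002.jpg
    image_002.txt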

I could run training with earlier versions, but I can't successfully run the script after updating to the new version that uses a dataloader configuration file. I've tried both the latest code as of writing this issue and the most recent release candidate (v0.9.0-rc1). I've censored the actual paths and names in the error messages and configuration, but I'm using valid, existing paths.

bghira commented 10 months ago

you must always have at least one default text embed backend; it's where things like validation prompt embeds are stored, so that we don't have to load the text encoder at validation time during a full u-net tuning job. your configuration does look correct, but it seems there is some error when it comes to caching these embeds. i will try to reproduce this issue.
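
for illustration only, a minimal default text embed backend entry is the same as the one in your config with the default flag flipped to true (the id and cache_dir placeholders here just mirror your config above):

{
    "id": "<TEXT_EMBEDS>",
    "dataset_type": "text_embeds",
    "default": true,
    "type": "local",
    "cache_dir": "<PATH>/textembed_cache"
}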

janzd commented 10 months ago

Thanks for the answer. I get it now.
It should use the validation prompt and cache its embeddings at the beginning of training, right? But that somehow fails: it can't find the embeddings for the prompt and therefore throws FileNotFoundError. I'll also try to see if I can find what's wrong later today.

bghira commented 10 months ago

if you set SIMPLETUNER_LOG_LEVEL=DEBUG, it will print out quite a lot of information
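
for example, exported in the shell before launching training the same way as before (same pattern as the VALIDATION_PROMPT export in the training script):

export SIMPLETUNER_LOG_LEVEL=DEBUG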

bghira commented 10 months ago

fwiw there's gotta be commits you don't have yet, as L458 for me isn't that method. can you try the master branch?

janzd commented 10 months ago

I already have SIMPLETUNER_LOG_LEVEL=DEBUG set. Okay: I've actually tried both the master branch as of yesterday and the v0.9.0 release candidate, and the error output above was probably from the 0.9-rc version. Here's the traceback produced by the current master branch.

2024-01-18 00:08:10,585 [DEBUG] (VAECache) (Rank: 0) Completed process_buckets, all futures have been returned.              
2024-01-18 00:08:10,586 [DEBUG] (BucketManager) save_cache has config to write: {'vae_cache_clear_each_epoch': False, 'probability': 1.0, 'crop': False, 'crop_aspect': 'preserve', 'crop_style': 'center', 'resolution': 1.0, 'resolution_type': 'pixel'}
2024-01-18 00:08:10,602 [DEBUG] (TextEmbeddingCache) (id=<TEXT_EMBEDS>) Running compute_embeddings_for_sdxl_prompts on 1 prompts..
2024-01-18 00:08:10,604 [DEBUG] (LocalDataBackend) Checking if ./<PATH>/textembed_cache/5135a082f9c5bd92150b6270b1d326a6-sdxl.pt exists = False
Traceback (most recent call last):
  File "/data/src/SimpleTuner/train_sdxl.py", line 1469, in <module>
    main()
  File "/data/src/SimpleTuner/train_sdxl.py", line 446, in main
    ) = prepare_validation_prompt_list(
  File "/data/src/SimpleTuner/helpers/legacy/validation.py", line 73, in prepare_validation_prompt_list
    ) = embed_cache.compute_embeddings_for_sdxl_prompts(
  File "/data/src/SimpleTuner/helpers/caching/sdxl_embeds.py", line 301, in compute_embeddings_for_sdxl_prompts
    prompt_embeds, add_text_embeds = self.load_from_cache(filename)
  File "/data/src/SimpleTuner/helpers/caching/sdxl_embeds.py", line 114, in load_from_cache
    result = self.data_backend.torch_load(filename)
  File "/data/src/SimpleTuner/helpers/data_backend/local.py", line 156, in torch_load
    raise FileNotFoundError(f"{filename} not found.")
FileNotFoundError: ./<PATH>/textembed_cache/5135a082f9c5bd92150b6270b1d326a6-sdxl.pt not found.

janzd commented 10 months ago

Btw, this is the DEBUG output from the setup phase.

[2024-01-18 00:01:22,659] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-18 00:01:26,899] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2024-01-18 00:01:28,230 [INFO] (ArgsParser) Default VAE Cache location: /data/src/SimpleTuner/ckpt/<PROJECT_NAME>/cache_vae
2024-01-18 00:01:28,230 [INFO] (ArgsParser) Text Cache location: cache
2024-01-18 00:01:28,232 [INFO] (__main__) Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: bf16

2024-01-18 00:01:28,233 [INFO] (__main__) Enabling tf32 precision boost for NVIDIA devices due to --allow_tf32.
2024-01-18 00:01:28,233 [INFO] (__main__) Load tokenizers
2024-01-18 00:01:29,590 [INFO] (__main__) Load text encoder 1..
2024-01-18 00:01:30,857 [INFO] (__main__) Load text encoder 2..
2024-01-18 00:01:35,496 [INFO] (__main__) Load VAE..
2024-01-18 00:01:35,794 [INFO] (__main__) Moving models to GPU. Almost there.
2024-01-18 00:01:36,718 [INFO] (__main__) Creating the U-net..
2024-01-18 00:01:38,011 [INFO] (__main__) Moving the U-net to GPU.
2024-01-18 00:01:44,275 [INFO] (__main__) Enabling xformers memory-efficient attention.
2024-01-18 00:01:44,544 [INFO] (__main__) Initialising VAE in bf16 precision, you may specify a different value if preferred: bf16, fp16, fp32, default
2024-01-18 00:01:44,544 [DEBUG] (__main__) Initialising VAE with weight dtype torch.bfloat16
2024-01-18 00:01:45,098 [INFO] (__main__) Loaded VAE into VRAM.
2024-01-18 00:01:45,126 [DEBUG] (PromptHandler) Initialising Compel prompt manager with dual text encoders.
2024-01-18 00:01:45,126 [INFO] (DataBackendFactory) Configuring text embed backend: <TEXT_EMBEDS>
2024-01-18 00:01:45,126 [DEBUG] (TextEmbeddingCache) (Rank: 0) Creating cache directory if it doesn't exist.
2024-01-18 00:01:45,126 [DEBUG] (LocalDataBackend) Creating directory: ./<PATH>/textembed_cache
2024-01-18 00:01:45,128 [INFO] (TextEmbeddingCache) (id=<TEXT_EMBEDS>) Listing all text embed cache entries
2024-01-18 00:01:45,135 [DEBUG] (TextEmbeddingCache) (Rank: 0)  -> done listing all text embed cache entries
2024-01-18 00:01:45,135 [INFO] (DataBackendFactory) Pre-computing null embedding for caption dropout
2024-01-18 00:01:45,137 [DEBUG] (TextEmbeddingCache) (id=<TEXT_EMBEDS>) All prompts are cached, ignoring.
2024-01-18 00:01:45,137 [INFO] (DataBackendFactory) Completed loading text embed services.
2024-01-18 00:01:45,138 [INFO] (DataBackendFactory) Loading bucket manager.
2024-01-18 00:01:45,138 [DEBUG] (LocalDataBackend) Checking if <PATH>/aspect_ratio_bucket_indices.json exists = True
2024-01-18 00:01:45,138 [DEBUG] (BucketManager) Pulling cache file from storage.
2024-01-18 00:01:45,159 [DEBUG] (BucketManager) Completed loading cache data.
2024-01-18 00:01:45,159 [DEBUG] (BucketManager) Setting config to {'vae_cache_clear_each_epoch': False, 'probability': 1.0, 'crop': False, 'crop_aspect': 'preserve', 'crop_style': 'center', 'resolution': 1.0, 'resolution_type': 'pixel'}
2024-01-18 00:01:45,159 [DEBUG] (BucketManager) Loaded previous data backend config: {'vae_cache_clear_each_epoch': False, 'probability': 1.0, 'crop': False, 'crop_aspect': 'preserve', 'crop_style': 'center', 'resolution': 1.0, 'resolution_type': 'pixel'}
2024-01-18 00:01:45,160 [INFO] (DataBackendFactory) Refreshing aspect buckets.
2024-01-18 00:01:45,160 [INFO] (BucketManager) Discovering new files...
2024-01-18 00:01:45,161 [DEBUG] (LocalDataBackend) LocalDataBackend.list_files: str_pattern=*.[jJpP][pPnN][gG], instance_data_root=<PATH>
bghira commented 10 months ago

I have located the issue. I missed a spot when updating the cache embed logic this past weekend, so it wasn't using the correct method for preparing the negative prompt embed. There was also a spot that could use some error catching, and now it has it. However, this could hide further issues if text embeds somehow 'disappear' during training, since they will now be recomputed instead of raising an error. I'll open a new issue for that. Please test pull request #275 if possible.
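
in rough terms, the added error catching turns the cache load into a compute-on-miss fallback. a sketch only, not the actual SimpleTuner code: torch_load is the backend method from the traceback above, while torch_save and compute_fn are assumed stand-ins for the real save and text-encoder methods.

import logging

logger = logging.getLogger("TextEmbeddingCache")

def load_or_compute_embedding(data_backend, filename, compute_fn, prompt):
    try:
        # happy path: the embed was cached by an earlier run
        return data_backend.torch_load(filename)
    except FileNotFoundError:
        # cache miss (e.g. a brand-new validation prompt): compute the
        # embedding now and persist it so later epochs and runs can reuse it
        logger.warning("%s missing from cache; recomputing.", filename)
        embeds = compute_fn(prompt)
        data_backend.torch_save(embeds, filename)  # assumed counterpart to torch_load
        return embeds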