You must always have at least one default text embed backend; it's where things like validation prompt embeds are stored, so that we don't have to load the text encoder at validation time during a full U-net tuning job. Your configuration does look correct, but it seems there is some error when it comes to caching these embeds. I will try to reproduce this issue.
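For reference, the default text embed backend is declared alongside the image backends in the dataloader configuration; a minimal sketch looks roughly like this (the ids and paths are placeholders, and exact key names may differ between versions):

[
  {
    "id": "my-dataset",
    "type": "local",
    "instance_data_dir": "/path/to/images",
    "caption_strategy": "textfile"
  },
  {
    "id": "text-embeds",
    "dataset_type": "text_embeds",
    "type": "local",
    "default": true,
    "cache_dir": "/path/to/textembed_cache"
  }
]

The backend carrying "default": true is the one used for validation prompt and caption-dropout embeds when nothing else claims them.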
Thanks for the answer. I get it now.
It should use the validation prompt and cache the embeddings for it at the beginning of training, right? But that somehow fails: it can't find the embeddings for the text prompt and therefore throws a FileNotFoundError.
I'll also try to see if I can find what's wrong later today.
If you set SIMPLETUNER_LOG_LEVEL=DEBUG, it will print out quite a lot of information.
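For example, before launching the trainer (adjust the launch command to however you normally start training):

export SIMPLETUNER_LOG_LEVEL=DEBUG
accelerate launch train_sdxl.py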
FWIW, there must be commits you don't have yet, as L458 for me isn't that method. Can you try the master branch?
I already have SIMPLETUNER_LOG_LEVEL=DEBUG set.
Okay. I've actually tried both the master branch as of yesterday and the v0.9.0 release candidate; the error output above was probably from when I tried the 0.9-rc version.
Here's the traceback produced by the current master branch.
2024-01-18 00:08:10,585 [DEBUG] (VAECache) (Rank: 0) Completed process_buckets, all futures have been returned.
2024-01-18 00:08:10,586 [DEBUG] (BucketManager) save_cache has config to write: {'vae_cache_clear_each_epoch': False, 'probability': 1.0, 'crop': False, 'crop_aspect': 'preserve', 'crop_style': 'center', 'resolution': 1.0, 'resolution_type': 'pixel'}
2024-01-18 00:08:10,602 [DEBUG] (TextEmbeddingCache) (id=<TEXT_EMBEDS>) Running compute_embeddings_for_sdxl_prompts on 1 prompts..
2024-01-18 00:08:10,604 [DEBUG] (LocalDataBackend) Checking if ./<PATH>/textembed_cache/5135a082f9c5bd92150b6270b1d326a6-sdxl.pt exists = False
Traceback (most recent call last):
File "/data/src/SimpleTuner/train_sdxl.py", line 1469, in <module>
main()
File "/data/src/SimpleTuner/train_sdxl.py", line 446, in main
) = prepare_validation_prompt_list(
File "/data/src/SimpleTuner/helpers/legacy/validation.py", line 73, in prepare_validation_prompt_list
) = embed_cache.compute_embeddings_for_sdxl_prompts(
File "/data/src/SimpleTuner/helpers/caching/sdxl_embeds.py", line 301, in compute_embeddings_for_sdxl_prompts
prompt_embeds, add_text_embeds = self.load_from_cache(filename)
File "/data/src/SimpleTuner/helpers/caching/sdxl_embeds.py", line 114, in load_from_cache
result = self.data_backend.torch_load(filename)
File "/data/src/SimpleTuner/helpers/data_backend/local.py", line 156, in torch_load
raise FileNotFoundError(f"{filename} not found.")
FileNotFoundError: ./<PATH>/textembed_cache/5135a082f9c5bd92150b6270b1d326a6-sdxl.pt not found.
Btw, this is the DEBUG info from the setup phase.
[2024-01-18 00:01:22,659] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-18 00:01:26,899] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2024-01-18 00:01:28,230 [INFO] (ArgsParser) Default VAE Cache location: /data/src/SimpleTuner/ckpt/<PROJECT_NAME>/cache_vae
2024-01-18 00:01:28,230 [INFO] (ArgsParser) Text Cache location: cache
2024-01-18 00:01:28,232 [INFO] (__main__) Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: bf16
2024-01-18 00:01:28,233 [INFO] (__main__) Enabling tf32 precision boost for NVIDIA devices due to --allow_tf32.
2024-01-18 00:01:28,233 [INFO] (__main__) Load tokenizers
2024-01-18 00:01:29,590 [INFO] (__main__) Load text encoder 1..
2024-01-18 00:01:30,857 [INFO] (__main__) Load text encoder 2..
2024-01-18 00:01:35,496 [INFO] (__main__) Load VAE..
2024-01-18 00:01:35,794 [INFO] (__main__) Moving models to GPU. Almost there.
2024-01-18 00:01:36,718 [INFO] (__main__) Creating the U-net..
2024-01-18 00:01:38,011 [INFO] (__main__) Moving the U-net to GPU.
2024-01-18 00:01:44,275 [INFO] (__main__) Enabling xformers memory-efficient attention.
2024-01-18 00:01:44,544 [INFO] (__main__) Initialising VAE in bf16 precision, you may specify a different value if preferred: bf16, fp16, fp32, default
2024-01-18 00:01:44,544 [DEBUG] (__main__) Initialising VAE with weight dtype torch.bfloat16
2024-01-18 00:01:45,098 [INFO] (__main__) Loaded VAE into VRAM.
2024-01-18 00:01:45,126 [DEBUG] (PromptHandler) Initialising Compel prompt manager with dual text encoders.
2024-01-18 00:01:45,126 [INFO] (DataBackendFactory) Configuring text embed backend: <TEXT_EMBEDS>
2024-01-18 00:01:45,126 [DEBUG] (TextEmbeddingCache) (Rank: 0) Creating cache directory if it doesn't exist.
2024-01-18 00:01:45,126 [DEBUG] (LocalDataBackend) Creating directory: ./<PATH>/textembed_cache
2024-01-18 00:01:45,128 [INFO] (TextEmbeddingCache) (id=<TEXT_EMBEDS>) Listing all text embed cache entries
2024-01-18 00:01:45,135 [DEBUG] (TextEmbeddingCache) (Rank: 0) -> done listing all text embed cache entries
2024-01-18 00:01:45,135 [INFO] (DataBackendFactory) Pre-computing null embedding for caption dropout
2024-01-18 00:01:45,137 [DEBUG] (TextEmbeddingCache) (id=<TEXT_EMBEDS>) All prompts are cached, ignoring.
2024-01-18 00:01:45,137 [INFO] (DataBackendFactory) Completed loading text embed services.
2024-01-18 00:01:45,138 [INFO] (DataBackendFactory) Loading bucket manager.
2024-01-18 00:01:45,138 [DEBUG] (LocalDataBackend) Checking if <PATH>/aspect_ratio_bucket_indices.json exists = True
2024-01-18 00:01:45,138 [DEBUG] (BucketManager) Pulling cache file from storage.
2024-01-18 00:01:45,159 [DEBUG] (BucketManager) Completed loading cache data.
2024-01-18 00:01:45,159 [DEBUG] (BucketManager) Setting config to {'vae_cache_clear_each_epoch': False, 'probability': 1.0, 'crop': False, 'crop_aspect': 'preserve', 'crop_style': 'center', 'resolution': 1.0, 'resolution_type': 'pixel'}
2024-01-18 00:01:45,159 [DEBUG] (BucketManager) Loaded previous data backend config: {'vae_cache_clear_each_epoch': False, 'probability': 1.0, 'crop': False, 'crop_aspect': 'preserve', 'crop_style': 'center', 'resolution': 1.0, 'resolution_type': 'pixel'}
2024-01-18 00:01:45,160 [INFO] (DataBackendFactory) Refreshing aspect buckets.
2024-01-18 00:01:45,160 [INFO] (BucketManager) Discovering new files...
2024-01-18 00:01:45,161 [DEBUG] (LocalDataBackend) LocalDataBackend.list_files: str_pattern=*.[jJpP][pPnN][gG], instance_data_root=<PATH>
I have located the issue. I missed a spot when updating the cache embed logic this past weekend, so it wasn't using the correct method for preparing the negative prompt embed. There was also a spot that could use some error catching, and now it has it. However, this could raise further issues if text embeds somehow 'disappear' during training, as they will now be recomputed instead of raising an error. I'll open a new issue for that. Please test pull request #275 if possible.
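Conceptually, the change amounts to a compute-on-miss fallback around the cache load. A simplified sketch of the pattern, not the literal diff (torch_save, the compute callback, and the md5 keying are assumptions based on the log output above, where torch_load is the only call confirmed by the traceback):

import hashlib

def load_or_compute_embeds(data_backend, compute_fn, cache_dir, prompt):
    # Illustrative sketch only; not the actual SimpleTuner code.
    # The cache key is a hash of the prompt (the *-sdxl.pt filenames above look md5-shaped).
    filename = f"{cache_dir}/{hashlib.md5(prompt.encode()).hexdigest()}-sdxl.pt"
    try:
        # Previously, a missing file here raised FileNotFoundError straight to the caller.
        return data_backend.torch_load(filename)
    except FileNotFoundError:
        # Now the embeds are recomputed and written back instead of erroring out.
        # Caveat: embeds that "disappear" mid-training will be silently recomputed.
        embeds = compute_fn(prompt)
        data_backend.torch_save(embeds, filename)
        return embeds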
I have a question about the dataloader configuration file. What exactly is the "default" parameter, and when should I set it to true and when to false? I can't make the script run with either value, though.
When I use "true", I get this error.
I have a validation prompt set in the training script as shown in the example, using:
export VALIDATION_PROMPT="some validation prompt text"
When I use "false", I get this.
I have a directory with training data: images and their corresponding text files with captions. This is my dataloader configuration file.
I could run training with earlier versions, but I can't successfully run the script after updating to the new version that uses a dataloader configuration file. I've tried both the latest code as of writing this issue and the most recent release candidate (v0.9.0-rc1). I censored the actual paths and names in the error messages and configuration, but I am using real, existing paths.