Open · playerzer0x opened 1 day ago
your system might be haunted. the code in question makes more sense than the error does. let me show:
```python
def batch_write_embeddings(self):
    """Process write requests in batches."""
    batch = []
```

we're initialising `batch` here to an empty list.

```python
    written_elements = 0
    while True:
        try:
            # Block until an item is available or timeout occurs
            first_item = self.write_queue.get(timeout=1)
            batch = [first_item]
```

and here it's not deleted either; it's set to a list containing the first item.

```python
            # Try to get more items without blocking
            while (
                not self.write_queue.empty() and len(batch) < self.write_batch_size
            ):
```

this is the line where your error occurs ^ but at that point `batch` is either an empty list or a list with one item. maybe later in the loop it deletes the batch:

```python
                logger.debug("Retrieving more items from the queue.")
                items = self.write_queue.get_nowait()
                batch.append(items)
                logger.debug(f"Batch now contains {len(batch)} items.")

            self.process_write_batch(batch)
            self.write_thread_bar.update(len(batch))
            logger.debug("Processed batch write.")
            written_elements += len(batch)
        except queue.Empty:
            # Timeout occurred, no items were ready
            if not self.process_write_batches:
                if len(batch) > 0:
                    self.process_write_batch(batch)
                    self.write_thread_bar.update(len(batch))
                logger.debug(
                    f"Exiting batch write thread, no more work to do after writing {written_elements} elements"
                )
                break
            logger.debug(
                f"Queue is empty. Retrieving new entries. Should retrieve? {self.process_write_batches}"
            )
            pass
        except Exception:
            logger.exception("An error occurred while writing embeddings to disk.")
    logger.debug("Exiting background batch write thread.")
```

nope. no deletions of the `batch` variable.
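For context, `UnboundLocalError` only appears when a function-local name is read before anything has been assigned to it in that call. A minimal sketch of the failure mode being ruled out above (hypothetical names, not the actual SimpleTuner code):

```python
import queue

write_queue = queue.Queue()


def broken_batch_writer():
    # Hypothetical variant: 'batch' is only assigned inside the try block,
    # so when get() times out on an empty queue, the later len(batch)
    # reads a local name that was never bound.
    try:
        batch = [write_queue.get(timeout=0.1)]
    except queue.Empty:
        pass
    return len(batch)  # UnboundLocalError if the queue was empty


try:
    broken_batch_writer()
except UnboundLocalError as exc:
    print(f"reproduced: {exc}")
```

Because the real `batch_write_embeddings` starts with `batch = []`, the name is always bound before the inner `while` condition runs, which makes the reported error on that line genuinely surprising.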
it might be a bad RunPod instance?
> it might be a bad RunPod instance?

Maybe? I'm sometimes re-using a volume across GPU instances in the same region. The training script eventually moves past these errors and starts training, but they come up again every time a checkpoint saves (which takes several minutes). I'll see if I get the same error without using the volume.
Tried running an identical workload on AWS. Start-up gets halted here:
```
2024-11-13 16:10:54,377 [ERROR] Invalidating cache: error loading all_text_cache_files_text-embed-cache from disk. Expecting value: line 1 column 1 (char 0)
2024-11-13 16:10:54,377 [ERROR] Invalidating cache: error loading all_text_cache_files_text-embed-cache from disk. Expecting value: line 1 column 1 (char 0)
2024-11-13 16:10:54,380 [INFO] Pre-computing null embedding
Write embeds to disk: 0%| | 0/1 [00:00<?, ?it/s]
Processing prompts: 0%|
Processing prompts: 0%| | 0/1 [00:17<?, ?it/s]
```
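`Expecting value: line 1 column 1 (char 0)` is what Python's `json` module raises when handed an empty (zero-byte) or truncated file, so the cache list was most likely created but never fully written. A quick check along these lines could confirm that; the filename is an assumption based on the log message and the `cache//text` cache_dir of the text-embed-cache backend further down:

```python
import json
import os

# Assumed path: derived from the "all_text_cache_files_text-embed-cache" log
# message and the text-embed-cache backend's cache_dir; adjust as needed.
cache_list = "cache/text/all_text_cache_files_text-embed-cache.json"

if not os.path.exists(cache_list):
    print("cache list file does not exist")
else:
    print(f"size on disk: {os.path.getsize(cache_list)} bytes")
    with open(cache_list) as f:
        try:
            entries = json.load(f)
            print(f"valid JSON, {len(entries)} entries")
        except json.JSONDecodeError as exc:
            print(f"corrupt or empty JSON: {exc}")
```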
are you copying folders into the instances from another machine to run without having to do the caching again?
I'm cloning a fresh dataset repo from HuggingFace each time so it should be caching upon startup. Not resuming from an existing checkpoint either.
Same issue on a fresh RunPod instance in a different region/data center, with a similar workflow (this time I'm resuming a training run). It managed to process the first couple of datasets, but froze midway through:
```
2024-11-13 21:10:33,020 [INFO] (id=dct_desert_rally_racing_background-512) Collecting captions.
2024-11-13 21:10:33,069 [INFO] (id=dct_desert_rally_racing_background-512) Initialise text embed pre-computation using the textfile caption strategy. We have 25 captions to process.
2024-11-13 21:10:33,384 [INFO] (id=dct_desert_rally_racing_background-512) Completed processing 25 captions.
2024-11-13 21:10:33,384 [INFO] (id=dct_desert_rally_racing_background-512) Creating VAE latent cache.
Write embeds to disk: 33%|██████████████████████▋ | 1/3 [00:00<00:00, 4.52it/s]
Write embeds to disk: 33%|██████████████████████▋ | 1/3 [00:00<00:00, 3.87it/s]
Processing prompts: 0%| | 0/3 [00:00<?, ?it/s]
Write embeds to disk: 0%| | 0/4 [00:00<?, ?it/s]
Processing prompts: 0%| | 0/3 [00:00<?, ?it/s]
Processing prompts: 0%| | 0/4 [00:00<?, ?it/s]
Processing prompts: 0%| | 0/3 [00:00<?, ?it/s]
Processing prompts: 0%| | 0/3 [00:00<?, ?it/s]
```
Here's my config.json in case it's helpful:
```json
{
"--ignore_missing_files": "true",
"--vae_cache_ondemand": "true",
"--lycoris_config": "config/lycoris_config.json",
"--resume_from_checkpoint": "latest",
"--data_backend_config": "config/multidatabackend.json",
"--aspect_bucket_rounding": 2,
"--seed": 42,
"--minimum_image_size": 0,
"--disable_benchmark": false,
"--output_dir": "ghxdct_style_focus_20241113_125207",
"--lora_type": "lycoris",
"--max_train_steps": 10000,
"--num_train_epochs": 0,
"--checkpointing_steps": 500,
"--checkpoints_total_limit": 10,
"--tracker_project_name": "ghxdct_style_focus",
"--tracker_run_name": "ghxdct_style_focus_20241113_125207",
"--report_to": "wandb",
"--model_type": "lora",
"--pretrained_model_name_or_path": "black-forest-labs/FLUX.1-dev",
"--model_family": "flux",
"--train_batch_size": 1,
"--gradient_checkpointing": "true",
"--caption_dropout_probability": 0.05,
"--resolution_type": "pixel_area",
"--resolution": 1024,
"--validation_seed": 69,
"--validation_steps": "500",
"--validation_resolution": "1024x1024",
"--validation_guidance": "3.5",
"--validation_guidance_rescale": "0.0",
"--validation_num_inference_steps": "20",
"--validation_prompt": "a photo of a daisy",
"--mixed_precision": "bf16",
"--optimizer": "optimi-stableadamw",
"--optimizer_config": "weight_decay=1e-3",
"--learning_rate": "5e-06",
"--flux_lora_target": "all+ffs",
"--lr_scheduler": "polynomial",
"--lr_warmup_steps": 100,
"--user_prompt_library": "config/user_prompt_library.json",
"--hub_model_id": "growwithdaisy/ghxdct_style_focus_20241113_125207",
"--push_to_hub": "true",
"--push_checkpoints_to_hub": "true",
"--init_lora": "output/ghxdct_20241106_121319/pytorch_lora_weights.safetensors"
}
```
And multidatabackend:
```json
[
{
"id": "gh_logo-512",
"type": "local",
"instance_data_dir": "datasets/ghxdct_style_focus/gh_logo",
"crop": false,
"crop_style": "random",
"minimum_image_size": 512,
"target_downsample_size": 512,
"resolution": 512,
"resolution_type": "pixel_area",
"repeats": 0,
"metadata_backend": "discovery",
"caption_strategy": "textfile",
"cache_dir_vae": "cache//gh_logo-vae-512"
},
{
"id": "gh_cans-512",
"type": "local",
"instance_data_dir": "datasets/ghxdct_style_focus/gh_cans",
"crop": false,
"crop_style": "random",
"minimum_image_size": 512,
"target_downsample_size": 512,
"resolution": 512,
"resolution_type": "pixel_area",
"repeats": 0,
"metadata_backend": "discovery",
"caption_strategy": "textfile",
"cache_dir_vae": "cache//gh_cans-vae-512"
},
{
"id": "gh_cans-768",
"type": "local",
"instance_data_dir": "datasets/ghxdct_style_focus/gh_cans",
"crop": false,
"crop_style": "random",
"minimum_image_size": 768,
"target_downsample_size": 768,
"resolution": 768,
"resolution_type": "pixel_area",
"repeats": 0,
"metadata_backend": "discovery",
"caption_strategy": "textfile",
"cache_dir_vae": "cache//gh_cans-vae-768"
},
{
"id": "gh_cans-1024",
"type": "local",
"instance_data_dir": "datasets/ghxdct_style_focus/gh_cans",
"crop": false,
"crop_style": "random",
"minimum_image_size": 1024,
"target_downsample_size": 1024,
"resolution": 1024,
"resolution_type": "pixel_area",
"repeats": 0,
"metadata_backend": "discovery",
"caption_strategy": "textfile",
"cache_dir_vae": "cache//gh_cans-vae-1024"
},
{
"id": "dct_desert_rally_racing_background-512",
"type": "local",
"instance_data_dir": "datasets/ghxdct_style_focus/dct_desert_rally_racing_background",
"crop": false,
"crop_style": "random",
"minimum_image_size": 512,
"target_downsample_size": 512,
"resolution": 512,
"resolution_type": "pixel_area",
"repeats": 0,
"metadata_backend": "discovery",
"caption_strategy": "textfile",
"cache_dir_vae": "cache//dct_desert_rally_racing_background-vae-512"
},
{
"id": "dct_desert_rally_racing_background-768",
"type": "local",
"instance_data_dir": "datasets/ghxdct_style_focus/dct_desert_rally_racing_background",
"crop": false,
"crop_style": "random",
"minimum_image_size": 768,
"target_downsample_size": 768,
"resolution": 768,
"resolution_type": "pixel_area",
"repeats": 0,
"metadata_backend": "discovery",
"caption_strategy": "textfile",
"cache_dir_vae": "cache//dct_desert_rally_racing_background-vae-768"
},
{
"id": "dct_desert_rally_racing_background-1024",
"type": "local",
"instance_data_dir": "datasets/ghxdct_style_focus/dct_desert_rally_racing_background",
"crop": false,
"crop_style": "random",
"minimum_image_size": 1024,
"target_downsample_size": 1024,
"resolution": 1024,
"resolution_type": "pixel_area",
"repeats": 0,
"metadata_backend": "discovery",
"caption_strategy": "textfile",
"cache_dir_vae": "cache//dct_desert_rally_racing_background-vae-1024"
},
{
"id": "anytylrjy_woman-512",
"type": "local",
"instance_data_dir": "datasets/ghxdct_style_focus/anytylrjy_woman",
"crop": false,
"crop_style": "random",
"minimum_image_size": 512,
"target_downsample_size": 512,
"resolution": 512,
"resolution_type": "pixel_area",
"repeats": 0,
"metadata_backend": "discovery",
"caption_strategy": "textfile",
"cache_dir_vae": "cache//anytylrjy_woman-vae-512"
},
{
"id": "anytylrjy_woman-768",
"type": "local",
"instance_data_dir": "datasets/ghxdct_style_focus/anytylrjy_woman",
"crop": false,
"crop_style": "random",
"minimum_image_size": 768,
"target_downsample_size": 768,
"resolution": 768,
"resolution_type": "pixel_area",
"repeats": 0,
"metadata_backend": "discovery",
"caption_strategy": "textfile",
"cache_dir_vae": "cache//anytylrjy_woman-vae-768"
},
{
"id": "anytylrjy_woman-1024",
"type": "local",
"instance_data_dir": "datasets/ghxdct_style_focus/anytylrjy_woman",
"crop": false,
"crop_style": "random",
"minimum_image_size": 1024,
"target_downsample_size": 1024,
"resolution": 1024,
"resolution_type": "pixel_area",
"repeats": 0,
"metadata_backend": "discovery",
"caption_strategy": "textfile",
"cache_dir_vae": "cache//anytylrjy_woman-vae-1024"
},
{
"id": "mrtnprr_style-512",
"type": "local",
"instance_data_dir": "datasets/ghxdct_style_focus/mrtnprr_style",
"crop": false,
"crop_style": "random",
"minimum_image_size": 512,
"target_downsample_size": 512,
"resolution": 512,
"resolution_type": "pixel_area",
"repeats": 1,
"metadata_backend": "discovery",
"caption_strategy": "textfile",
"cache_dir_vae": "cache//mrtnprr_style-vae-512"
},
{
"id": "mrtnprr_style-768",
"type": "local",
"instance_data_dir": "datasets/ghxdct_style_focus/mrtnprr_style",
"crop": false,
"crop_style": "random",
"minimum_image_size": 768,
"target_downsample_size": 768,
"resolution": 768,
"resolution_type": "pixel_area",
"repeats": 1,
"metadata_backend": "discovery",
"caption_strategy": "textfile",
"cache_dir_vae": "cache//mrtnprr_style-vae-768"
},
{
"id": "mrtnprr_style-1024",
"type": "local",
"instance_data_dir": "datasets/ghxdct_style_focus/mrtnprr_style",
"crop": false,
"crop_style": "random",
"minimum_image_size": 1024,
"target_downsample_size": 1024,
"resolution": 1024,
"resolution_type": "pixel_area",
"repeats": 1,
"metadata_backend": "discovery",
"caption_strategy": "textfile",
"cache_dir_vae": "cache//mrtnprr_style-vae-1024"
},
{
"id": "text-embed-cache",
"dataset_type": "text_embeds",
"default": true,
"type": "local",
"cache_dir": "cache//text",
"disabled": false,
"write_batch_size": 1
}
]
```
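A side note on the backend config: the doubled slash in the `cache_dir_vae` values (e.g. `cache//gh_logo-vae-512`) is harmless on POSIX filesystems, but since the failures all involve cache reads and writes, a quick sanity pass over every cache directory may be worth running. A minimal sketch, assuming the file above lives at `config/multidatabackend.json`:

```python
import json
import os

# Assumed path to the dataloader config shown above.
with open("config/multidatabackend.json") as f:
    backends = json.load(f)

for backend in backends:
    cache_dir = backend.get("cache_dir_vae") or backend.get("cache_dir")
    if not cache_dir:
        continue
    path = os.path.normpath(cache_dir)  # collapses the double slashes
    exists = os.path.isdir(path)
    writable = os.access(path, os.W_OK) if exists else False
    print(f"{backend['id']}: {path} exists={exists} writable={writable}")
```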
Datasets are very small. <=25 images each, all PNG.
Switched to `main` (was formerly on `release`), and now training successfully proceeds. Could be good to update the Flux quick start instructions if `main` is the correct branch to use.
Training above failed after 15 steps:

```
Epoch 1/113, Steps: 0%| | 15/10000 [00:52<10:02:57, 3.62s/it, lr=7.5e-7, mean_cfg=1, step_loss=0.521]
2024-11-13 21:30:05,136 [ERROR] Failed to load corrupt torch file '/workspace/SimpleTuner/cache/text/10f34c853ae35b374b489315e55183a4-flux.pt': PytorchStreamReader failed reading zip archive: failed finding central directory
2024-11-13 21:30:05,138 [ERROR] Failed retrieving prompt from cache:
-> prompt: dct desert rally racing background, driving a Ducati bike, riding through water
-> filename: /workspace/SimpleTuner/cache/text/10f34c853ae35b374b489315e55183a4-flux.pt
-> error: PytorchStreamReader failed reading zip archive: failed finding central directory
-> id: text-embed-cache, data_backend id: text-embed-cache
Cache retrieval for text embed file failed. Ensure your dataloader config value for skip_file_discovery does not contain 'text', and that preserve_data_backend_cache is disabled or unset.
Traceback (most recent call last):
File "/workspace/SimpleTuner/helpers/caching/text_embeds.py", line 1110, in compute_embeddings_for_flux_prompts
_flux_embed = self.load_from_cache(filename)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/SimpleTuner/helpers/caching/text_embeds.py", line 277, in load_from_cache
result = self.data_backend.torch_load(filename)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/SimpleTuner/helpers/data_backend/local.py", line 207, in torch_load
raise e
File "/workspace/SimpleTuner/helpers/data_backend/local.py", line 202, in torch_load
loaded_tensor = torch.load(stored_tensor, map_location="cpu")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/SimpleTuner/.venv/lib/python3.11/site-packages/torch/serialization.py", line 1072, in load
with _open_zipfile_reader(opened_file) as opened_zipfile:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/SimpleTuner/.venv/lib/python3.11/site-packages/torch/serialization.py", line 480, in __init__
super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/workspace/SimpleTuner/train.py", line 49, in <module>
trainer.train()
File "/workspace/SimpleTuner/helpers/training/trainer.py", line 2136, in train
batch = iterator_fn(step, *iterator_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/SimpleTuner/helpers/data_backend/factory.py", line 1377, in random_dataloader_iterator
return next(chosen_iter)
^^^^^^^^^^^^^^^^^
File "/workspace/SimpleTuner/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "/workspace/SimpleTuner/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 673, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/SimpleTuner/.venv/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
return self.collate_fn(data)
^^^^^^^^^^^^^^^^^^^^^
File "/workspace/SimpleTuner/helpers/data_backend/factory.py", line 977, in <lambda>
collate_fn=lambda examples: collate_fn(examples),
^^^^^^^^^^^^^^^^^^^^
File "/workspace/SimpleTuner/helpers/training/collate.py", line 534, in collate_fn
compute_prompt_embeddings(captions, text_embed_cache)
File "/workspace/SimpleTuner/helpers/training/collate.py", line 290, in compute_prompt_embeddings
embeddings = list(
^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
yield _result_or_cancel(fs.pop())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
return fut.result(timeout)
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/SimpleTuner/helpers/training/collate.py", line 241, in compute_single_embedding
text_embed_cache.compute_embeddings_for_flux_prompts(prompts=[caption])
File "/workspace/SimpleTuner/helpers/caching/text_embeds.py", line 1134, in compute_embeddings_for_flux_prompts
raise Exception(
Exception: Cache retrieval for text embed file failed. Ensure your dataloader config value for skip_file_discovery does not contain 'text', and that preserve_data_backend_cache is disabled or unset.
```
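`failed finding central directory` means the `.pt` file on disk is a truncated zip archive, i.e. a write that never completed, possibly from an interrupted write on the shared volume. One way to clear such files so they get recomputed on the next run is a scan like the following (a sketch, not part of SimpleTuner; the cache root is taken from the traceback, adjust if yours differs):

```python
import glob
import os

import torch

# Cache root taken from the traceback above; adjust if yours differs.
CACHE_ROOT = "/workspace/SimpleTuner/cache/text"

corrupt = []
for path in glob.glob(os.path.join(CACHE_ROOT, "*.pt")):
    try:
        torch.load(path, map_location="cpu")
    except Exception as exc:
        print(f"corrupt: {path} ({exc})")
        corrupt.append(path)

# Removing the bad entries forces the trainer to re-compute those embeds.
for path in corrupt:
    os.remove(path)
print(f"removed {len(corrupt)} corrupt cache file(s)")
```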
I'm training Flux on multi-GPU RunPod instances and receiving this UnboundLocalError during training start-up and upon checkpoint saves. Training continues, thankfully, but I'm sharing in case it needs to be fixed: