derrian-distro / LoRA_Easy_Training_Scripts

A UI made in Pyside6 to make training LoRA/LoCon and other LoRA type models in sd-scripts easy
GNU General Public License v3.0

Failed to train because of error: LoRA Trainer crashes when trying to save a save state to a directory in the root of the F: drive #199

Closed peteer01 closed 3 months ago

peteer01 commented 3 months ago

I am able to run LoRA Easy Training Scripts successfully without issue. When I load the same TOML file, change only the "Save State" and "Save Last State" options to on, set "Epochs" to 1, and try to save to the same F:\LoRA settings folder that holds the TOML and LoRA safetensors files, I get the following error at the completion of the first epoch:

RuntimeError: Parent directory F: does not exist.
steps:   0%|▏                                                     | 31/9300 [14:32<72:25:56, 28.13s/it, avr_loss=0.091]
Failed to train because of error:
Command '['C:\\Users\\petee\\LoRA_Easy_Training_Scripts\\sd_scripts\\venv\\Scripts\\python.exe', 'sd_scripts\\sdxl_train_network.py', '--config_file=runtime_store\\config.toml', '--dataset_config=runtime_store\\dataset.toml']' returned non-zero exit status 1.

It appears that this error occurs either because the directory is in the root of the F: drive, or because the app doesn't like saving save states to F: at all. The workaround is to save to a folder on C:. (I am currently saving to a subfolder inside the Easy_LoRA_Training_Scripts folder.)

Having the training data in a folder in the root of the F: drive does not cause an issue, nor does saving the LoRA epochs there; it's only when trying to save the save state to the root of F: that LoRA Trainer crashes.
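One plausible explanation (an assumption on my part, not verified against torch's serializer source) is that the parent-directory check splits the save path only at the last forward slash, while the state folder name is appended with a Windows backslash. For a root-level output directory, that split yields the bare drive letter "F:", which fails the existence check; with a subfolder, the split yields a real directory. A minimal sketch of that idea (`guessed_parent` is illustrative, not the actual torch code):

```python
# Illustrative sketch only -- guessed_parent is NOT the real torch code,
# just a model of a parent-directory check that splits at the last '/'.
def guessed_parent(path: str) -> str:
    """Return everything before the last forward slash, ignoring
    backslashes, mimicking a check that only understands '/'."""
    cut = path.rfind("/")
    return path[:cut] if cut != -1 else "."

# The state folder name is joined with a Windows backslash, so a
# root-level output dir leaves only one '/', right after the drive letter:
print(guessed_parent("F:/LoRA settings\\epoch-000001-state"))
# -> F:
# With a subfolder, the last '/' sits before an existing directory:
print(guessed_parent("F:/LoRA settings/subfolder\\epoch-000001-state"))
# -> F:/LoRA settings
```

This would match both observations: the bare "F:" in the error message, and the crash disappearing once a subfolder is used.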

I hope that helps. Let me know what additional information you need.

Jelosus2 commented 3 months ago

Can you post the full error?

peteer01 commented 3 months ago
RuntimeError: Parent directory F: does not exist.
steps: 0%|▏ | 31/9300 [14:32<72:25:56, 28.13s/it, avr_loss=0.091]
Failed to train because of error:
Command '['C:\Users\petee\LoRA_Easy_Training_Scripts\sd_scripts\venv\Scripts\python.exe', 'sd_scripts\sdxl_train_network.py', '--config_file=runtime_store\config.toml', '--dataset_config=runtime_store\dataset.toml']' returned non-zero exit status 1.

saving checkpoint: F:/LoRA settings\epoch-000001.safetensors

saving state at epoch 1
Traceback (most recent call last):
File "C:\Users\petee\LoRA_Easy_Training_Scripts\sd_scripts\sdxl_train_network.py", line 189, in <module>
trainer.train(args)
File "C:\Users\petee\LoRA_Easy_Training_Scripts\sd_scripts\train_network.py", line 883, in train
train_util.save_and_remove_state_on_epoch_end(args, accelerator, epoch + 1)
File "C:\Users\petee\LoRA_Easy_Training_Scripts\sd_scripts\library\train_util.py", line 4322, in save_and_remove_state_on_epoch_end
accelerator.save_state(state_dir)
File "C:\Users\petee\LoRA_Easy_Training_Scripts\sd_scripts\venv\lib\site-packages\accelerate\accelerator.py", line 2795, in save_state
save_location = save_accelerator_state(
File "C:\Users\petee\LoRA_Easy_Training_Scripts\sd_scripts\venv\lib\site-packages\accelerate\checkpointing.py", line 76, in save_accelerator_state
save(state, output_model_file)
File "C:\Users\petee\LoRA_Easy_Training_Scripts\sd_scripts\venv\lib\site-packages\accelerate\utils\other.py", line 127, in save
torch.save(obj, f)
File "C:\Users\petee\LoRA_Easy_Training_Scripts\sd_scripts\venv\lib\site-packages\torch\serialization.py", line 628, in save
with _open_zipfile_writer(f) as opened_zipfile:
File "C:\Users\petee\LoRA_Easy_Training_Scripts\sd_scripts\venv\lib\site-packages\torch\serialization.py", line 502, in _open_zipfile_writer
return container(name_or_buffer)
File "C:\Users\petee\LoRA_Easy_Training_Scripts\sd_scripts\venv\lib\site-packages\torch\serialization.py", line 473, in __init__
super().__init__(torch._C.PyTorchFileWriter(self.name))

Jelosus2 commented 3 months ago

I think I know what the issue is. To confirm, could you save the epochs and training config in a subdirectory and see if it errors out? Something like F:/LoRA settings/subdirectory. If the issue still persists, try removing the spaces from the folder name.

peteer01 commented 3 months ago

Making a subdirectory and saving there prevented the error from occurring.

steps:  50%|██████████████████████████████                              | 5/10 [03:20<03:20, 40.03s/it, avr_loss=0.134]
saving checkpoint: F:/LoRA settings/subfolder\epoch-000001.safetensors

saving state at epoch 1

epoch 2/2
Jelosus2 commented 3 months ago

Nice, I guess the issue can be closed now?

peteer01 commented 3 months ago

Is there an easy fix that can be made to prevent this error? I'm not sure how difficult it would be. If using a subfolder is necessary to avoid the issue, it might be good to add that to the documentation until it's fixed.
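If a fix were attempted on this side, one hypothetical approach (the function name is illustrative, not from the codebase) would be to normalize the output path to a single separator style before it is handed to torch.save, so any split at the last '/' yields the real parent directory instead of a bare drive letter:

```python
from pathlib import PureWindowsPath

def normalize_win_path(raw: str) -> str:
    """Hypothetical pre-save guard: collapse mixed '/' and '\\'
    separators in a Windows path into forward slashes."""
    return PureWindowsPath(raw).as_posix()

print(normalize_win_path("F:/LoRA settings\\epoch-000001-state"))
# -> F:/LoRA settings/epoch-000001-state
```

This is only a sketch under the assumption that the mixed-separator path is the trigger; the actual check lives in torch's serialization layer, which is why a proper fix would belong upstream.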

derrian-distro commented 3 months ago

Honestly, I assume this will never be fixed here. The only way to get it fixed is to open an issue on kohya's sd-scripts.