Nerogar / OneTrainer

OneTrainer is a one-stop solution for all your stable diffusion training needs.

[Bug]: Repeated error during training after 1 epoch and 10% of the next. #436

Open velourlawsuits opened 3 weeks ago

velourlawsuits commented 3 weeks ago

instagirl_config.json

What happened?

For the last few months I've been hitting the same error every time I try to train a model (an SDXL finetune): training aborts after 1 epoch, at 10% completion of the next epoch. I've tried reinstalling from scratch, updating, experimenting with different parameters, etc. Nothing seems to work. This was not previously a problem and I have no idea what caused it to suddenly appear, but I can't train models past 1 epoch anymore, which is very frustrating. Any help would be very appreciated.

EDIT: It appears the issue is specific to saving the model output in the diffusers format. I just ran a .safetensors output that's now on epoch 4 and counting. I've had false positives a few other times, so I'll update this again if the issue shows up in the next finetune I run, which should be within the week. A sketch of that workaround as a config edit follows.
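For reference, a minimal sketch of the workaround applied to the exported config file. The field name "output_model_format" and its values are assumptions about OneTrainer's exported config, not confirmed from the source; verify them against the keys actually present in instagirl_config.json (the same switch can also be made in the UI's model output settings).

import json

# Hedged sketch: point the trainer at a .safetensors output instead of the
# diffusers folder format. The field name and values are assumptions; check
# your own exported config before relying on them.
with open("instagirl_config.json") as f:
    config = json.load(f)

config["output_model_format"] = "SAFETENSORS"  # was "DIFFUSERS"

with open("instagirl_config.json", "w") as f:
    json.dump(config, f, indent=4)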

What did you expect would happen?

The model should have trained for the full 10 epochs specified in the parameters.

Relevant log output

activating venv D:\Stable Diffusion\OneTrainer\venv
Using Python "D:\Stable Diffusion\OneTrainer\venv\Scripts\python.exe"
Clearing cache directory workspace-cache/cache_1! You can disable this if you want to continue using the same cache.
TensorFlow installation not found - running with reduced feature set.
model.safetensors:   1%|▍                                                         | 21.0M/2.78G [00:01<02:55, 15.7MB/s]
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.17.0 at http://localhost:6006/ (Press CTRL+C to quit)
model.safetensors: 100%|██████████████████████████████████████████████████████████| 2.78G/2.78G [02:41<00:00, 17.2MB/s]
config.json: 100%|████████████████████████████████████████████████████████████████████████████| 607/607 [00:00<?, ?B/s]
diffusion_pytorch_model.safetensors: 100%|██████████████████████████████████████████| 335M/335M [00:19<00:00, 17.4MB/s]
diffusion_pytorch_model.safetensors: 100%|████████████████████████████████████████| 10.3G/10.3G [10:18<00:00, 16.6MB/s]
enumerating sample paths: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.39it/s]
enumerating sample paths: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.39it/s]
D:\Stable Diffusion\OneTrainer\venv\src\diffusers\src\diffusers\models\attention_processor.py:1476: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
  hidden_states = F.scaled_dot_product_attention(
caching: 100%|█████████████████████████████████████████████████████████████████████| 5707/5707 [14:27<00:00,  6.58it/s]
caching: 100%|█████████████████████████████████████████████████████████████████████| 5707/5707 [01:43<00:00, 55.37it/s]
sampling: 100%|████████████████████████████████████████████████████████████████████████| 50/50 [00:33<00:00,  1.50it/s]
step: 100%|█████████████████████████████████████| 5707/5707 [4:46:15<00:00,  3.01s/it, loss=0.00497, smooth loss=0.129]
sampling: 100%|████████████████████████████████████████████████████████████████████████| 50/50 [00:32<00:00,  1.53it/s]
Saving workspace/run\save\instagirl2024-08-16_18-43-53-save-5707-1-0                          | 0/5707 [00:00<?, ?it/s]
step:   0%|                                                                                   | 0/5707 [00:43<?, ?it/s]
epoch:  10%|██████▊                                                             | 1/10 [5:03:12<45:28:55, 18192.87s/it]
Traceback (most recent call last):
  File "D:\Stable Diffusion\OneTrainer\modules\ui\TrainUI.py", line 543, in __training_thread_function
    trainer.train()
  File "D:\Stable Diffusion\OneTrainer\modules\trainer\GenericTrainer.py", line 575, in train
    model_output_data = self.model_setup.predict(self.model, batch, self.config, train_progress)
  File "D:\Stable Diffusion\OneTrainer\modules\modelSetup\BaseStableDiffusionXLSetup.py", line 280, in predict
    text_encoder_output, pooled_text_encoder_2_output = self.__encode_text(
  File "D:\Stable Diffusion\OneTrainer\modules\modelSetup\BaseStableDiffusionXLSetup.py", line 246, in __encode_text
    text_encoder_1_output = model.text_encoder_1(
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 806, in forward
    return self.text_model(
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 698, in forward
    hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 218, in forward
    inputs_embeds = self.token_embedding(input_ids)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\modules\module\AdditionalEmbeddingWrapper.py", line 42, in forward
    return F.embedding(
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\functional.py", line 2264, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
Saving models/instagirl
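The traceback ends in F.embedding inside AdditionalEmbeddingWrapper.forward, with the weight on cpu and the input ids on cuda:0, and it fires on the first step after the intermediate save. That pattern is consistent with a module being moved to the CPU for the diffusers-format save and not moved back before training resumes. A minimal sketch that reproduces the same RuntimeError (an illustration of the failure mode under that assumption, not OneTrainer's actual save code):

import torch
import torch.nn as nn

if torch.cuda.is_available():
    # Weight left on the CPU (as if offloaded for a save), ids on the GPU.
    embedding = nn.Embedding(49408, 768)                      # CLIP-sized token embedding
    input_ids = torch.randint(0, 49408, (1, 77), device="cuda")

    try:
        embedding(input_ids)                                  # calls F.embedding underneath
    except RuntimeError as e:
        # Expected all tensors to be on the same device, but found at least
        # two devices, cpu and cuda:0!
        print(e)

    embedding.to("cuda")                                      # moving back fixes it
    hidden_states = embedding(input_ids)                      # now succeeds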

Output of pip freeze

No response

mx commented 3 weeks ago

You need to upload your configuration; I suspect the problem is there. Please go to the Discord help channel, read the pinned post on how to export your config, and follow those instructions. Closing this for now since it's likely a config issue; if it turns out to be an actual bug, I'll reopen it.

mx commented 3 weeks ago

Confirmed to be an actual issue; user will upload their configs here.

velourlawsuits commented 3 weeks ago

instagirl_config.json