Nerogar / OneTrainer

OneTrainer is a one-stop solution for all your stable diffusion training needs.
GNU Affero General Public License v3.0

[Bug]: Multiple instances of crash during SDXL (Finetune) training #362

Open velourlawsuits opened 5 days ago

velourlawsuits commented 5 days ago

What happened?

I've been experiencing a periodic crash while training an SDXL finetune. My training settings have been identical for the past two months, and this bug only appeared about a week ago. I have tried updating, re-downloading the requirements, and doing a fresh install in a separate directory. Each 'fix' has worked for one training session, but whenever I start a new project it runs into the same error. I had previously opened a bug report for this but closed it because I thought the update had fixed it; I will reopen that as well. To be clear, this is a new error, and the only things I changed were the dataset for the new training and some filename tweaks. Furthermore, the dataset I'm using has been adjusted three times since the first error, and if it were the problem, why would training work on a fresh install/update?

What did you expect would happen?

The training should have completed without error.

Relevant log output

activating venv D:\one_ai\OneTrainer\venv
Using Python "D:\one_ai\OneTrainer\venv\Scripts\python.exe"
Clearing cache directory workspace-cache/cache_1! You can disable this if you want to continue using the same cache.
D:\one_ai\OneTrainer\venv\src\diffusers\src\diffusers\loaders\single_file.py:340: FutureWarning: `original_config_file` is deprecated and will be removed in version 1.0.0. `original_config_file` argument is deprecated and will be removed in future versions.please use the `original_config` argument instead.
  deprecate("original_config_file", "1.0.0", deprecation_message)
TensorFlow installation not found - running with reduced feature set.
Fetching 17 files: 100%|███████████████████████████████████████████████████████████████████████| 17/17 [00:00<?, ?it/s]
Loading pipeline components...:  14%|███████▍                                            | 1/7 [00:00<00:01,  4.90it/s]
Some weights of the model checkpoint were not used when initializing CLIPTextModel:
 ['text_model.embeddings.position_ids']
Loading pipeline components...:  71%|█████████████████████████████████████▏              | 5/7 [00:01<00:00,  3.80it/s]
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.16.2 at http://localhost:6006/ (Press CTRL+C to quit)
Loading pipeline components...: 100%|████████████████████████████████████████████████████| 7/7 [00:04<00:00,  1.50it/s]
enumerating sample paths: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.27it/s]
enumerating sample paths: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.27it/s]
D:\one_ai\OneTrainer\venv\src\diffusers\src\diffusers\models\attention_processor.py:1406: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
  hidden_states = F.scaled_dot_product_attention(
caching: 100%|█████████████████████████████████████████████████████████████████████| 3632/3632 [09:33<00:00,  6.33it/s]
caching: 100%|█████████████████████████████████████████████████████████████████████| 3632/3632 [01:35<00:00, 37.88it/s]
sampling: 100%|████████████████████████████████████████████████████████████████████████| 20/20 [00:13<00:00,  1.50it/s]
D:\one_ai\OneTrainer\venv\lib\site-packages\torch\autograd\graph.py:744: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ..\aten\src\ATen\native\cudnn\Conv_v8.cpp:919.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
step: 100%|██████████████████████████████████████| 3632/3632 [1:53:33<00:00,  1.88s/it, loss=0.0667, smooth loss=0.131]
sampling: 100%|████████████████████████████████████████████████████████████████████████| 20/20 [00:13<00:00,  1.49it/s]
Saving workspace/run\save\LiveLeak_1e-05_Refined_RealVis2024-06-25_16-12-26-save-3632-1-0
step:   0%|                                                                                   | 0/3632 [00:24<?, ?it/s]
epoch:  25%|█████████████████▊                                                     | 1/4 [2:05:10<6:15:32, 7510.67s/it]
Traceback (most recent call last):
  File "D:\one_ai\OneTrainer\modules\ui\TrainUI.py", line 538, in __training_thread_function
    trainer.train()
  File "D:\one_ai\OneTrainer\modules\trainer\GenericTrainer.py", line 572, in train
    model_output_data = self.model_setup.predict(self.model, batch, self.config, train_progress)
  File "D:\one_ai\OneTrainer\modules\modelSetup\BaseStableDiffusionXLSetup.py", line 276, in predict
    text_encoder_output, pooled_text_encoder_2_output = self.__encode_text(
  File "D:\one_ai\OneTrainer\modules\modelSetup\BaseStableDiffusionXLSetup.py", line 242, in __encode_text
    text_encoder_1_output = model.text_encoder_1(
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 807, in forward
    return self.text_model(
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 699, in forward
    hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 219, in forward
    inputs_embeds = self.token_embedding(input_ids)
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\one_ai\OneTrainer\modules\module\AdditionalEmbeddingWrapper.py", line 41, in forward
    return F.embedding(
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\functional.py", line 2264, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
Saving models/LiveLeak_1e-05_Refined_RealVis
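
For anyone triaging this: the final RuntimeError is PyTorch's generic device-mismatch error raised from `F.embedding`, and the `wrapper_CUDA__index_select` note in the message indicates the embedding weight was on cuda:0 while the index tensor (the token ids) was still on the CPU. Below is a minimal standalone sketch of that error class and the usual defensive fix of moving the indices to the weight's device; it is a hypothetical illustration, not OneTrainer code.

```python
import torch
import torch.nn.functional as F

if torch.cuda.is_available():
    # Hypothetical setup mirroring the traceback: the embedding table was
    # moved to the GPU, but the token ids are still on the CPU.
    weight = torch.randn(49408, 768, device="cuda")   # embedding weights on cuda:0
    input_ids = torch.randint(0, 49408, (1, 77))      # token ids left on cpu

    try:
        F.embedding(input_ids, weight)
    except RuntimeError as e:
        # Prints: "Expected all tensors to be on the same device, but found
        # at least two devices, cpu and cuda:0! ..."
        print(e)

    # Common defensive pattern: look up embeddings on the weight's device.
    out = F.embedding(input_ids.to(weight.device), weight)
    print(out.shape, out.device)
```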

Output of pip freeze

No response