Nerogar / OneTrainer

OneTrainer is a one-stop solution for all your stable diffusion training needs.
GNU Affero General Public License v3.0

[Bug]: OneTrainer crashing at 10% on second epoch (SDXL finetune) #354

Open velourlawsuits opened 1 week ago

velourlawsuits commented 1 week ago

What happened?

I have been training SDXL finetune models with OneTrainer for the past two months with great success, until last night when my training session abruptly aborted at 10% into the second epoch. I restarted my computer, deleted the previously saved epoch, the backup I made, and the sample images, and ran the training session again. The program aborted again at 10% into epoch 2. I have since run the auto-update and pip install -r requirements.txt, and I'm running it again. Hopefully it won't crash, but I won't be around to debug, so I'm opening this issue in advance.

What did you expect would happen?

I have the number of epochs set to 10, so I was expecting the training to run to completion.

Relevant log output

activating venv D:\Stable Diffusion\OneTrainer\venv
Using Python "D:\Stable Diffusion\OneTrainer\venv\Scripts\python.exe"
D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\transformers\utils\generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\transformers\utils\generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
Clearing cache directory workspace-cache/cache_1! You can disable this if you want to continue using the same cache.
TensorFlow installation not found - running with reduced feature set.
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.15.1 at http://localhost:6006/ (Press CTRL+C to quit)
Some weights of the model checkpoint were not used when initializing CLIPTextModel:
 ['text_model.embeddings.position_ids']
Some weights of the model checkpoint were not used when initializing CLIPTextModelWithProjection:
 ['text_model.embeddings.position_ids']
enumerating sample paths: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.66it/s]
enumerating sample paths: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.80it/s]
D:\Stable Diffusion\OneTrainer\venv\src\diffusers\src\diffusers\models\attention_processor.py:1279: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
  hidden_states = F.scaled_dot_product_attention(
caching: 100%|█████████████████████████████████████████████████████████████████████| 4262/4262 [15:05<00:00,  4.71it/s]
caching: 100%|█████████████████████████████████████████████████████████████████████| 4262/4262 [02:16<00:00, 31.32it/s]
caching resolutions: 100%|█████████████████████████████████████████████████████| 4262/4262 [00:00<00:00, 272951.26it/s]
sampling: 100%|████████████████████████████████████████████████████████████████████████| 50/50 [00:27<00:00,  1.83it/s]
step: 100%|█████████████████████████████████████| 4262/4262 [5:05:19<00:00,  4.30s/it, loss=0.00347, smooth loss=0.118]
caching resolutions: 100%|█████████████████████████████████████████████████████| 4262/4262 [00:00<00:00, 272547.59it/s]
sampling: 100%|████████████████████████████████████████████████████████████████████████| 50/50 [00:26<00:00,  1.85it/s]
Saving workspace/run\save\LiveLeak_1e-052024-06-21_10-03-41-save-4262-1-0                     | 0/4262 [00:00<?, ?it/s]
step:   0%|                                                                                   | 0/4262 [00:37<?, ?it/s]
epoch:  10%|██████▊                                                             | 1/10 [5:23:20<48:30:07, 19400.84s/it]
Traceback (most recent call last):
  File "D:\Stable Diffusion\OneTrainer\modules\ui\TrainUI.py", line 519, in __training_thread_function
    trainer.train()
  File "D:\Stable Diffusion\OneTrainer\modules\trainer\GenericTrainer.py", line 516, in train
    model_output_data = self.model_setup.predict(self.model, batch, self.config, train_progress)
  File "D:\Stable Diffusion\OneTrainer\modules\modelSetup\BaseStableDiffusionXLSetup.py", line 261, in predict
    text_encoder_output, pooled_text_encoder_2_output = self.__encode_text(
  File "D:\Stable Diffusion\OneTrainer\modules\modelSetup\BaseStableDiffusionXLSetup.py", line 227, in __encode_text
    text_encoder_1_output = model.text_encoder_1(
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 798, in forward
    return self.text_model(
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 691, in forward
    hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 216, in forward
    inputs_embeds = self.token_embedding(input_ids)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\modules\module\AdditionalEmbeddingWrapper.py", line 34, in forward
    return F.embedding(
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\functional.py", line 2237, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
Saving models/LiveLeak_1e-05

Output of pip freeze

No response

gilga2024 commented 1 week ago

I don't know much about the code, but since the message is "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)", it might help to share your configuration for context.

Maybe switch to a much smaller dataset and try to find out under which conditions it happens.
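For illustration only (this is not OneTrainer's actual code), here is a minimal sketch of how this class of device mismatch typically arises in an embedding lookup like the one at the bottom of the traceback, and the usual fix of moving the input to the same device as the weights:

```python
import torch
import torch.nn.functional as F

# The embedding table lives on the GPU, but the token ids were created on the CPU.
weight = torch.randn(49408, 768, device="cuda")        # e.g. a CLIP token embedding table
input_ids = torch.tensor([[49406, 320, 1125, 49407]])  # CPU tensor

# F.embedding(input_ids, weight) would raise:
# RuntimeError: Expected all tensors to be on the same device, but found at least
# two devices, cpu and cuda:0!

# Usual fix: move the ids to the weight's device before the lookup.
embeds = F.embedding(input_ids.to(weight.device), weight)
print(embeds.shape, embeds.device)  # torch.Size([1, 4, 768]) cuda:0
```

In OneTrainer's case the mismatch seems to appear only when the text encoder is run again after the epoch-1 save/sample step, which is why the configuration (caching, device/offloading settings) matters here.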

velourlawsuits commented 1 week ago

Do you work for Donald Glover's company lol

velourlawsuits commented 1 week ago

Running the auto-update and/or pip install -r requirements.txt fixed the issue.
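If it recurs, a quick sanity check (assumption: run from inside the OneTrainer venv; this is not part of OneTrainer itself) to confirm the updated torch install still sees the GPU before starting another multi-hour run:

```python
# Verify the torch version and that CUDA is visible after updating dependencies.
import torch

print(torch.__version__)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```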