Nerogar / OneTrainer

OneTrainer is a one-stop solution for all your stable diffusion training needs.
GNU Affero General Public License v3.0

[Bug]: OneTrainer crashing at 10% on second epoch (SDXL finetune) #354

Open velourlawsuits opened 1 week ago

velourlawsuits commented 1 week ago

What happened?

I have been training SDXL finetune models with OneTrainer for the past two months with great success, until last night when my training session abruptly aborted at 10% into the second epoch. I restarted my computer, deleted the previously saved epoch, the backup I made, and the sample images, and ran the training session again. The program aborted again at 10% into epoch 2. I have since run the auto-update and pip install -r requirements.txt, and I'm running it again. Hopefully it won't crash, but I won't be around to debug, so I'm opening this issue in advance.

What did you expect would happen?

I have the number of epochs set to 10, so I was expecting the training to run to completion.

Relevant log output

activating venv D:\Stable Diffusion\OneTrainer\venv
Using Python "D:\Stable Diffusion\OneTrainer\venv\Scripts\python.exe"
D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\transformers\utils\generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\transformers\utils\generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
Clearing cache directory workspace-cache/cache_1! You can disable this if you want to continue using the same cache.
TensorFlow installation not found - running with reduced feature set.
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.15.1 at http://localhost:6006/ (Press CTRL+C to quit)
Some weights of the model checkpoint were not used when initializing CLIPTextModel:
 ['text_model.embeddings.position_ids']
Some weights of the model checkpoint were not used when initializing CLIPTextModelWithProjection:
 ['text_model.embeddings.position_ids']
enumerating sample paths: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.66it/s]
enumerating sample paths: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.80it/s]
D:\Stable Diffusion\OneTrainer\venv\src\diffusers\src\diffusers\models\attention_processor.py:1279: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
  hidden_states = F.scaled_dot_product_attention(
caching: 100%|█████████████████████████████████████████████████████████████████████| 4262/4262 [15:05<00:00,  4.71it/s]
caching: 100%|█████████████████████████████████████████████████████████████████████| 4262/4262 [02:16<00:00, 31.32it/s]
caching resolutions: 100%|█████████████████████████████████████████████████████| 4262/4262 [00:00<00:00, 272951.26it/s]
sampling: 100%|████████████████████████████████████████████████████████████████████████| 50/50 [00:27<00:00,  1.83it/s]
step: 100%|█████████████████████████████████████| 4262/4262 [5:05:19<00:00,  4.30s/it, loss=0.00347, smooth loss=0.118]
caching resolutions: 100%|█████████████████████████████████████████████████████| 4262/4262 [00:00<00:00, 272547.59it/s]
sampling: 100%|████████████████████████████████████████████████████████████████████████| 50/50 [00:26<00:00,  1.85it/s]
Saving workspace/run\save\LiveLeak_1e-052024-06-21_10-03-41-save-4262-1-0                     | 0/4262 [00:00<?, ?it/s]
step:   0%|                                                                                   | 0/4262 [00:37<?, ?it/s]
epoch:  10%|██████▊                                                             | 1/10 [5:23:20<48:30:07, 19400.84s/it]
Traceback (most recent call last):
  File "D:\Stable Diffusion\OneTrainer\modules\ui\TrainUI.py", line 519, in __training_thread_function
    trainer.train()
  File "D:\Stable Diffusion\OneTrainer\modules\trainer\GenericTrainer.py", line 516, in train
    model_output_data = self.model_setup.predict(self.model, batch, self.config, train_progress)
  File "D:\Stable Diffusion\OneTrainer\modules\modelSetup\BaseStableDiffusionXLSetup.py", line 261, in predict
    text_encoder_output, pooled_text_encoder_2_output = self.__encode_text(
  File "D:\Stable Diffusion\OneTrainer\modules\modelSetup\BaseStableDiffusionXLSetup.py", line 227, in __encode_text
    text_encoder_1_output = model.text_encoder_1(
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 798, in forward
    return self.text_model(
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 691, in forward
    hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 216, in forward
    inputs_embeds = self.token_embedding(input_ids)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Stable Diffusion\OneTrainer\modules\module\AdditionalEmbeddingWrapper.py", line 34, in forward
    return F.embedding(
  File "D:\Stable Diffusion\OneTrainer\venv\lib\site-packages\torch\nn\functional.py", line 2237, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
Saving models/LiveLeak_1e-05

Output of pip freeze

No response

gilga2024 commented 1 week ago

I don't know much about the code, but since the message is "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)", it might help to share your configuration for context.

Maybe switch to a much smaller dataset and try to find out under which conditions it happens.
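For illustration only (this is not OneTrainer's actual code), here is a minimal sketch of how this class of device mismatch typically arises in an embedding lookup like the one at the bottom of the traceback, and the usual fix of moving the input to the same device as the weights:

```python
import torch
import torch.nn.functional as F

# The embedding table lives on the GPU, but the token ids were created on the CPU.
weight = torch.randn(49408, 768, device="cuda")        # e.g. a CLIP token embedding table
input_ids = torch.tensor([[49406, 320, 1125, 49407]])  # CPU tensor

# F.embedding(input_ids, weight) would raise:
# RuntimeError: Expected all tensors to be on the same device, but found at least
# two devices, cpu and cuda:0!

# Usual fix: move the ids to the weight's device before the lookup.
embeds = F.embedding(input_ids.to(weight.device), weight)
print(embeds.shape, embeds.device)  # torch.Size([1, 4, 768]) cuda:0
```

In OneTrainer's case the mismatch seems to appear only when the text encoder is run again after the epoch-1 save/sample step, which is why the configuration (caching, device/offloading settings) matters here.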

velourlawsuits commented 1 week ago

Do you work for Donald Glover's company lol

velourlawsuits commented 1 week ago

Running the auto-update and/or pip install -r requirements.txt fixed the issue.
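If it recurs, a quick sanity check (assumption: run from inside the OneTrainer venv; this is not part of OneTrainer itself) to confirm the updated torch install still sees the GPU before starting another multi-hour run:

```python
# Verify the torch version and that CUDA is visible after updating dependencies.
import torch

print(torch.__version__)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```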