Nerogar / OneTrainer

OneTrainer is a one-stop solution for all your stable diffusion training needs.
GNU Affero General Public License v3.0

[Bug]: train device cuda:1 #261

Open ToJl9TopTonop opened 4 months ago

ToJl9TopTonop commented 4 months ago

What happened?


Setting the train device to cuda:1 does not work. Train device cuda:0 works, and plain cuda works.

Simple workaround: in start-ui.bat, add set CUDA_VISIBLE_DEVICES=1.

What did you expect would happen?

There are three video cards:

- RTX 4060 Ti 16 GB: cuda:0 or CUDA_VISIBLE_DEVICES=0
- Tesla P40 24 GB: cuda:1 or CUDA_VISIBLE_DEVICES=1 <- I need to select this one
- Tesla P4 8 GB: cuda:2 or CUDA_VISIBLE_DEVICES=2
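
For reference, a quick way to confirm which index PyTorch assigns to each card (a minimal sketch, assuming it is run from the OneTrainer venv):

```python
import torch

# List every CUDA device PyTorch can currently see, together with its index.
# With no CUDA_VISIBLE_DEVICES filtering, the Tesla P40 should show up as index 1.
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} -> {torch.cuda.get_device_name(i)}")
```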

Relevant log output

Exception in thread Thread-1 (__training_thread_function):
Traceback (most recent call last):
  File "C:\Users\IBers\AppData\Local\Programs\Python\Python310\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "C:\Users\IBers\AppData\Local\Programs\Python\Python310\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "D:\OneTrainer-master\modules\ui\TrainUI.py", line 475, in __training_thread_function
    ZLUDA.initialize_devices(self.train_config)
  File "D:\OneTrainer-master\modules\zluda\ZLUDA.py", line 34, in initialize_devices
    if not is_zluda(config.train_device) and not is_zluda(config.temp_device):
  File "D:\OneTrainer-master\modules\zluda\ZLUDA.py", line 12, in is_zluda
    return torch.cuda.get_device_name(device).endswith("[ZLUDA]")
  File "D:\OneTrainer-master\venv\lib\site-packages\torch\cuda\__init__.py", line 423, in get_device_name
    return get_device_properties(device).name
  File "D:\OneTrainer-master\venv\lib\site-packages\torch\cuda\__init__.py", line 456, in get_device_properties
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id

Output of pip freeze

No response

Nerogar commented 4 months ago

I don't have multiple GPUs to test this, but it seems a bit strange that cuda:0 works and cuda:1 doesn't. The device name "cuda:1" is just passed to PyTorch without any additional checks.
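
For context, the failing call from the traceback boils down to the snippet below (a minimal sketch, not OneTrainer's exact code): the configured device string goes straight into torch.cuda.get_device_name, and get_device_properties asserts the index is below torch.cuda.device_count(), which is where "Invalid device id" comes from.

```python
import torch

def is_zluda(device) -> bool:
    # Same check as modules/zluda/ZLUDA.py in the traceback: the configured
    # device string ("cuda:1") is handed to torch unchanged.
    # get_device_name() -> get_device_properties() raises
    # AssertionError("Invalid device id") if the index is >= device_count().
    return torch.cuda.get_device_name(device).endswith("[ZLUDA]")
```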

jjohare commented 4 months ago

Curiously, I am also seeing this now, when I did not previously.

noisefloordev commented 3 months ago

My understanding is that setting "CUDA_VISIBLE_DEVICES=1" makes CUDA expose only a single device to the application. You're telling the application to use the second device, but as far as it can tell there is only one (the one CUDA is exposing to it). So I think you want either CUDA_VISIBLE_DEVICES or the device selection in the application, not both.

(At least that was my experience; I ran into the same problem with the SD webui when trying to select a GPU.)
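
A minimal repro of that interaction, assuming a machine where physical GPU 1 exists (the env var must be set before torch initializes CUDA):

```python
import os

# Hide everything except physical GPU 1 before importing torch,
# mirroring the start-ui.bat workaround from the original report.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

print(torch.cuda.device_count())             # 1 -- only that card remains visible
print(torch.cuda.get_device_name("cuda:0"))  # works; the remaining card is re-indexed to 0
# torch.cuda.get_device_name("cuda:1")       # raises AssertionError: Invalid device id
```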

O-J1 commented 1 month ago

@ToJl9TopTonop Please confirm whether this is still an issue for you, and if not, what your solution was so I can document it in the wiki.