Open: AfterHAL opened this issue 3 months ago
Hi, please try one of these options:
CUDA_VISIBLE_DEVICES="1" accelerate launch
or add this at the beginning of the training script:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
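A quick way to confirm which physical GPU actually ends up visible is a check like the sketch below (an illustrative snippet, assuming PyTorch is installed and the variable is set before torch initializes CUDA):

import os

# Must be set before CUDA is initialized, otherwise it is ignored.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

# With CUDA_VISIBLE_DEVICES="1", the single visible GPU is remapped to cuda:0.
print("visible devices:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} ->", torch.cuda.get_device_name(i))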
Thanks a lot @Anghellia.
My bad! I'm using Ubuntu under Windows with WSL 2, where the CUDA devices are sorted by "power" by default (CUDA_DEVICE_ORDER=FASTEST_FIRST), while under Windows they are sorted by PCI bus number (CUDA_DEVICE_ORDER=PCI_BUS_ID). So, in my case, the CUDA devices 0,1,2,3 under Windows become 1,0,2,3 under Linux.
And something strange is happening: when CUDA_DEVICE_ORDER is not set, accelerate defaults to the first CUDA device on the PCI bus, while setting CUDA_VISIBLE_DEVICES selects devices in "fastest first" order...
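To keep the Linux indices aligned with what Windows and nvidia-smi report, the ordering can be pinned explicitly before CUDA is initialized. A minimal sketch, assuming the ordering behaviour described above:

import os

# Use PCI bus ordering (same as nvidia-smi / Windows) instead of "fastest first",
# so that index 1 refers to the same physical card on both systems.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

# The only visible GPU is remapped to cuda:0 inside the process.
print(torch.cuda.get_device_name(0))

The same two variables can also be set on the command line in front of accelerate launch.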
By the way, I am now facing an error that I don't understand after "Init AE":
(xflux_env) xtash@PS3:/mnt/w/xFluxLinux$ CUDA_VISIBLE_DEVICES=0 accelerate launch train_flux_lora_deepspeed.py --config "train_configs/test_lora.yaml"
[2024-08-22 19:38:56,617] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/mnt/w/xFluxLinux/xflux_env/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
def forward(ctx, input, weight, bias=None):
/mnt/w/xFluxLinux/xflux_env/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
def backward(ctx, grad_output):
[2024-08-22 19:39:40,294] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/mnt/w/xFluxLinux/xflux_env/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
def forward(ctx, input, weight, bias=None):
/mnt/w/xFluxLinux/xflux_env/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
def backward(ctx, grad_output):
[2024-08-22 19:39:44,808] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-22 19:39:44,808] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
/mnt/w/xFluxLinux/xflux_env/lib/python3.10/site-packages/accelerate/accelerator.py:401: UserWarning: `log_with=wandb` was passed but no supported trackers are currently installed.
warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
08/22/2024 19:39:44 - INFO - __main__ - Distributed environment: DEEPSPEED Backend: nccl
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: bf16
ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 1, 'zero_optimization': {'stage': 2, 'offload_optimizer': {'device': 'none', 'nvme_path': None}, 'offload_param': {'device': 'none', 'nvme_path': None}, 'stage3_gather_16bit_weights_on_model_save': False}, 'gradient_clipping': 'auto', 'steps_per_print': inf, 'bf16': {'enabled': True}, 'fp16': {'enabled': False}}
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 4.23it/s]
Init model
Loading checkpoint
Init AE
E0822 19:41:38.953000 139846884818944 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 0 (pid: 100376) of binary: /mnt/w/xFluxLinux/xflux_env/bin/python3
Traceback (most recent call last):
File "/mnt/w/xFluxLinux/xflux_env/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/mnt/w/xFluxLinux/xflux_env/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/mnt/w/xFluxLinux/xflux_env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1067, in launch_command
deepspeed_launcher(args)
File "/mnt/w/xFluxLinux/xflux_env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 771, in deepspeed_launcher
distrib_run.run(args)
File "/mnt/w/xFluxLinux/xflux_env/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/mnt/w/xFluxLinux/xflux_env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/mnt/w/xFluxLinux/xflux_env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train_flux_lora_deepspeed.py FAILED
-------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-08-22_19:41:38
host : PS3.
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 100376)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 100376
=======================================================
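For reference, exit code -9 means the process was terminated with SIGKILL; during model or autoencoder loading this is commonly the Linux OOM killer reacting to host RAM (not VRAM) running out, which is easy to hit under WSL 2's default memory limit. A small diagnostic sketch that could be dropped in right before the AE is loaded (hypothetical placement; assumes psutil is installed):

import psutil
import torch

def log_memory(tag: str) -> None:
    # Print host RAM and GPU memory headroom at a given point in the script.
    vm = psutil.virtual_memory()
    print(f"[{tag}] host RAM: {vm.available / 1e9:.1f} GB free of {vm.total / 1e9:.1f} GB")
    if torch.cuda.is_available():
        free, total = torch.cuda.mem_get_info()
        print(f"[{tag}] GPU: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")

log_memory("before AE load")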
How can I run the LoRA trainer on my second CUDA device? It seems to run fine on the first CUDA device, but that one doesn't have enough VRAM, so I wanted to start the LoRA training script on CUDA device 1 (the second one).
Unfortunately, the following command doesn't set the CUDA device to index 1:
CUDA_VISIBLE_DEVICES=1 accelerate launch --config_file "accelconfig.yaml" train_flux_lora_deepspeed.py --config "train_configs/test_lora.yaml"
Any advice?
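One way to double-check which device accelerate actually picks up inside the training script is to print it from the Accelerator object. A minimal sketch, assuming the standard accelerate API:

from accelerate import Accelerator

accelerator = Accelerator()
# With CUDA_VISIBLE_DEVICES=1 the visible GPU is remapped, so this prints cuda:0,
# but it refers to the second physical card.
print("accelerate device:", accelerator.device)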