Started getting an error running training

renatomserra commented 2 months ago

Hello, I started getting this error with the same container. Any ideas?

No dependencies to install or update
Traceback (most recent call last):
  File "/root/SimpleTuner/.venv/bin/accelerate", line 5, in <module>
    from accelerate.commands.accelerate_cli import main
  File "/root/SimpleTuner/.venv/lib/python3.11/site-packages/accelerate/__init__.py", line 16, in <module>
    from .accelerator import Accelerator
  File "/root/SimpleTuner/.venv/lib/python3.11/site-packages/accelerate/accelerator.py", line 32, in <module>
    import torch
  File "/root/SimpleTuner/.venv/lib/python3.11/site-packages/torch/__init__.py", line 368, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: libnccl.so.2: cannot open shared object file: No such file or directory

my config:

{
    "--pretrained_model_name_or_path": "black-forest-labs/FLUX.1-dev",
    "--model_family": "flux",
    "--model_type": "lora",
    "--lora_type": "standard",
    "--lora_rank": 16,
    "--flux_lora_target": "all+ffs",
    "--optimizer": "adamw_bf16",
    "--train_batch_size": 1,
    "--gradient_accumulation_steps": 1,
    "--learning_rate": "4e-4",
    "--max_train_steps": 2000,
    "--num_train_epochs": 0,
    "--checkpointing_steps": 500,
    "--validation_steps": 200,
    "--validation_prompt": "A full-body action shot of Chillychills the cat",
    "--validation_seed": 42,
    "--validation_resolution": "1024x1024",
    "--validation_guidance": 3.5,
    "--validation_guidance_rescale": "0.0",
    "--validation_num_inference_steps": "28",
    "--validation_negative_prompt": "",
    "--hub_model_id": "flux-lora-123456",
    "--tracker_project_name": "flux-lora-123456",
    "--tracker_run_name": "flux-lora-123456",
    "--resume_from_checkpoint": "latest",
    "--data_backend_config": "config/multidatabackend.json",
    "--aspect_bucket_rounding": 2,
    "--seed": 42,
    "--minimum_image_size": 0,
    "--output_dir": "/root/SimpleTuner/output/models",
    "--checkpoints_total_limit": 2,
    "--push_to_hub": "true",
    "--push_checkpoints_to_hub": "true",
    "--report_to": "none",
    "--flux_guidance_value": 1.0,
    "--max_grad_norm": 1.0,
    "--flux_schedule_auto_shift": "true",
    "--validation_on_startup": "true",
    "--gradient_checkpointing": "true",
    "--caption_dropout_probability": 0.05,
    "--vae_batch_size": 1,
    "--allow_tf32": "true",
    "--resolution_type": "pixel_area",
    "--resolution": 512,
    "--mixed_precision": "bf16",
    "--lr_scheduler": "constant_with_warmup",
    "--lr_warmup_steps": 100,
    "--metadata_update_interval": 60,
    "--validation_torch_compile": "false"
}

AmericanPresidentJimmyCarter commented 2 months ago

ImportError: libnccl.so.2: cannot open shared object file: No such file or directory

Sounds like nvidia drivers are not installed correctly.
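A quick way to confirm what the traceback is saying is to ask the dynamic loader directly whether it can open `libnccl.so.2`. The sketch below is only a rough diagnostic, and the helper name `can_load_shared_library` is made up for illustration. Keep in mind that pip wheels of PyTorch can ship NCCL inside the venv's `site-packages`, so a negative result from a system-wide lookup is a hint rather than proof.

```python
import ctypes
import ctypes.util


def can_load_shared_library(name: str) -> bool:
    """Return True if the dynamic loader can open `name` (e.g. 'libnccl.so.2').

    Hypothetical helper for illustration; not part of SimpleTuner or PyTorch.
    """
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        # Same failure mode as the ImportError above: the loader cannot open the file.
        return False


if __name__ == "__main__":
    # find_library searches the standard system locations for a library named 'nccl'.
    print("find_library('nccl'):", ctypes.util.find_library("nccl"))
    print("libnccl.so.2 loadable:", can_load_shared_library("libnccl.so.2"))
```

If both lines come back empty or `False`, the library is not visible on the default search path, which matches the error in the traceback.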

renatomserra commented 2 months ago

Hmm, strange. I'm following the guide like I have been before, and it stopped working 🤔

AmericanPresidentJimmyCarter commented 2 months ago

Sometimes something as simple as an apt update can bork nvidia drivers. What does nvidia-smi show?
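For anyone who wants to capture the relevant part of `nvidia-smi` as text, here is a minimal sketch that runs it and pulls out the `CUDA Version` field from the header. The function names are hypothetical, and the regex assumes the usual `CUDA Version: X.Y` header format, which may differ on some driver releases.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_cuda_version(smi_output: str) -> Optional[str]:
    """Extract the 'CUDA Version: X.Y' field from nvidia-smi's header, if present."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", smi_output)
    return match.group(1) if match else None


def driver_cuda_version() -> Optional[str]:
    """Run nvidia-smi and return the CUDA version it reports, or None on failure."""
    if shutil.which("nvidia-smi") is None:
        return None  # binary not on PATH: driver tools are probably missing
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    if result.returncode != 0:
        return None  # nvidia-smi ran but could not talk to the driver
    return parse_cuda_version(result.stdout)


if __name__ == "__main__":
    print("CUDA version reported by the driver:", driver_cuda_version())
```

A result of `None` suggests the driver tools are missing or broken. A version number means the driver itself is responding, so the problem is more likely in the image's CUDA or NCCL libraries.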

renatomserra commented 2 months ago

Tried apt update, no change

nvidia-smi:

AmericanPresidentJimmyCarter commented 2 months ago

Hmm, ok.

https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html

Go there and try following the NCCL installation instructions.

renatomserra commented 2 months ago

Will give this a try

Are you still able to run SimpleTuner on Vast.ai instances using the same Docker image from the docs?

AmericanPresidentJimmyCarter commented 2 months ago

I was able to as of a week ago.

renatomserra commented 2 months ago

Yeah it was working for me until 2 days ago.

AmericanPresidentJimmyCarter commented 2 months ago

@bghira says to try the pytorch/pytorch_2.4.0-cuda12.4-cudnn9-devel image. If that helps I will update the guide.

bghira commented 2 months ago

I started that one up fresh on a 3090, 4090, A100, and H100 to test, and they all worked well. The problem is that the default image selected by some vendors like Vast has CUDA 11.8 or 11.5 in there (yikes), and PyTorch 2.6 no longer supports these.
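To make the version mismatch concrete, here is a rough sketch that compares an image's CUDA version against a required minimum. The `12.4` threshold is an assumption taken from the image tag that worked in this thread, not an official PyTorch requirement, and the helper names are hypothetical.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int]:
    """Turn a 'major.minor' string such as '11.8' into a comparable tuple."""
    major, minor = text.strip().split(".")[:2]
    return int(major), int(minor)


def image_cuda_is_sufficient(image_cuda: str, required_cuda: str = "12.4") -> bool:
    """True if the image's CUDA version is at least the required one.

    The default of 12.4 is an assumption based on the image tag mentioned above.
    """
    return parse_version(image_cuda) >= parse_version(required_cuda)


if __name__ == "__main__":
    for version in ("11.5", "11.8", "12.4"):
        print(version, "sufficient:", image_cuda_is_sufficient(version))
    try:
        import torch  # only needed for the live check
        print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
    except ImportError as exc:
        # This is where the libnccl.so.2 error from the traceback would surface.
        print("torch failed to import:", exc)
```

On an instance with the old default image, the first two versions would be flagged as insufficient under this assumed threshold.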

renatomserra commented 2 months ago

Yep, just tested with that image and it does work. Thanks a lot, guys!

AmericanPresidentJimmyCarter / simple-flux-lora-training

Started getting an error running training #4