k8tems opened this issue 1 month ago
The SIGKILL is raised when the script reaches this line.

dit is initially loaded to the CPU as bf16 in the get_models function:
https://github.com/XLabs-AI/x-flux/blob/03639685893d1f1328744ded62b7e51f412d90f0/train_flux_lora_deepspeed.py#L51

Since the dit model has 11,907,011,648 parameters and each float32 parameter takes 4 bytes, I take it dit.to(torch.float32) is supposed to consume 47,628,046,592 bytes (i.e. 44.35 GB) of RAM?
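The arithmetic, as a quick sanity check:

n_params = 11_907_011_648          # dit parameter count
fp32_bytes = n_params * 4          # float32 = 4 bytes per parameter
print(f"{fp32_bytes:,} bytes = {fp32_bytes / 2**30:.2f} GiB")
# 47,628,046,592 bytes = 44.36 GiB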
Just to be sure, I modified torch/nn/modules/module.py to see the RAM being consumed in real time.
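Roughly, the hook looked like this (a standalone sketch using psutil, not my exact patch; the counter name total_params_processed is mine):

import psutil
import torch

total_params_processed = 0

def log_memory(param):
    # called once per parameter tensor as it gets converted
    global total_params_processed
    total_params_processed += param.numel()
    vram = torch.cuda.memory_allocated() / 2**30 if torch.cuda.is_available() else 0.0
    ram = psutil.virtual_memory()  # system-wide RAM stats
    print(f"VRAM {vram:.3f} GB | RAM {ram.used / 2**30:.3f} GB "
          f"/ {ram.total / 2**30:.3f} GB ({ram.percent:.1f}%)")
    print(f"{total_params_processed=}")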
output:
VRAM 9.469 GB | RAM 4.444 GB / 44.082 GB (11.8%)
total_params_processed=5016967501
...
VRAM 9.469 GB | RAM 42.349 GB / 44.082 GB (97.8%)
total_params_processed=15183082829
VRAM 9.469 GB | RAM 42.458 GB / 44.082 GB (98.0%)
total_params_processed=15211394381
VRAM 9.469 GB | RAM 42.458 GB / 44.082 GB (98.0%)
total_params_processed=15211403597
E0818 12:39:45.949000 139699459244032 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 0 (pid: 7522) of binary: /tmp/x-flux/venv/bin/python3.12
Traceback (most recent call last):
File "/tmp/x-flux/venv/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/tmp/x-flux/venv/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/tmp/x-flux/venv/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1067, in launch_command
deepspeed_launcher(args)
File "/tmp/x-flux/venv/lib/python3.12/site-packages/accelerate/commands/launch.py", line 771, in deepspeed_launcher
distrib_run.run(args)
File "/tmp/x-flux/venv/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/tmp/x-flux/venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/x-flux/venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
train_flux_lora_deepspeed.py FAILED
-----------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-08-18_12:39:45
host : n2ocox4owx
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 7522)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 7522
=====================================================
The process crashes after the script processes 15_211_403_597 - 5_016_967_501 = 10_194_436_096 dit parameters.
If the model weights are initially loaded in bf16, subsequently upcast to (sparse?) float32, and finally passed to accelerate.prepare (which probably converts them back to bf16 on CUDA), I'm beginning to question whether the intermediate float32 conversion is necessary at all.
I'm an idiot. I realize now that this is how mixed precision works. (Store the weights in fp32 and do the math in half-precision.) I wonder if there's anything else I can do to reduce the RAM usage.
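For anyone else confused by this, a minimal illustration of the pattern in generic PyTorch autocast (not the repo's exact code):

import torch

# mixed precision in a nutshell: keep an fp32 master copy of the weights
# (that's where the ~44 GB of RAM goes for dit) and run the math in bf16
model = torch.nn.Linear(1024, 1024).to(torch.float32)  # tiny stand-in for dit

x = torch.randn(8, 1024)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)  # the matmul runs in bf16

print(model.weight.dtype)  # torch.float32 -- storage stays fp32
print(y.dtype)             # torch.bfloat16 -- compute happens in half precision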
I have the same issue. How can I train a FLUX LoRA on 24 GB 3090 hardware?
I gave up on this repo and got it working with ai-toolkit; it works fine out of the box, without the weird RAM issue.
Forgot to mention: I was able to bypass the error by commenting out the following line: https://github.com/XLabs-AI/x-flux/blob/03639685893d1f1328744ded62b7e51f412d90f0/train_flux_lora_deepspeed.py#L115 I have no idea if that breaks anything. I decided to give up when I saw that the checkpoint was larger than 50 GB. Might be an error on my side; maybe I was doing full finetuning instead of LoRA? I'm pretty sure I was running the script named train_flux_lora_deepspeed.py, though...
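If you want to check whether a saved checkpoint is a LoRA or a full finetune, something like this works (the path is hypothetical, and a DeepSpeed checkpoint may nest its weights differently):

import os
import torch

path = "checkpoint.pt"  # hypothetical; point at your saved file
state = torch.load(path, map_location="cpu")
if isinstance(state, dict) and "state_dict" in state:
    state = state["state_dict"]  # some savers nest the weights

lora_keys = [k for k in state if "lora" in k.lower()]
size_gb = os.path.getsize(path) / 2**30
print(f"{size_gb:.1f} GB on disk, {len(lora_keys)}/{len(state)} LoRA-looking keys")
# a LoRA-only checkpoint should be well under 1 GB, with mostly lora keys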
Here's my ai-toolkit setup script if you're interested
cd /tmp
git clone https://github.com/ostris/ai-toolkit.git
cd ai-toolkit
git submodule update --init --recursive
python3 -m venv venv
source venv/bin/activate
# .\venv\Scripts\activate on windows
# install torch first
pip3 install torch
pip3 install -r requirements.txt
# fixes the missing libGL.so.1 error
apt-get update
apt-get install -y libgl1-mesa-glx libglib2.0-0
huggingface-cli login
cat << 'EOF' > "/tmp/config.yaml"
job: extension
config:
  # this name will be the folder and filename name
  name: "misaki"
  process:
    - type: 'sd_trainer'
      # root folder to save training sessions/samples/weights
      training_folder: "output"
      # uncomment to see performance stats in the terminal every N steps
      # performance_log_every: 1000
      device: cuda:0
      # if a trigger word is specified, it will be added to captions of training data if it does not already exist
      # alternatively, in your captions you can add [trigger] and it will be replaced with the trigger word
      # trigger_word: "Misaki"
      network:
        type: "lora"
        linear: 16
        linear_alpha: 16
      save:
        dtype: float16 # precision to save
        save_every: 250 # save every this many steps
        max_step_saves_to_keep: 4 # how many intermittent saves to keep
      datasets:
        # datasets are a folder of images. captions need to be txt files with the same name as the image
        # for instance image2.jpg and image2.txt. Only jpg, jpeg, and png are supported currently
        # images will automatically be resized and bucketed into the resolution specified
        # on windows, escape back slashes with another backslash so
        # "C:\\path\\to\\images\\folder"
        - folder_path: "/notebooks/train_misaki_2023_08_30_20_36"
          caption_ext: "txt"
          caption_dropout_rate: 0.05 # will drop out the caption 5% of the time
          shuffle_tokens: false # shuffle caption order, split by commas
          cache_latents_to_disk: true # leave this true unless you know what you're doing
          resolution: [ 512, 768, 1024 ] # flux enjoys multiple resolutions
      train:
        batch_size: 1
        steps: 4000 # total number of steps to train; 500 - 4000 is a good range
        gradient_accumulation_steps: 1
        train_unet: true
        train_text_encoder: false # probably won't work with flux
        gradient_checkpointing: true # need this on unless you have a ton of vram
        noise_scheduler: "flowmatch" # for training only
        optimizer: "adamw8bit"
        lr: 1e-4
        # uncomment this to skip the pre-training sample
        # skip_first_sample: true
        # uncomment to completely disable sampling
        # disable_sampling: true
        # uncomment to use new bell curved weighting. Experimental but may produce better results
        # linear_timesteps: true
        # ema will smooth out learning, but could slow it down. Recommended to leave on.
        ema_config:
          use_ema: true
          ema_decay: 0.99
        # will probably need this if gpu supports it for flux; other dtypes may not work correctly
        dtype: bf16
      model:
        # huggingface model name or path
        name_or_path: "black-forest-labs/FLUX.1-dev"
        is_flux: true
        quantize: true # run 8bit mixed precision
        # low_vram: true # uncomment this if the GPU is connected to your monitors. It will use less vram to quantize, but is slower.
      sample:
        sampler: "flowmatch" # must match train.noise_scheduler
        sample_every: 250 # sample every this many steps
        width: 1024
        height: 1024
        prompts:
          - "This image showcases a cat, Misaki, on a wooden floor. It appears that she is stretching her body out to reach for something on the other side of the room. The scene is set in a home environment with some household items visible in the background."
          - "This image showcases a photograph of a cat named Misaki sitting on a bedspread."
        neg: "" # not used on flux
        seed: 42
        walk_seed: true
        guidance_scale: 4
        sample_steps: 20
# you can add any additional meta info here. [name] is replaced with config name at top
meta:
  name: "[name]"
  version: '1.0'
EOF
python run.py /tmp/config.yaml
I'm currently trying to run the script on the following Paperspace machine. When doing so, I get this error: