Training went wrong, why not start training

aoyang-hd commented 2 months ago

I am using the dataset I downloaded for the example.

config.env:
RESUME_CHECKPOINT='latest'
DATALOADER_CONFIG='config/multidatabackend.json'
ASPECT_BUCKET_ROUNDING='2'
TRAINING_SEED='42'
USE_EMA='false'
USE_XFORMERS='false'
MINIMUM_RESOLUTION='0'
OUTPUT_DIR='output/models'
USE_DORA='false'
USE_BITFIT='false'
LORA_TYPE='lycoris'
LYCORIS_CONFIG='config/lycoris_config.json'
PUSH_TO_HUB='false'
PUSH_CHECKPOINTS='false'
MAX_NUM_STEPS='30000'
NUM_EPOCHS='0'
CHECKPOINTING_STEPS='500'
CHECKPOINTING_LIMIT='5'
HUB_MODEL_NAME='simpletuner-lora'
TRACKER_PROJECT_NAME='lora-training'
TRACKER_RUN_NAME='simpletuner-lora'
MODEL_TYPE='lora'
MODEL_NAME='/mnt/cluster/aigc/ComfyUI/models/FLUX.1-dev/'
FLUX='true'
KOLORS='false'
STABLE_DIFFUSION_3='false'
STABLE_DIFFUSION_LEGACY='false'
TRAIN_BATCH_SIZE='1'
USE_GRADIENT_CHECKPOINTING='true'
GRADIENT_ACCUMULATION_STEPS='2'
CAPTION_DROPOUT_PROBABILITY='0.1'
RESOLUTION_TYPE='pixel_area'
RESOLUTION='1024'
VALIDATION_SEED='42'
VALIDATION_STEPS='500'
VALIDATION_RESOLUTION='1024x1024'
VALIDATION_GUIDANCE='3.0'
VALIDATION_GUIDANCE_RESCALE='0.0'
VALIDATION_NUM_INFERENCE_STEPS='20'
VALIDATION_PROMPT='A photo-realistic image of a cat'
ALLOW_TF32='true'
MIXED_PRECISION='bf16'
OPTIMIZER='adamw_bf16'
LEARNING_RATE='1e-4'
LR_SCHEDULE='polynomial'
LR_WARMUP_STEPS='100'
ACCELERATE_EXTRA_ARGS=''
TRAINING_NUM_PROCESSES='1'
TRAINING_NUM_MACHINES='1'
VALIDATION_TORCH_COMPILE='false'
TRAINER_DYNAMO_BACKEND='no'
TRAINER_EXTRA_ARGS='--lr_end=1e-8 --compress_disk_cache --base_model_precision=int8-quanto --flux_lora_target=mmdit'

Full log:
(flux_lora) root@ubuntu-2288H-V6:/mnt/cluster/aoyang/SimpleTuner# bash train.sh
/mnt/cluster/aoyang/SimpleTuner/.venv/lib/python3.11/site-packages/nvidia/nvjitlink/lib
2024-08-23 17:51:20,978 [WARNING] (bitsandbytes.cextension) WARNING: BNB_CUDA_VERSION=121 environment variable detected; loading libbitsandbytes_cuda121.so.
This can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64

optimizer: {'precision': 'bf16', 'default_settings': {'betas': (0.9, 0.999), 'weight_decay': 0.01, 'eps': 1e-06}, 'class': <class 'helpers.training.adam_bfloat16.AdamWBF16'>} 2024-08-23 17:51:21,044 [WARNING] (ArgsParser) The VAE model madebyollin/sdxl-vae-fp16-fix is not compatible. Please use a compatible VAE to eliminate this warning. The baked-in VAE will be used, instead. 2024-08-23 17:51:21,044 [INFO] (ArgsParser) VAE Model: /mnt/cluster/aigc/ComfyUI/models/FLUX.1-dev/ 2024-08-23 17:51:21,044 [INFO] (ArgsParser) Default VAE Cache location: 2024-08-23 17:51:21,044 [INFO] (ArgsParser) Text Cache location: cache 2024-08-23 17:51:21,044 [WARNING] (ArgsParser) Updating T5 XXL tokeniser max length to 512 for Flux. 2024-08-23 17:51:21,044 [WARNING] (ArgsParser) Gradient accumulation steps are enabled, but gradient precision is set to 'unmodified'. This may lead to numeric instability. Consider disabling gradient accumulation steps. Continuing in 10 seconds.. 2024-08-23 17:51:31,549 [ERROR] (main) Failed to log into Hugging Face Hub: Invalid user token. If you didn't pass a user token, make sure you are properly logged in by executing `huggingface-cli login`, and if you did pass a user token, double-check it's correct. 2024-08-23 17:51:31,549 [INFO] (main) Enabling tf32 precision boost for NVIDIA devices due to --allow_tf32. 2024-08-23 17:51:31,550 [INFO] (main) Load tokenizers You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers 2024-08-23 17:51:31,894 [INFO] (helpers.training.text_encoding) Loading OpenAI CLIP-L text encoder from /mnt/cluster/aigc/ComfyUI/models/FLUX.1-dev//text_encoder.. 2024-08-23 17:51:31,926 [INFO] (helpers.training.text_encoding) Loading T5 XXL v1.1 text encoder from /mnt/cluster/aigc/ComfyUI/models/FLUX.1-dev//text_encoder_2.. 
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 6.70it/s] 2024-08-23 17:51:35,889 [INFO] (main) Load VAE: /mnt/cluster/aigc/ComfyUI/models/FLUX.1-dev/ 2024-08-23 17:51:35,957 [INFO] (main) Moving text encoder to GPU. 2024-08-23 17:51:36,149 [INFO] (main) Moving text encoder 2 to GPU. 2024-08-23 17:51:37,639 [INFO] (main) Loading VAE onto accelerator, converting from torch.float32 to torch.bfloat16 2024-08-23 17:51:37,672 [INFO] (DataBackendFactory) Loading data backend config from config/multidatabackend.json 2024-08-23 17:51:37,672 [INFO] (DataBackendFactory) Configuring text embed backend: text-embeds Loading pipeline components...: 0%| | 0/5 [00:00<?, ?it/s]Loaded scheduler as FlowMatchEulerDiscreteScheduler from `scheduler` subfolder of /mnt/cluster/aigc/ComfyUI/models/FLUX.1-dev/. Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 727.67it/s] 2024-08-23 17:51:37,684 [INFO] (TextEmbeddingCache) (Rank: 0) (id=text-embeds) Listing all text embed cache entries 2024-08-23 17:51:37,993 [INFO] (DataBackendFactory) Pre-computing null embedding 2024-08-23 17:51:42,995 [INFO] (DataBackendFactory) Completed loading text embed services. 2024-08-23 17:51:42,996 [INFO] (DataBackendFactory) Configuring data backend: pseudo-camera-10k-flux 2024-08-23 17:51:42,998 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-flux) Loading bucket manager. 2024-08-23 17:51:43,000 [INFO] (JsonMetadataBackend) Checking for cache file: datasets/pseudo-camera-10k/aspect_ratio_bucket_indices.json 2024-08-23 17:51:43,000 [INFO] (JsonMetadataBackend) Pulling cache file from storage 2024-08-23 17:51:43,020 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-flux) Refreshing aspect buckets on main process. 2024-08-23 17:51:43,020 [INFO] (BaseMetadataBackend) Discovering new files... 
2024-08-23 17:51:47,769 [INFO] (BaseMetadataBackend) Compressed 14102 existing files from 1. 2024-08-23 17:51:47,769 [INFO] (BaseMetadataBackend) No new files discovered. Doing nothing. 2024-08-23 17:51:47,769 [INFO] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 14102, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}} 2024-08-23 17:51:47,783 [WARNING] (DataBackendFactory) Key crop_aspect_buckets not found in the current backend config, using the existing value 'None'. 2024-08-23 17:51:47,783 [WARNING] (DataBackendFactory) Key disable_validation not found in the current backend config, using the existing value 'False'. 2024-08-23 17:51:47,783 [WARNING] (DataBackendFactory) Key config_version not found in the current backend config, using the existing value '1'. 2024-08-23 17:51:47,783 [WARNING] (DataBackendFactory) Key hash_filenames not found in the current backend config, using the existing value 'True'. 2024-08-23 17:51:47,783 [INFO] (DataBackendFactory) Configured backend: {'id': 'pseudo-camera-10k-flux', 'config': {'ignore_epochs': True, 'crop': True, 'crop_aspect': 'square', 'crop_aspect_buckets': None, 'crop_style': 'center', 'disable_validation': False, 'resolution': 0.262144, 'resolution_type': 'area', 'caption_strategy': 'filename', 'instance_data_dir': 'datasets/pseudo-camera-10k', 'maximum_image_size': 0.262144, 'target_downsample_size': 0.262144, 'config_version': 1, 'hash_filenames': True}, 'dataset_type': 'image', 'data_backend': <helpers.data_backend.local.LocalDataBackend object at 0x7618532938d0>, 'instance_data_dir': 'datasets/pseudo-camera-10k', 'metadata_backend': <helpers.metadata.backends.json.JsonMetadataBackend object at 0x761858615c10>} (Rank: 0) | Bucket | Image Count (per-GPU)

(Rank: 0) | 1.0 | 14102
2024-08-23 17:51:47,784 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-flux) Collecting captions. 2024-08-23 17:51:47,837 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-flux) Initialise text embed pre-computation using the filename caption strategy. We have 14102 captions to process. 2024-08-23 17:51:48,639 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-flux) Completed processing 14102 captions. 2024-08-23 17:51:48,639 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-flux) Creating VAE latent cache. 2024-08-23 17:51:48,850 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-flux) Discovering cache objects.. 2024-08-23 17:51:51,097 [INFO] (DataBackendFactory) Configured backend: {'id': 'pseudo-camera-10k-flux', 'config': {'ignore_epochs': True, 'crop': True, 'crop_aspect': 'square', 'crop_aspect_buckets': None, 'crop_style': 'center', 'disable_validation': False, 'resolution': 0.262144, 'resolution_type': 'area', 'caption_strategy': 'filename', 'instance_data_dir': 'datasets/pseudo-camera-10k', 'maximum_image_size': 0.262144, 'target_downsample_size': 0.262144, 'config_version': 1, 'hash_filenames': True}, 'dataset_type': 'image', 'data_backend': <helpers.data_backend.local.LocalDataBackend object at 0x7618532938d0>, 'instance_data_dir': 'datasets/pseudo-camera-10k', 'metadata_backend': <helpers.metadata.backends.json.JsonMetadataBackend object at 0x761858615c10>, 'train_dataset': <helpers.multiaspect.dataset.MultiAspectDataset object at 0x7618583695d0>, 'sampler': <helpers.multiaspect.sampler.MultiAspectSampler object at 0x761858cec850>, 'train_dataloader': <torch.utils.data.dataloader.DataLoader object at 0x761858369b50>, 'text_embed_cache': <helpers.caching.text_embeds.TextEmbeddingCache object at 0x76185abf88d0>, 'vaecache': <helpers.caching.vae.VAECache object at 0x76185b79e750>} 2024-08-23 17:51:51,930 [INFO] (validation) Precomputing the negative prompt embed for validations. 
2024-08-23 17:51:52,368 [INFO] (main) Unloading text encoders, as they are not being trained. 2024-08-23 17:51:58,004 [INFO] (main) After nuking text encoders from orbit, we freed 9.11 GB of VRAM. The real memories were the friends we trained a model on along the way. 2024-08-23 17:51:58,554 [INFO] (main) Keeping some base model weights in torch.bfloat16. 2024-08-23 17:51:58,555 [INFO] (helpers.training.quantisation) Loading Quanto for LoRA training. This may take a few minutes. 2024-08-23 17:51:58,555 [INFO] (helpers.training.quantisation) Quantising FluxTransformer2DModel. Using int8-quanto. 2024-08-23 17:53:26,357 [INFO] (helpers.training.quantisation) Freezing model. 2024-08-23 17:53:40,956 [INFO] (main) Using lycoris training mode 2024-08-23 17:53:40|[LyCORIS]-INFO: Using rank adaptation algo: lora 2024-08-23 17:53:40|[LyCORIS]-INFO: Use Dropout value: 0.0 2024-08-23 17:53:40|[LyCORIS]-INFO: Create LyCORIS Module 2024-08-23 17:53:40|[LyCORIS]-WARNING: Using bnb/quanto/optimum-quanto with LyCORIS will enable force-bypass mode. 2024-08-23 17:53:42|[LyCORIS]-INFO: create LyCORIS: 342 modules. 
2024-08-23 17:53:42|[LyCORIS]-INFO: module type table: {'LoConModule': 342} 2024-08-23 17:53:42,483 [INFO] (main) LyCORIS network has been initialized with 179,306,496 parameters 2024-08-23 17:53:42,483 [INFO] (main) Collected the following data backends: ['text-embeds', 'pseudo-camera-10k-flux'] 2024-08-23 17:53:42,483 [INFO] (main) Loading polynomial learning rate scheduler with 100 warmup steps 2024-08-23 17:53:42,487 [INFO] (main) Learning rate: 0.0001 2024-08-23 17:53:42,487 [INFO] (helpers.training.optimizer_param) cls: <class 'helpers.training.adam_bfloat16.AdamWBF16'>, settings: {'betas': (0.9, 0.999), 'weight_decay': 0.01, 'eps': 1e-06} 2024-08-23 17:53:42,493 [INFO] (main) Optimizer arguments, weight_decay=0.01 eps=1e-08, extra_arguments={'lr': 0.0001, 'betas': (0.9, 0.999), 'weight_decay': 0.01, 'eps': 1e-06} 2024-08-23 17:53:42,494 [INFO] (main) Loading polynomial learning rate scheduler with 100 warmup steps 2024-08-23 17:53:42,494 [INFO] (main) Using Polynomial learning rate scheduler with last epoch -2. 2024-08-23 17:53:42,499 [INFO] (SaveHookManager) Denoiser class set to: FluxTransformer2DModel. 2024-08-23 17:53:42,499 [INFO] (SaveHookManager) Pipeline class set to: FluxPipeline. 2024-08-23 17:53:42,499 [INFO] (main) Loading our accelerator... 2024-08-23 17:53:46,312 [INFO] (main) After removing any undesired samples and updating cache entries, we have settled on 5 epochs and 7051 steps per epoch. 2024-08-23 17:53:46,517 [INFO] (main) After nuking the VAE from orbit, we freed 163.84 MB of VRAM. 2024-08-23 17:53:46,518 [INFO] (main) Checkpoint 'latest' does not exist. Starting a new training run. 
2024-08-23 17:53:46,525 [INFO] (MultiAspectSampler-pseudo-camera-10k-flux) (Rank: 0) -> Number of seen images: 0 (Rank: 0) -> Number of unseen images: 14102 (Rank: 0) -> Current Bucket: None (Rank: 0) -> 1 Buckets: ['1.0'] (Rank: 0) -> 0 Exhausted Buckets: [] wandb: Currently logged in as: 1522173817 (1522173817-jiangxi-university-of-science-and-technology). Use wandb login --relogin to force relogin wandb: Tracking run with wandb version 0.16.6 wandb: Run data is saved locally in /mnt/cluster/aoyang/SimpleTuner/wandb/run-20240823_175348-476a8ffcfe814f718fd4e35f6504b76e wandb: Run wandb offline to turn off syncing. wandb: Resuming run simpletuner-lora wandb: ⭐️ View project at https://wandb.ai/1522173817-jiangxi-university-of-science-and-technology/lora-training wandb: 🚀 View run at https://wandb.ai/1522173817-jiangxi-university-of-science-and-technology/lora-training/runs/476a8ffcfe814f718fd4e35f6504b76e 2024-08-23 17:53:53,357 [INFO] (main) Moving the diffusion transformer to GPU in int8-quanto precision. 2024-08-23 17:53:53,598 [INFO] (main) Running training

Num batches = 14102
Num Epochs = 5
- Current Epoch = 1
Total train batch size (w. parallel, distributed & accumulation) = 2
- Instantaneous batch size per device = 1
- Gradient Accumulation steps = 2
Total optimization steps = 30000
Total optimization steps remaining = 30000
Epoch 5/5, Steps: 0%| | 0/30000 [00:00<?, ?it/s]
[2024-08-23 17:53:54,017] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2024-08-23 17:53:54,474 [INFO] (main) Saving final LyCORIS checkpoint to output/models
wandb:
wandb: 🚀 View run simpletuner-lora at: https://wandb.ai/1522173817-jiangxi-university-of-science-and-technology/lora-training/runs/476a8ffcfe814f718fd4e35f6504b76e
wandb: ⭐️ View project at: https://wandb.ai/1522173817-jiangxi-university-of-science-and-technology/lora-training
wandb: Synced 4 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240823_175348-476a8ffcfe814f718fd4e35f6504b76e/logs
Epoch 5/5, Steps: 0%| | 0/30000 [00:09<?, ?it/s]
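For reference, the epoch and step counts reported in the log ("5 epochs and 7051 steps per epoch") are consistent with the simple arithmetic below. This is only a sketch that reproduces those numbers from the values in config.env and the log; the function name epoch_plan is hypothetical, and this is not claimed to be how SimpleTuner computes them internally.

```python
import math


def epoch_plan(num_batches, grad_accum_steps, max_train_steps):
    """Reproduce the step/epoch counts seen in the log (illustrative only)."""
    # One optimizer step consumes `grad_accum_steps` batches.
    steps_per_epoch = math.ceil(num_batches / grad_accum_steps)
    # With NUM_EPOCHS='0', the epoch count follows from MAX_NUM_STEPS.
    num_epochs = math.ceil(max_train_steps / steps_per_epoch)
    return steps_per_epoch, num_epochs


# Values taken from the log: 14102 batches, accumulation 2, 30000 steps.
steps_per_epoch, num_epochs = epoch_plan(14102, 2, 30000)
print(steps_per_epoch, num_epochs)  # 7051 5
```

So the progress bar starting at "Epoch 5/5" with 0 of 30000 steps done means the run is already positioned at the last planned epoch before any step has been taken.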

Why was training skipped, ending before it even started?

bghira commented 2 months ago

You have probably set ignore_epochs=true on your dataset.

bghira commented 2 months ago

2024-08-23 17:51:51,097 [INFO] (DataBackendFactory) Configured backend: {'id': 'pseudo-camera-10k-flux', 'config': {'ignore_epochs': True, 'crop': True, 'crop_aspect': 'square', 'crop_aspect_buckets': None, 'crop_style': 'center', 'disable_validation': False, 'resolution': 0.262144, 'resolution_type': 'area', 'caption_strategy': 'filename', 'instance_data_dir': 'datasets/pseudo-camera-10k', 'maximum_image_size': 0.262144, 'target_downsample_size': 0.262144, 'config_version': 1, 'hash_filenames': True}, 'dataset_type': 'image', 'data_backend': <helpers.data_backend.local.LocalDataBackend object at 0x7618532938d0>, 'instance_data_dir': 'datasets/pseudo-camera-10k', 'metadata_backend': <helpers.metadata.backends.json.JsonMetadataBackend object at 0x761858615c10>, 'train_dataset': <helpers.multiaspect.dataset.MultiAspectDataset object at 0x7618583695d0>, 'sampler': <helpers.multiaspect.sampler.MultiAspectSampler object at 0x761858cec850>, 'train_dataloader': <torch.utils.data.dataloader.DataLoader object at 0x761858369b50>, 'text_embed_cache': <helpers.caching.text_embeds.TextEmbeddingCache object at 0x76185abf88d0>, 'vaecache': <helpers.caching.vae.VAECache object at 0x76185b79e750>}
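The quoted log line shows 'ignore_epochs': True for the pseudo-camera-10k-flux backend. A small script along these lines could list which dataset entries in config/multidatabackend.json have that flag set. It is a sketch under assumptions: it assumes the file is a JSON list of dataset objects and accepts the key either at the top level of an entry or nested under "config" (as it appears in the log); the helper names are hypothetical and not part of SimpleTuner.

```python
import json


def backends_ignoring_epochs(backends):
    """Return the ids of dataset entries that have ignore_epochs set to true."""
    flagged = []
    for entry in backends:
        if not isinstance(entry, dict):
            continue
        # Assumption: the key sits at the top level of the entry, or under
        # "config" as in the logged runtime structure.
        nested = entry.get("config")
        nested = nested if isinstance(nested, dict) else {}
        if entry.get("ignore_epochs") is True or nested.get("ignore_epochs") is True:
            flagged.append(entry.get("id", "<no id>"))
    return flagged


def load_backends(path):
    """Load the dataloader config (assumed to be a JSON list of objects)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# In-memory example mirroring the ids from the log; on a real setup use
# load_backends("config/multidatabackend.json") instead.
example = [
    {"id": "text-embeds", "dataset_type": "text_embeds"},
    {"id": "pseudo-camera-10k-flux", "ignore_epochs": True},
]
print(backends_ignoring_epochs(example))  # ['pseudo-camera-10k-flux']
```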

bghira / SimpleTuner

Training went wrong, why not start training #855