bghira / SimpleTuner

A general fine-tuning kit geared toward diffusion models.

No images were discovered by the bucket manager #334

Closed elismasilva closed 6 months ago

elismasilva commented 7 months ago

Hi, I'm now trying to train SDXL with images at a resolution of 768x768. I set the batch size to 2 in the env file, and I have 16 images in the folder. I am using resolution type 'pixel' and resolution '768' in both the env file and multidatabackend.

Here is my backend file:

[
    {
        "id": "pelolisu",
        "type": "local",
        "instance_data_dir": "/home/master/wsl-ntfs/dataset",
        "crop": false,
        "crop_style": "random",
        "crop_aspect": "preserve",
        "resolution": 768,
        "resolution_type": "pixel",
        "minimum_image_size": 1,
        "prepend_instance_prompt": true,
        "instance_prompt": "ohwx man",
        "only_instance_prompt": false,
        "caption_strategy": "textfile",
        "cache_dir_vae": "/home/master/wsl-ntfs/vaecache",
        "vae_cache_clear_each_epoch": false,
        "probability": 1.0,
        "repeats": 5,
        "text_embeds": "pelolisu-embed-cache",
        "preserve_data_backend_cache": true
    },
    {
        "id": "pelolisu-embed-cache",
        "dataset_type": "text_embeds",
        "default": true,
        "type": "local",
        "cache_dir": "/home/master/wsl-ntfs/textembed_cache"
    }
]
2024-03-28 21:44:46,638 [INFO] (ArgsParser) VAE Model: madebyollin/sdxl-vae-fp16-fix
2024-03-28 21:44:46,639 [INFO] (ArgsParser) Default VAE Cache location: /home/master/wsl-ntfs/dataset/models/cache_vae
2024-03-28 21:44:46,639 [INFO] (ArgsParser) Text Cache location: cache
[2024-03-28 21:44:46,665] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-28 21:44:46,784] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-03-28 21:44:46,784] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
2024-03-28 21:44:46,812 [INFO] (__main__) Updated gradient_accumulation_steps to the value provided by DeepSpeed: 1
2024-03-28 21:44:46,813 [INFO] (__main__) Enabling tf32 precision boost for NVIDIA devices due to --allow_tf32.
2024-03-28 21:44:46,813 [INFO] (__main__) Load tokenizers
2024-03-28 21:44:46,907 [INFO] (__main__) Load text encoder 1..
2024-03-28 21:44:47,002 [INFO] (__main__) Load text encoder 2..
2024-03-28 21:44:47,454 [INFO] (__main__) Load VAE..
2024-03-28 21:44:47,855 [INFO] (__main__) Moving models to GPU. Almost there.
2024-03-28 21:44:48,295 [INFO] (__main__) Creating the U-net..
2024-03-28 21:44:49,896 [INFO] (__main__) Moving the U-net to GPU.
2024-03-28 21:44:51,189 [INFO] (__main__) Enabling xformers memory-efficient attention.
2024-03-28 21:44:52,433 [INFO] (__main__) Initialising VAE in bf16 precision, you may specify a different value if preferred: bf16, fp16, fp32, default
2024-03-28 21:44:52,486 [INFO] (__main__) Loaded VAE into VRAM.
2024-03-28 21:44:52,502 [INFO] (DataBackendFactory) Configuring text embed backend: pelolisu-embed-cache
2024-03-28 21:44:52,503 [INFO] (TextEmbeddingCache) (Rank: 0) (id=pelolisu-embed-cache) Listing all text embed cache entries
2024-03-28 21:44:52,504 [INFO] (DataBackendFactory) Pre-computing null embedding for caption dropout

Write embeds to disk:   0%|                                                                            | 0/1 [00:00<?, ?it/s]

Processing prompts:   0%|                                                                              | 0/1 [00:00<?, ?it/s]

2024-03-28 21:44:52,665 [INFO] (DataBackendFactory) Completed loading text embed services.
2024-03-28 21:44:52,665 [INFO] (DataBackendFactory) Configuring data backend: pelolisu
2024-03-28 21:44:52,666 [INFO] (DataBackendFactory) Configured backend: {'id': 'pelolisu', 'config': {'vae_cache_clear_each_epoch': False, 'probability': 1.0, 'repeats': 5, 'crop': False, 'crop_aspect': 'preserve', 'crop_style': 'random', 'resolution': 768, 'resolution_type': 'pixel', 'caption_strategy': 'textfile', 'maximum_image_size': None, 'target_downsample_size': None}, 'dataset_type': 'image'}
2024-03-28 21:44:52,666 [INFO] (DataBackendFactory) (id=pelolisu) Loading bucket manager.
2024-03-28 21:44:52,672 [INFO] (DataBackendFactory) (id=pelolisu) Refreshing aspect buckets on main process.
2024-03-28 21:44:52,672 [INFO] (BaseMetadataBackend) Discovering new files...
2024-03-28 21:44:52,682 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (18).jpg.
2024-03-28 21:44:52,686 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (10).jpg.
2024-03-28 21:44:52,689 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (12).jpg.
2024-03-28 21:44:52,692 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (15).jpg.
2024-03-28 21:44:52,695 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (17).jpg.
2024-03-28 21:44:52,710 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (2).jpg.
2024-03-28 21:44:52,716 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 21:44:52,710 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (19).jpg.
2024-03-28 21:44:52,720 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (3).jpg.

2024-03-28 21:44:52,717 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 21:44:52,720 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (20).jpg.

2024-03-28 21:44:52,728 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (5).jpg.
2024-03-28 21:44:52,728 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (11).jpg.
2024-03-28 21:44:52,730 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 21:44:52,730 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (4).jpg.
2024-03-28 21:44:52,731 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (1).jpg.

Generating aspect bucket cache:   0%|                                        | 0/16 [00:00<?, ?it/s]
2024-03-28 21:44:52,733 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (16).jpg.
2024-03-28 21:44:52,733 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (6).jpg.

2024-03-28 21:44:52,739 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 21:44:52,742 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}

2024-03-28 21:44:52,761 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 21:44:52,762 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (14).jpg.

2024-03-28 21:44:52,768 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 21:44:52,772 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 1, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 21:44:52,774 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 1, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 21:44:52,774 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 21:44:52,775 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 1, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 21:44:52,776 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.

2024-03-28 21:44:52,777 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 21:44:52,781 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 21:44:52,781 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 21:44:52,782 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 21:44:52,782 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 21:44:52,785 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 1, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 21:44:52,786 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 21:44:52,787 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 21:44:52,787 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.

2024-03-28 21:44:52,791 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 21:44:52,791 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.

2024-03-28 21:44:52,794 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.

2024-03-28 21:44:52,800 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 21:44:52,798 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 21:44:52,805 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 21:44:52,807 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.

2024-03-28 21:44:52,834 [INFO] (BaseMetadataBackend) Image processing statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 21:44:52,836 [INFO] (BaseMetadataBackend) Completed aspect bucket update.
2024-03-28 21:44:52,836 [DEBUG] (BaseMetadataBackend) Refreshing buckets for rank (Rank: 0)  via data_backend id pelolisu.
2024-03-28 21:44:52,837 [DEBUG] (BaseMetadataBackend) Before updating, in all buckets, we had 0.
2024-03-28 21:44:52,837 [DEBUG] (BaseMetadataBackend) After updating, in all buckets, we had 0.
2024-03-28 21:44:52,838 [DEBUG] (BaseMetadataBackend) Count of items before split: 0
2024-03-28 21:44:52,838 [DEBUG] (BaseMetadataBackend) Count of items after split: 0
2024-03-28 21:44:52,839 [INFO] (DataBackendFactory) Configured backend: {'id': 'pelolisu', 'config': {'vae_cache_clear_each_epoch': False, 'probability': 1.0, 'repeats': 5, 'crop': False, 'crop_aspect': 'preserve', 'crop_style': 'random', 'resolution': 768, 'resolution_type': 'pixel', 'caption_strategy': 'textfile', 'maximum_image_size': None, 'target_downsample_size': None}, 'dataset_type': 'image', 'data_backend': <helpers.data_backend.local.LocalDataBackend object at 0x7fbd19cc2a10>, 'instance_data_root': '/home/master/wsl-ntfs/dataset', 'metadata_backend': <helpers.metadata.backends.json.JsonMetadataBackend object at 0x7fbd19cc3b50>}
(Rank: 0)  | Bucket     | Image Count 
------------------------------
2024-03-28 21:44:52,843 [ERROR] (__main__) No images were discovered by the bucket manager in the dataset: pelolisu., traceback: Traceback (most recent call last):
  File "/mnt/f/Projetos/SimpleTuner/train_sdxl.py", line 428, in main
    configure_multi_databackend(
  File "/mnt/f/Projetos/SimpleTuner/helpers/data_backend/factory.py", line 492, in configure_multi_databackend
    raise Exception(
Exception: No images were discovered by the bucket manager in the dataset: pelolisu.
bghira commented 7 months ago

minimum_image_size should probably be set to 768 instead of 1. what are the actual sizes of your images?
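For illustration, a sketch of that change in the dataset entry (abridged from the config above; with resolution_type "pixel", the minimum is an edge length in pixels, not a megapixel count):

{
    "id": "pelolisu",
    "type": "local",
    "instance_data_dir": "/home/master/wsl-ntfs/dataset",
    "resolution": 768,
    "resolution_type": "pixel",
    "minimum_image_size": 768
}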

elismasilva commented 7 months ago

minimum_image_size should probably be set to 768 instead of 1. what are the actual sizes of your images?

I tried with 768 and it didn't work, then I changed it to 1 but that didn't work either; my images are 768x768. I removed the models folder and the caches to rebuild everything, but the error persists.

elismasilva commented 7 months ago

My env file:

# Configure these values.
# 'lora' or 'full'
# lora - train a small network for a character or style, or both. quite versatile.
# full - requires lots of vram, trains very slowly, needs a lot of data and concepts.
export MODEL_TYPE='full'
# DoRA enhances the training style of LoRA, but it will run more slowly at the same rank.
# See: https://arxiv.org/abs/2402.09353
# See: https://github.com/huggingface/peft/pull/1474
export USE_DORA=false
# BitFit freeze strategy for the u-net causes everything but the biases to be frozen.
# This may help retain the full model's underlying capabilities. LoRA is currently not tested/known to work.
if [[ "$MODEL_TYPE" == "full" ]]; then
    # When training a full model, we will rely on BitFit to keep the u-net intact.
    export USE_BITFIT=true
elif [[ "$MODEL_TYPE" == "lora" ]]; then
    # As of v0.9.2 of SimpleTuner, LoRA can not use BitFit.
    export USE_BITFIT=false
fi

# Restart where we left off. Change this to "checkpoint-1234" to start from a specific checkpoint.
export RESUME_CHECKPOINT="latest"

# How often to checkpoint. Depending on your learning rate, you may wish to change this.
# For the default settings with 10 gradient accumulations, more frequent checkpoints might be preferable at first.
export CHECKPOINTING_STEPS=150
# This is how many checkpoints we will keep. Two is safe, but three is safer.
export CHECKPOINTING_LIMIT=3

# This is decided as a relatively conservative 'constant' learning rate.
# Adjust higher or lower depending on how burnt your model becomes.
export LEARNING_RATE=4e-6 #@param {type:"number"}

# Using a Huggingface Hub model:
#export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
# Using a local path to a huggingface hub model or saved checkpoint:
export MODEL_NAME="/home/master/wsl-ntfs/models/RealVisXL_V4.0"

# Make DEBUG_EXTRA_ARGS empty to disable wandb.
export DEBUG_EXTRA_ARGS="--report_to=wandb"
export TRACKER_PROJECT_NAME="sdxl-training"
export TRACKER_RUN_NAME="simpletuner-sdxl"

# Max number of steps OR epochs can be used. But we default to Epochs.
export MAX_NUM_STEPS=0
# Will likely overtrain, but that's fine.
export NUM_EPOCHS=150

# A convenient prefix for all of your training paths.
export BASE_DIR="/home/master/wsl-ntfs/dataset"
export DATALOADER_CONFIG="${BASE_DIR}/multidatabackend.json"
export OUTPUT_DIR="${BASE_DIR}/models"
# By default, images will be resized so their SMALLER EDGE is 1024 pixels, maintaining aspect ratio.
# Setting this value to 768px might result in more reasonable training data sizes for SDXL.
export RESOLUTION=768
# If you want to have the training data resized by pixel area (Megapixels) rather than edge length,
#  set this value to "area" instead of "pixel", and uncomment the next RESOLUTION declaration.
export RESOLUTION_TYPE="pixel"
#export RESOLUTION=0.75          # 1.0 Megapixel training sizes
# If RESOLUTION_TYPE="pixel", the minimum resolution specifies the smaller edge length, measured in pixels. Recommended: 1024.
# If RESOLUTION_TYPE="area", the minimum resolution specifies the total image area, measured in megapixels. Recommended: 1.
export MINIMUM_RESOLUTION=$RESOLUTION

# Use this to append an instance prompt to each caption, used for adding trigger words.
# This has not been tested in SDXL.
#export INSTANCE_PROMPT="lotr style "
# If you also supply a user prompt library or `--use_prompt_library`, this will be added to those lists.
export VALIDATION_PROMPT="portrait photo of (ohwx man:1.1) wearing an expensive  suit, white background, fit"
export VALIDATION_GUIDANCE=7.5
# You'll want to set this to 0.7 if you are training a terminal SNR model.
export VALIDATION_GUIDANCE_RESCALE=0.0
# How frequently we will save and run a pipeline for validations.
export VALIDATION_STEPS=100
export VALIDATION_NUM_INFERENCE_STEPS=30
export VALIDATION_NEGATIVE_PROMPT="(blue eyes, semi-realistic, cgi, 3d, render, sketch, cartoon, drawing, anime:1.4), fat, text, cropped, out of frame, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck"
export VALIDATION_SEED=42
export VALIDATION_RESOLUTION=$RESOLUTION

# Adjust this for your GPU memory size. This, and resolution, are the biggest VRAM killers.
export TRAIN_BATCH_SIZE=2
# Accumulate your update gradient over many steps, to save VRAM while still having higher effective batch size:
# effective batch size = ($TRAIN_BATCH_SIZE * $GRADIENT_ACCUMULATION_STEPS).
export GRADIENT_ACCUMULATION_STEPS=1

# Use any standard scheduler type. constant, polynomial, constant_with_warmup
export LR_SCHEDULE="sine"
# A warmup period allows the model, and more importantly the EMA weights, to adapt to the current data.
# For the cosine or sine type schedules, the warmup period defines the interval between peaks or valleys.
# Use a sine schedule to simulate a warmup period, or a Cosine period to simulate a polynomial start.
#export LR_WARMUP_STEPS=$((MAX_NUM_STEPS / 10))
export LR_WARMUP_STEPS=1000

# Caption dropout probability. Set to 0.1 for 10% of captions dropped out. Set to 0 to disable.
# You may wish to disable dropout if you want to limit your changes strictly to the prompts you show the model.
# You may wish to increase the rate of dropout if you want to more broadly adopt your changes across the model.
export CAPTION_DROPOUT_PROBABILITY=0.1

export METADATA_UPDATE_INTERVAL=65
export VAE_BATCH_SIZE=12

# If this is set, any images that fail to open will be DELETED to avoid re-checking them every time.
export DELETE_ERRORED_IMAGES=0
# If this is set, any images that are too small for the minimum resolution size will be DELETED.
export DELETE_SMALL_IMAGES=0

# Bytedance recommends these be set to "trailing" so that inference and training behave in a more congruent manner.
# To follow the original SDXL training strategy, use "leading" instead, though results are generally worse.
export TRAINING_SCHEDULER_TIMESTEP_SPACING="trailing"
export INFERENCE_SCHEDULER_TIMESTEP_SPACING="trailing"

# Removing this option or unsetting it uses vanilla training. Setting it reweights the loss by the position of the timestep in the noise schedule.
# A value "5" is recommended by the researchers. A value of "20" is the least impact, and "1" is the most impact.
export MIN_SNR_GAMMA=5

# Set this to an explicit value of "false" to disable Xformers. Probably required for AMD users.
export USE_XFORMERS=true

# There's basically no reason to unset this. However, to disable it, use an explicit value of "false".
# This will save a lot of memory consumption when enabled.
export USE_GRADIENT_CHECKPOINTING=true

##
# Options below here may require a bit more complicated configuration, so they are not simple variables.
##

# TF32 is great on Ampere or Ada, not sure about earlier generations.
export ALLOW_TF32=true
# AdamW 8Bit is a robust and lightweight choice. Adafactor might reduce memory consumption, and Dadaptation is slow and experimental.
# AdamW is the default optimizer, but it uses a lot of memory and is slower than AdamW8Bit or Adafactor.
# Choices: adamw, adamw8bit, adafactor, dadaptation
export OPTIMIZER="adamw8bit"

# EMA is a strong regularisation method that uses a lot of extra VRAM to hold two copies of the weights.
# This is worthwhile on large training runs, but not so much for smaller training runs.
export USE_EMA=false
export EMA_DECAY=0.999

export TRAINER_EXTRA_ARGS=""
## For offset noise training:
# Not recommended for terminal SNR models.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --offset_noise --noise_offset=0.02"

## For noise input perturbation - adds extra noise, randomly. This is separate from offset noise, but can help stabilize it and reduce overfitting.
# Not recommended for terminal SNR models.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --input_perturbation=0.01"

## For terminal SNR training:
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --prediction_type=v_prediction --rescale_betas_zero_snr"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --training_scheduler_timestep_spacing=trailing --inference_scheduler_timestep_spacing=trailing"
## You may benefit from directing training toward a specific weighted subset of timesteps.
# In this example, we train the final 25% of the timestep schedule with a 3x bias.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --timestep_bias_strategy=later --timestep_bias_portion=0.25 --timestep_bias_multiplier=3"
# In this example, we train the earliest 25% of the timestep schedule with a 5x bias.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --timestep_bias_strategy=earlier --timestep_bias_portion=0.25 --timestep_bias_multiplier=5"
# Here, we designate that specifically, timesteps 200 to 500 should be prioritised.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --timestep_bias_strategy=range --timestep_bias_begin=200 --timestep_bias_end=500 --timestep_bias_multiplier=3"

## For experimental min-SNR weighted loss training (5 is suggested value by the original researchers):
# Not recommended for terminal SNR models.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --snr_gamma=5.0"

# For Wasabi S3 filesystem backend (experimental)
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --data_backend=aws --aws_bucket_name=test123"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --aws_endpoint_url=https://s3.wasabisys.com"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --aws_access_key=1234567890"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --aws_secret_access_key=0987654321"

export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --override_dataset_config"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --offload_param_path=/home/master/wsl-ntfs/offload"
export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --fully_unload_text_encoder"
# Reproducible training. Set to -1 to disable.
export TRAINING_SEED=-1 #420420420

# Mixed precision is the best. You honestly might need to YOLO it in fp16 mode for Google Colab type setups.
export MIXED_PRECISION="bf16"                # Might not be supported on all GPUs. fp32 will be needed for others.

# This has to be changed if you're training with multiple GPUs.
export TRAINING_NUM_PROCESSES=1
export TRAINING_NUM_MACHINES=1
export ACCELERATE_EXTRA_ARGS=""                          # --multi_gpu or other similar flags for huggingface accelerate

# With Pytorch 2.1, you might have pretty good luck here.
# If you're using aspect bucketing however, each resolution change will recompile. Seriously, just don't do it.
# Well, then again... Pytorch 2.2 has support for dynamic shapes. Why not?
export TRAINING_DYNAMO_BACKEND='no'                # 'inductor' or 'no' if you want to disable torch compile in case of performance issues or lack of support (eg. AMD)
bghira commented 7 months ago

can you remove the .json files the trainer created in your data dir then try again with the minimum resolution setting removed?
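A minimal sketch of that cleanup, assuming the paths from the configs above. Note that this setup keeps multidatabackend.json in the same directory (BASE_DIR points at the dataset folder), so a bare `rm *.json` would delete the dataloader config as well; the find invocation below preserves it:

# Delete the trainer-created .json metadata caches from the data dir, but keep
# the dataloader config, which this env file stores in the same directory.
find /home/master/wsl-ntfs/dataset -maxdepth 1 -name '*.json' \
    ! -name 'multidatabackend.json' -print -delete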

elismasilva commented 7 months ago

can you remove the .json files the trainer created in your data dir then try again with the minimum resolution setting removed?

OK, that worked. But now I'm hitting another problem, and it's very strange: the run now reports num batches = 1 and only 150 steps, yet when I was running with 8 images and batch size 1 in the env file, that number was 8, with 2000 steps. (The arithmetic is sketched after the log excerpt below.)

2024-03-28 22:46:52,929 [INFO] (__main__) ***** Running training *****
2024-03-28 22:46:52,929 [INFO] (__main__)  -> Num batches = 1
2024-03-28 22:46:52,929 [INFO] (__main__)  -> Num Epochs = 150
2024-03-28 22:46:52,930 [INFO] (__main__)  -> Current Epoch = 1
2024-03-28 22:46:52,930 [INFO] (__main__)  -> Instantaneous batch size per device = 2
2024-03-28 22:46:52,931 [INFO] (__main__)  -> Gradient Accumulation steps = 1
2024-03-28 22:46:52,931 [INFO] (__main__)    -> Total train batch size (w. parallel, distributed & accumulation) = 2
2024-03-28 22:46:52,931 [INFO] (__main__)  -> Total optimization steps = 150
2024-03-28 22:46:52,932 [INFO] (__main__)  -> Total optimization steps remaining = 150
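The totals above are internally consistent; the surprise is the batch count, not the multiplication. A quick sanity check of what the trainer reports (1 batch per epoch; the fuller log below shows only 2 of the 16 images reached the aspect buckets, which at batch size 2 yields exactly one batch):

# total optimization steps = batches per epoch * number of epochs
num_batches=1; epochs=150
echo $(( num_batches * epochs ))   # prints 150, matching "Total optimization steps = 150"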
elismasilva commented 7 months ago

Logs:

2024-03-28 22:46:17,412 [INFO] (ArgsParser) VAE Model: madebyollin/sdxl-vae-fp16-fix
2024-03-28 22:46:17,412 [INFO] (ArgsParser) Default VAE Cache location: /home/master/wsl-ntfs/dataset/models/cache_vae
2024-03-28 22:46:17,412 [INFO] (ArgsParser) Text Cache location: cache
[2024-03-28 22:46:17,441] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-28 22:46:17,566] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-03-28 22:46:17,566] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
2024-03-28 22:46:17,592 [INFO] (__main__) Updated gradient_accumulation_steps to the value provided by DeepSpeed: 1
2024-03-28 22:46:17,593 [INFO] (__main__) Enabling tf32 precision boost for NVIDIA devices due to --allow_tf32.
2024-03-28 22:46:17,593 [INFO] (__main__) Load tokenizers
2024-03-28 22:46:17,689 [INFO] (__main__) Load text encoder 1..
2024-03-28 22:46:17,782 [INFO] (__main__) Load text encoder 2..
2024-03-28 22:46:18,210 [INFO] (__main__) Load VAE..
2024-03-28 22:46:18,598 [INFO] (__main__) Moving models to GPU. Almost there.
2024-03-28 22:46:19,051 [INFO] (__main__) Creating the U-net..
2024-03-28 22:46:20,632 [INFO] (__main__) Moving the U-net to GPU.
2024-03-28 22:46:21,861 [INFO] (__main__) Enabling xformers memory-efficient attention.
2024-03-28 22:46:23,134 [INFO] (__main__) Initialising VAE in bf16 precision, you may specify a different value if preferred: bf16, fp16, fp32, default
2024-03-28 22:46:23,182 [INFO] (__main__) Loaded VAE into VRAM.
2024-03-28 22:46:23,198 [INFO] (DataBackendFactory) Configuring text embed backend: pelolisu-embed-cache
2024-03-28 22:46:23,199 [INFO] (TextEmbeddingCache) (Rank: 0) (id=pelolisu-embed-cache) Listing all text embed cache entries
2024-03-28 22:46:23,200 [INFO] (DataBackendFactory) Pre-computing null embedding for caption dropout
2024-03-28 22:46:23,280 [INFO] (DataBackendFactory) Completed loading text embed services.
2024-03-28 22:46:23,281 [INFO] (DataBackendFactory) Configuring data backend: pelolisu
2024-03-28 22:46:23,281 [INFO] (DataBackendFactory) Configured backend: {'id': 'pelolisu', 'config': {'vae_cache_clear_each_epoch': False, 'probability': 1.0, 'repeats': 5, 'crop': False, 'crop_aspect': 'square', 'crop_style': 'random', 'resolution': 768, 'resolution_type': 'pixel', 'caption_strategy': 'textfile', 'maximum_image_size': None, 'target_downsample_size': None}, 'dataset_type': 'image'}
2024-03-28 22:46:23,281 [INFO] (DataBackendFactory) (id=pelolisu) Loading bucket manager.
2024-03-28 22:46:23,286 [INFO] (DataBackendFactory) (id=pelolisu) Refreshing aspect buckets on main process.
2024-03-28 22:46:23,287 [INFO] (BaseMetadataBackend) Discovering new files...
2024-03-28 22:46:23,297 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (18).jpg.
2024-03-28 22:46:23,300 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (10).jpg.
2024-03-28 22:46:23,304 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (12).jpg.
2024-03-28 22:46:23,306 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (15).jpg.
2024-03-28 22:46:23,311 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (17).jpg.
2024-03-28 22:46:23,327 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 22:46:23,328 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (1).jpg.
2024-03-28 22:46:23,330 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (2).jpg.
2024-03-28 22:46:23,327 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (19).jpg.
2024-03-28 22:46:23,334 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 22:46:23,334 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 22:46:23,336 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (3).jpg.
2024-03-28 22:46:23,336 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (20).jpg.
2024-03-28 22:46:23,339 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 1, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 22:46:23,342 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (4).jpg.

2024-03-28 22:46:23,346 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 22:46:23,353 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (14).jpg.
2024-03-28 22:46:23,345 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (11).jpg.

2024-03-28 22:46:23,354 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (6).jpg.
2024-03-28 22:46:23,354 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (5).jpg.
2024-03-28 22:46:23,354 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
Generating aspect bucket cache:   0%|                                        | 0/16 [00:00<?, ?it/s]

2024-03-28 22:46:23,366 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 1, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 22:46:23,366 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 22:46:23,367 [DEBUG] (BaseMetadataBackend) Setting metadata for /home/master/wsl-ntfs/dataset/ohwx (18).jpg to {'original_size': (768, 768), 'crop_coordinates': (0, 0), 'target_size': (768, 768), 'aspect_ratio': 1.0, 'luminance': 161.26434628295902}.
2024-03-28 22:46:23,367 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (16).jpg.

2024-03-28 22:46:23,368 [DEBUG] (BaseMetadataBackend) Setting metadata for /home/master/wsl-ntfs/dataset/ohwx (1).jpg to {'original_size': (768, 768), 'crop_coordinates': (0, 0), 'target_size': (768, 768), 'aspect_ratio': 1.0, 'luminance': 126.0229090898302}.
2024-03-28 22:46:23,368 [DEBUG] (BaseMetadataBackend) Received statistics update: {'total_processed': 2, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 22:46:23,373 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 1, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 22:46:23,380 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 1, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 22:46:23,383 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 22:46:23,383 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 22:46:23,385 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.

2024-03-28 22:46:23,386 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 22:46:23,386 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 22:46:23,397 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 22:46:23,388 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}

2024-03-28 22:46:23,403 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 22:46:23,403 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 22:46:23,403 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 22:46:23,403 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.

2024-03-28 22:46:23,404 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 22:46:23,404 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 22:46:23,404 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.

2024-03-28 22:46:23,406 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 22:46:23,406 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 22:46:23,411 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 22:46:23,412 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.

2024-03-28 22:46:23,470 [INFO] (BaseMetadataBackend) Image processing statistics: {'total_processed': 2, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 22:46:23,471 [INFO] (BaseMetadataBackend) Enforcing minimum image size of 768.0. This could take a while for very-large datasets.
2024-03-28 22:46:23,471 [INFO] (BaseMetadataBackend) Completed aspect bucket update.
2024-03-28 22:46:23,472 [DEBUG] (BaseMetadataBackend) Refreshing buckets for rank (Rank: 0)  via data_backend id pelolisu.
2024-03-28 22:46:23,472 [DEBUG] (BaseMetadataBackend) Before updating, in all buckets, we had 2.
2024-03-28 22:46:23,472 [DEBUG] (BaseMetadataBackend) After updating, in all buckets, we had 2.
2024-03-28 22:46:23,473 [DEBUG] (BaseMetadataBackend) Count of items before split: 2
2024-03-28 22:46:23,474 [DEBUG] (BaseMetadataBackend) Trimmed from 2 to 2
2024-03-28 22:46:23,474 [DEBUG] (BaseMetadataBackend) Count of items after split: 2
2024-03-28 22:46:23,474 [INFO] (DataBackendFactory) Configured backend: {'id': 'pelolisu', 'config': {'vae_cache_clear_each_epoch': False, 'probability': 1.0, 'repeats': 5, 'crop': False, 'crop_aspect': 'square', 'crop_style': 'random', 'resolution': 768, 'resolution_type': 'pixel', 'caption_strategy': 'textfile', 'maximum_image_size': None, 'target_downsample_size': None}, 'dataset_type': 'image', 'data_backend': <helpers.data_backend.local.LocalDataBackend object at 0x7fb9c0ddbf40>, 'instance_data_root': '/home/master/wsl-ntfs/dataset', 'metadata_backend': <helpers.metadata.backends.json.JsonMetadataBackend object at 0x7fb9c0ddbb50>}
(Rank: 0)  | Bucket     | Image Count 
------------------------------
(Rank: 0)  | 1.0        | 2           

Loading captions:   0%|                                                                               | 0/16 [00:00<?, ?it/s]
Loading captions: 100%|████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 4040.27it/s]
2024-03-28 22:46:23,481 [INFO] (DataBackendFactory) (id=pelolisu) Initialise text embed pre-computation using the textfile caption strategy. We have 16 captions to process.

Write embeds to disk:   0%|                                                                           | 0/16 [00:00<?, ?it/s]

Processing prompts:   0%|                                                                             | 0/16 [00:00<?, ?it/s]
Write embeds to disk:  12%|████████▍                                                          | 2/16 [00:00<00:00, 19.36it/s]

Processing prompts:  19%|████████████▉                                                        | 3/16 [00:00<00:00, 25.02it/s]
Write embeds to disk:  44%|█████████████████████████████▎                                     | 7/16 [00:00<00:00, 35.06it/s]

Processing prompts:  50%|██████████████████████████████████▌                                  | 8/16 [00:00<00:00, 36.61it/s]
Write embeds to disk:  75%|█████████████████████████████████████████████████▌                | 12/16 [00:00<00:00, 38.15it/s]

Processing prompts:  81%|███████████████████████████████████████████████████████▎            | 13/16 [00:00<00:00, 38.71it/s]

2024-03-28 22:46:23,907 [INFO] (DataBackendFactory) (id=pelolisu) Completed processing 16 captions.
2024-03-28 22:46:23,907 [INFO] (DataBackendFactory) (id=pelolisu) Pre-computing VAE latent space.
2024-03-28 22:46:23,909 [INFO] (DataBackendFactory) Skipping error scan for dataset pelolisu. Set 'scan_for_errors' to True in the dataset config to enable this if your training runs into mismatched latent dimensions.

Processing bucket 1.0:   0%|                                                                           | 0/2 [00:00<?, ?it/s]

2024-03-28 22:46:23,980 [DEBUG] (BaseMetadataBackend) Setting metadata for /home/master/wsl-ntfs/dataset/ohwx (18).jpg to {'original_size': [768, 768], 'crop_coordinates': (0, 0), 'target_size': [768, 768], 'aspect_ratio': 1.0, 'luminance': 161.26434628295902}.
2024-03-28 22:46:23,983 [DEBUG] (BaseMetadataBackend) Setting metadata for /home/master/wsl-ntfs/dataset/ohwx (1).jpg to {'original_size': [768, 768], 'crop_coordinates': (0, 0), 'target_size': [768, 768], 'aspect_ratio': 1.0, 'luminance': 126.0229090898302}.
2024-03-28 22:46:36,302 [INFO] (VAECache) Bucket 1.0 caching results: {'not_local': 0, 'already_cached': 0, 'cached': 0, 'total': 2}
2024-03-28 22:46:36,479 [INFO] (validation) Precomputing the negative prompt embed for validations.
Token indices sequence length is longer than the specified maximum sequence length for this model (129 > 77). Running this sequence through the model will result in indexing errors
2024-03-28 22:46:36,482 [WARNING] (TextEmbeddingCache) The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck']
Token indices sequence length is longer than the specified maximum sequence length for this model (129 > 77). Running this sequence through the model will result in indexing errors
2024-03-28 22:46:36,610 [INFO] (__main__) Unloading text encoders, as they are not being trained.
2024-03-28 22:46:36,823 [INFO] (__main__) After nuking text encoders from orbit, we freed 0.0 GB of VRAM. The real memories were the friends we trained a model on along the way.
2024-03-28 22:46:36,824 [INFO] (__main__) Collected the following data backends: ['pelolisu-embed-cache', 'pelolisu']
2024-03-28 22:46:36,824 [INFO] (__main__) Calculated our maximum training steps at 150 because we have 150 epochs and 1 steps per epoch.
2024-03-28 22:46:36,824 [INFO] (__main__) Loading sine learning rate scheduler with 1000 warmup steps
2024-03-28 22:46:36,829 [INFO] (__main__) Learning rate: 4e-06
2024-03-28 22:46:36,829 [INFO] (__main__) Using DeepSpeed optimizer.
2024-03-28 22:46:36,829 [INFO] (__main__) Using DeepSpeed learning rate scheduler
2024-03-28 22:46:36,843 [INFO] (__main__) Loading our accelerator...
[2024-03-28 22:46:36,848] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.0, git-hash=unknown, git-branch=unknown
[2024-03-28 22:46:37,178] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000004, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1
[2024-03-28 22:46:39,146] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2024-03-28 22:46:39,146] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-03-28 22:46:39,336] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2024-03-28 22:46:39,336] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2024-03-28 22:46:39,336] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-03-28 22:46:39,336] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 500,000,000
[2024-03-28 22:46:39,336] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 500,000,000
[2024-03-28 22:46:39,337] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: True
[2024-03-28 22:46:39,337] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
[2024-03-28 22:46:46,168] [INFO] [utils.py:800:see_memory_usage] Before initializing optimizer states
[2024-03-28 22:46:46,169] [INFO] [utils.py:801:see_memory_usage] MA 6.54 GB         Max_MA 6.54 GB         CA 6.95 GB         Max_CA 7 GB 
[2024-03-28 22:46:46,169] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 19.4 GB, percent = 34.1%
[2024-03-28 22:46:47,164] [INFO] [utils.py:800:see_memory_usage] After initializing optimizer states
[2024-03-28 22:46:47,165] [INFO] [utils.py:801:see_memory_usage] MA 6.54 GB         Max_MA 6.54 GB         CA 6.95 GB         Max_CA 7 GB 
[2024-03-28 22:46:47,165] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 24.19 GB, percent = 42.5%
[2024-03-28 22:46:47,165] [INFO] [stage_1_and_2.py:539:__init__] optimizer state initialized
[2024-03-28 22:46:47,257] [INFO] [utils.py:800:see_memory_usage] After initializing ZeRO optimizer
[2024-03-28 22:46:47,258] [INFO] [utils.py:801:see_memory_usage] MA 6.54 GB         Max_MA 6.54 GB         CA 6.95 GB         Max_CA 7 GB 
[2024-03-28 22:46:47,258] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 24.19 GB, percent = 42.5%
[2024-03-28 22:46:48,849] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2024-03-28 22:46:48,849] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = WarmupLR
[2024-03-28 22:46:48,849] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7fb9c0c76380>
[2024-03-28 22:46:48,849] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[4e-06], mom=[[0.9, 0.999]]
[2024-03-28 22:46:48,852] [INFO] [config.py:996:print] DeepSpeedEngine configuration:
[2024-03-28 22:46:48,852] [INFO] [config.py:1000:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2024-03-28 22:46:48,852] [INFO] [config.py:1000:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-03-28 22:46:48,853] [INFO] [config.py:1000:print]   amp_enabled .................. False
[2024-03-28 22:46:48,853] [INFO] [config.py:1000:print]   amp_params ................... False
[2024-03-28 22:46:48,853] [INFO] [config.py:1000:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2024-03-28 22:46:48,853] [INFO] [config.py:1000:print]   bfloat16_enabled ............. True
[2024-03-28 22:46:48,853] [INFO] [config.py:1000:print]   bfloat16_immediate_grad_update  False
[2024-03-28 22:46:48,854] [INFO] [config.py:1000:print]   checkpoint_parallel_write_pipeline  False
[2024-03-28 22:46:48,854] [INFO] [config.py:1000:print]   checkpoint_tag_validation_enabled  True
[2024-03-28 22:46:48,854] [INFO] [config.py:1000:print]   checkpoint_tag_validation_fail  False
[2024-03-28 22:46:48,854] [INFO] [config.py:1000:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fb9c0ddbf10>
[2024-03-28 22:46:48,854] [INFO] [config.py:1000:print]   communication_data_type ...... None
[2024-03-28 22:46:48,854] [INFO] [config.py:1000:print]   compile_config ............... enabled=False backend='inductor' kwargs={}
[2024-03-28 22:46:48,855] [INFO] [config.py:1000:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-03-28 22:46:48,855] [INFO] [config.py:1000:print]   curriculum_enabled_legacy .... False
[2024-03-28 22:46:48,855] [INFO] [config.py:1000:print]   curriculum_params_legacy ..... False
[2024-03-28 22:46:48,855] [INFO] [config.py:1000:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-03-28 22:46:48,855] [INFO] [config.py:1000:print]   data_efficiency_enabled ...... False
[2024-03-28 22:46:48,855] [INFO] [config.py:1000:print]   dataloader_drop_last ......... False
[2024-03-28 22:46:48,856] [INFO] [config.py:1000:print]   disable_allgather ............ False
[2024-03-28 22:46:48,856] [INFO] [config.py:1000:print]   dump_state ................... False
[2024-03-28 22:46:48,856] [INFO] [config.py:1000:print]   dynamic_loss_scale_args ...... None
[2024-03-28 22:46:48,856] [INFO] [config.py:1000:print]   eigenvalue_enabled ........... False
[2024-03-28 22:46:48,856] [INFO] [config.py:1000:print]   eigenvalue_gas_boundary_resolution  1
[2024-03-28 22:46:48,856] [INFO] [config.py:1000:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2024-03-28 22:46:48,857] [INFO] [config.py:1000:print]   eigenvalue_layer_num ......... 0
[2024-03-28 22:46:48,857] [INFO] [config.py:1000:print]   eigenvalue_max_iter .......... 100
[2024-03-28 22:46:48,857] [INFO] [config.py:1000:print]   eigenvalue_stability ......... 1e-06
[2024-03-28 22:46:48,857] [INFO] [config.py:1000:print]   eigenvalue_tol ............... 0.01
[2024-03-28 22:46:48,857] [INFO] [config.py:1000:print]   eigenvalue_verbose ........... False
[2024-03-28 22:46:48,857] [INFO] [config.py:1000:print]   elasticity_enabled ........... False
[2024-03-28 22:46:48,857] [INFO] [config.py:1000:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2024-03-28 22:46:48,858] [INFO] [config.py:1000:print]   fp16_auto_cast ............... None
[2024-03-28 22:46:48,858] [INFO] [config.py:1000:print]   fp16_enabled ................. False
[2024-03-28 22:46:48,858] [INFO] [config.py:1000:print]   fp16_master_weights_and_gradients  False
[2024-03-28 22:46:48,858] [INFO] [config.py:1000:print]   global_rank .................. 0
[2024-03-28 22:46:48,858] [INFO] [config.py:1000:print]   grad_accum_dtype ............. None
[2024-03-28 22:46:48,858] [INFO] [config.py:1000:print]   gradient_accumulation_steps .. 1
[2024-03-28 22:46:48,859] [INFO] [config.py:1000:print]   gradient_clipping ............ 0.0
[2024-03-28 22:46:48,859] [INFO] [config.py:1000:print]   gradient_predivide_factor .... 1.0
[2024-03-28 22:46:48,859] [INFO] [config.py:1000:print]   graph_harvesting ............. False
[2024-03-28 22:46:48,859] [INFO] [config.py:1000:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-03-28 22:46:48,859] [INFO] [config.py:1000:print]   initial_dynamic_scale ........ 1
[2024-03-28 22:46:48,859] [INFO] [config.py:1000:print]   load_universal_checkpoint .... False
[2024-03-28 22:46:48,860] [INFO] [config.py:1000:print]   loss_scale ................... 1.0
[2024-03-28 22:46:48,860] [INFO] [config.py:1000:print]   memory_breakdown ............. False
[2024-03-28 22:46:48,860] [INFO] [config.py:1000:print]   mics_hierarchial_params_gather  False
[2024-03-28 22:46:48,860] [INFO] [config.py:1000:print]   mics_shard_size .............. -1
[2024-03-28 22:46:48,860] [INFO] [config.py:1000:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-03-28 22:46:48,860] [INFO] [config.py:1000:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2024-03-28 22:46:48,861] [INFO] [config.py:1000:print]   optimizer_legacy_fusion ...... False
[2024-03-28 22:46:48,861] [INFO] [config.py:1000:print]   optimizer_name ............... adamw
[2024-03-28 22:46:48,861] [INFO] [config.py:1000:print]   optimizer_params ............. {'lr': 4e-06, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.01}
[2024-03-28 22:46:48,861] [INFO] [config.py:1000:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-03-28 22:46:48,861] [INFO] [config.py:1000:print]   pld_enabled .................. False
[2024-03-28 22:46:48,861] [INFO] [config.py:1000:print]   pld_params ................... False
[2024-03-28 22:46:48,861] [INFO] [config.py:1000:print]   prescale_gradients ........... False
[2024-03-28 22:46:48,862] [INFO] [config.py:1000:print]   scheduler_name ............... WarmupLR
[2024-03-28 22:46:48,862] [INFO] [config.py:1000:print]   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 4e-06, 'warmup_num_steps': 1000}
[2024-03-28 22:46:48,862] [INFO] [config.py:1000:print]   seq_parallel_communication_data_type  torch.float32
[2024-03-28 22:46:48,862] [INFO] [config.py:1000:print]   sparse_attention ............. None
[2024-03-28 22:46:48,862] [INFO] [config.py:1000:print]   sparse_gradients_enabled ..... False
[2024-03-28 22:46:48,862] [INFO] [config.py:1000:print]   steps_per_print .............. inf
[2024-03-28 22:46:48,862] [INFO] [config.py:1000:print]   train_batch_size ............. 1
[2024-03-28 22:46:48,863] [INFO] [config.py:1000:print]   train_micro_batch_size_per_gpu  1
[2024-03-28 22:46:48,863] [INFO] [config.py:1000:print]   use_data_before_expert_parallel_  False
[2024-03-28 22:46:48,863] [INFO] [config.py:1000:print]   use_node_local_storage ....... False
[2024-03-28 22:46:48,863] [INFO] [config.py:1000:print]   wall_clock_breakdown ......... False
[2024-03-28 22:46:48,863] [INFO] [config.py:1000:print]   weight_quantization_config ... None
[2024-03-28 22:46:48,863] [INFO] [config.py:1000:print]   world_size ................... 1
[2024-03-28 22:46:48,864] [INFO] [config.py:1000:print]   zero_allow_untested_optimizer  False
[2024-03-28 22:46:48,864] [INFO] [config.py:1000:print]   zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-03-28 22:46:48,864] [INFO] [config.py:1000:print]   zero_enabled ................. True
[2024-03-28 22:46:48,864] [INFO] [config.py:1000:print]   zero_force_ds_cpu_optimizer .. True
[2024-03-28 22:46:48,864] [INFO] [config.py:1000:print]   zero_optimization_stage ...... 2
[2024-03-28 22:46:48,865] [INFO] [config.py:986:print_user_config]   json = {
    "train_batch_size": 1, 
    "train_micro_batch_size_per_gpu": 1, 
    "gradient_accumulation_steps": 1, 
    "zero_optimization": {
        "stage": 2, 
        "offload_optimizer": {
            "device": "cpu", 
            "nvme_path": null
        }, 
        "offload_param": {
            "device": "cpu", 
            "nvme_path": null, 
            "pin_memory": false
        }, 
        "stage3_gather_16bit_weights_on_model_save": false
    }, 
    "steps_per_print": inf, 
    "bf16": {
        "enabled": true
    }, 
    "fp16": {
        "enabled": false
    }, 
    "optimizer": {
        "type": "AdamW", 
        "params": {
            "lr": 4e-06, 
            "betas": [0.9, 0.999], 
            "eps": 1e-08, 
            "weight_decay": 0.01
        }
    }, 
    "scheduler": {
        "type": "WarmupLR", 
        "params": {
            "warmup_min_lr": 0, 
            "warmup_max_lr": 4e-06, 
            "warmup_num_steps": 1000
        }
    }
}
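For anyone trying to reproduce this setup outside SimpleTuner, the user config printed above maps roughly onto an Accelerate DeepSpeed plugin. This is a sketch under the assumption of an Accelerate-style launch; SimpleTuner wires DeepSpeed up through its own config, so the names below are illustrative.

# Rough Accelerate equivalent of the DeepSpeed user config printed above.
# Sketch only: SimpleTuner builds its own config; names below are illustrative.
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,                    # "zero_optimization": {"stage": 2}
    offload_optimizer_device="cpu",  # optimizer state offloaded to CPU RAM
    offload_param_device="cpu",      # parameters offloaded to CPU RAM
    gradient_accumulation_steps=1,
)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=deepspeed_plugin)

With both the optimizer state and the parameters offloaded to the CPU, system RAM usage grows with model size, which is consistent with the 64 GB consumption reported later in this thread.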
2024-03-28 22:46:48,865 [INFO] (__main__) After removing any undesired samples and updating cache entries, we have settled on 150 epochs and 1 steps per epoch.
2024-03-28 22:46:48,982 [INFO] (__main__) After the VAE from orbit, we freed 0.0 MB of VRAM.
2024-03-28 22:46:48,983 [INFO] (__main__) Checkpoint 'latest' does not exist. Starting a new training run.
wandb: Currently logged in as: elismasilva (devaiexp). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.16.5
wandb: Run data is saved locally in /mnt/f/Projetos/SimpleTuner/wandb/run-20240328_224650-83526739fc703b23a5df47f19f20efab
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run simpletuner-sdxl
wandb: ⭐️ View project at https://wandb.ai/devaiexp/sdxl-training
wandb: 🚀 View run at https://wandb.ai/devaiexp/sdxl-training/runs/83526739fc703b23a5df47f19f20efab/workspace
2024-03-28 22:46:52,929 [INFO] (__main__) ***** Running training *****
2024-03-28 22:46:52,929 [INFO] (__main__)  -> Num batches = 1
2024-03-28 22:46:52,929 [INFO] (__main__)  -> Num Epochs = 150
2024-03-28 22:46:52,930 [INFO] (__main__)  -> Current Epoch = 1
2024-03-28 22:46:52,930 [INFO] (__main__)  -> Instantaneous batch size per device = 2
2024-03-28 22:46:52,931 [INFO] (__main__)  -> Gradient Accumulation steps = 1
2024-03-28 22:46:52,931 [INFO] (__main__)    -> Total train batch size (w. parallel, distributed & accumulation) = 2
2024-03-28 22:46:52,931 [INFO] (__main__)  -> Total optimization steps = 150
2024-03-28 22:46:52,932 [INFO] (__main__)  -> Total optimization steps remaining = 150

Epoch 1/150, Steps:   0%|          | 0/150 [00:00<?, ?it/s]
Epoch 1/150, Steps:   1%|          | 1/150 [02:00<4:57:11, 119.67s/it, lr=0, step_loss=0.00824]
Epoch 1/150, Steps:   1%|          | 2/150 [02:19<2:29:21, 60.55s/it, lr=4.01e-7, step_loss=0.00588]
Epoch 1/150, Steps:   2%|          | 3/150 [02:37<1:40:49, 41.15s/it, lr=6.36e-7, step_loss=0.142]
Epoch 1/150, Steps:   3%|          | 4/150 [02:55<1:17:52, 32.00s/it, lr=8.03e-7, step_loss=0.0735]
Epoch 1/150, Steps:   3%|          | 5/150 [03:13<1:05:01, 26.91s/it, lr=9.32e-7, step_loss=0.232]
Epoch 2/150, Steps:   4%|          | 6/150 [03:31<57:33, 23.98s/it, lr=1.04e-6, step_loss=0.134]
Epoch 2/150, Steps:   5%|          | 7/150 [03:49<52:27, 22.01s/it, lr=1.13e-6, step_loss=0.205]
Epoch 2/150, Steps:   5%|          | 8/150 [04:07<49:03, 20.73s/it, lr=1.2e-6, step_loss=0.0188]
Epoch 2/150, Steps:   6%|          | 9/150 [04:25<46:38, 19.85s/it, lr=1.27e-6, step_loss=0.124]
Epoch 2/150, Steps:   7%|          | 10/150 [04:43<44:58, 19.27s/it, lr=1.33e-6, step_loss=0.107]
bghira commented 7 months ago

ah, i missed that you'd already mentioned you'd done that, i'm sorry - there's an issue i've identified where it appears debug logging is enabled, but that's only in the metadata module.

you'll have to set SIMPLETUNER_LOG_LEVEL=DEBUG for more info on why the images aren't getting captured.
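
For anyone following along, one way to pass that variable through to the trainer is sketched below; the launch command is an assumption, since SimpleTuner is normally started via its shell scripts.

# Hypothetical relaunch of the trainer with metadata debug logging enabled.
# SimpleTuner is usually started via its shell scripts; the command below is
# an illustrative stand-in, not the project's documented entry point.
import os
import subprocess

env = dict(os.environ, SIMPLETUNER_LOG_LEVEL="DEBUG")
subprocess.run(["python", "train_sdxl.py"], env=env, check=True)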

elismasilva commented 7 months ago

I've noticed that of my 16 images, only 2 go into the bucket:


2024-03-28 22:46:23,471 [INFO] (BaseMetadataBackend) Enforcing minimum image size of 768.0. This could take a while for very-large datasets.
2024-03-28 22:46:23,471 [INFO] (BaseMetadataBackend) Completed aspect bucket update.
2024-03-28 22:46:23,472 [DEBUG] (BaseMetadataBackend) Refreshing buckets for rank (Rank: 0)  via data_backend id pelolisu.
2024-03-28 22:46:23,472 [DEBUG] (BaseMetadataBackend) Before updating, in all buckets, we had 2.
2024-03-28 22:46:23,472 [DEBUG] (BaseMetadataBackend) After updating, in all buckets, we had 2.
2024-03-28 22:46:23,473 [DEBUG] (BaseMetadataBackend) Count of items before split: 2
2024-03-28 22:46:23,474 [DEBUG] (BaseMetadataBackend) Trimmed from 2 to 2
2024-03-28 22:46:23,474 [DEBUG] (BaseMetadataBackend) Count of items after split: 2
2024-03-28 22:46:23,474 [INFO] (DataBackendFactory) Configured backend: {'id': 'pelolisu', 'config': {'vae_cache_clear_each_epoch': False, 'probability': 1.0, 'repeats': 5, 'crop': False, 'crop_aspect': 'square', 'crop_style': 'random', 'resolution': 768, 'resolution_type': 'pixel', 'caption_strategy': 'textfile', 'maximum_image_size': None, 'target_downsample_size': None}, 'dataset_type': 'image', 'data_backend': <helpers.data_backend.local.LocalDataBackend object at 0x7fb9c0ddbf40>, 'instance_data_root': '/home/master/wsl-ntfs/dataset', 'metadata_backend': <helpers.metadata.backends.json.JsonMetadataBackend object at 0x7fb9c0ddbb50>}
(Rank: 0)  | Bucket     | Image Count 
------------------------------
(Rank: 0)  | 1.0        | 2     
bghira commented 7 months ago

for subsequent tests it's kind of annoying, but images it rejects will be recorded in the metadata json files, and it will not scan them again. you'll have to remove those files for each test.
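
A minimal sketch of the per-test cleanup this implies, using the metadata filenames that appear in the debug logs further down; the dataset path is the one from the config above and should be adjusted to your own.

# Sketch: clear SimpleTuner's per-dataset metadata caches between test runs so
# previously rejected images are scanned again. Filenames are taken from the
# debug log in this thread; adjust the dataset path to your own config.
from pathlib import Path

dataset_dir = Path("/home/master/wsl-ntfs/dataset")
for name in ("aspect_ratio_bucket_indices.json", "aspect_ratio_bucket_metadata.json"):
    cache_file = dataset_dir / name
    if cache_file.exists():
        cache_file.unlink()
        print(f"removed {cache_file}")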

elismasilva commented 7 months ago

It behaves strangely. I deleted all the folders and JSON files as you said, and ran it again with the DEBUG variable. Now it says it hasn't found any images. You can see from the log that it finds the files and processes them, but for some reason they do not enter the bucket.

2024-03-28 23:10:24,027 [DEBUG] (StateTracker) Setting model type to sdxl
2024-03-28 23:10:24,031 [INFO] (ArgsParser) VAE Model: madebyollin/sdxl-vae-fp16-fix
2024-03-28 23:10:24,031 [INFO] (ArgsParser) Default VAE Cache location: /home/master/wsl-ntfs/dataset/models/cache_vae
2024-03-28 23:10:24,031 [INFO] (ArgsParser) Text Cache location: cache
[2024-03-28 23:10:24,070] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-28 23:10:24,252] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-03-28 23:10:24,253] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
2024-03-28 23:10:24,279 [INFO] (__main__) Updated gradient_accumulation_steps to the value provided by DeepSpeed: 1
2024-03-28 23:10:24,280 [INFO] (__main__) Enabling tf32 precision boost for NVIDIA devices due to --allow_tf32.
2024-03-28 23:10:24,280 [INFO] (__main__) Load tokenizers
2024-03-28 23:10:24,402 [INFO] (__main__) Load text encoder 1..
2024-03-28 23:10:25,033 [INFO] (__main__) Load text encoder 2..
2024-03-28 23:10:28,340 [INFO] (__main__) Load VAE..
2024-03-28 23:10:28,743 [INFO] (__main__) Moving models to GPU. Almost there.
2024-03-28 23:10:29,227 [INFO] (__main__) Creating the U-net..
2024-03-28 23:10:31,136 [INFO] (__main__) Moving the U-net to GPU.
2024-03-28 23:10:44,356 [INFO] (__main__) Enabling xformers memory-efficient attention.
2024-03-28 23:10:46,379 [INFO] (__main__) Initialising VAE in bf16 precision, you may specify a different value if preferred: bf16, fp16, fp32, default
2024-03-28 23:10:46,380 [DEBUG] (__main__) Initialising VAE with weight dtype torch.bfloat16
2024-03-28 23:10:46,603 [INFO] (__main__) Loaded VAE into VRAM.
2024-03-28 23:10:46,624 [DEBUG] (PromptHandler) Initialising Compel prompt manager with dual text encoders.
2024-03-28 23:10:46,625 [INFO] (DataBackendFactory) Configuring text embed backend: pelolisu-embed-cache
2024-03-28 23:10:46,626 [INFO] (TextEmbeddingCache) (Rank: 0) (id=pelolisu-embed-cache) Listing all text embed cache entries
2024-03-28 23:10:46,626 [DEBUG] (LocalDataBackend) LocalDataBackend.list_files: str_pattern=*.pt, instance_data_root=/home/master/wsl-ntfs/textembed_cache
2024-03-28 23:10:46,631 [DEBUG] (StateTracker) set_text_cache_files found 19 images.
2024-03-28 23:10:46,631 [DEBUG] (TextEmbeddingCache) (Rank: 0) (id=pelolisu-embed-cache)  -> done listing all text embed cache entries
2024-03-28 23:10:46,631 [DEBUG] (DataBackendFactory) Set the default text embed cache to pelolisu-embed-cache.
2024-03-28 23:10:46,632 [INFO] (DataBackendFactory) Pre-computing null embedding for caption dropout
2024-03-28 23:10:46,632 [DEBUG] (TextEmbeddingCache) Initialising validations...
2024-03-28 23:10:46,632 [DEBUG] (TextEmbeddingCache) Hashing caption: 
2024-03-28 23:10:46,633 [DEBUG] (TextEmbeddingCache) -> d41d8cd98f00b204e9800998ecf8427e-sdxl
2024-03-28 23:10:46,633 [DEBUG] (TextEmbeddingCache) (Rank: 0) (id=pelolisu-embed-cache) All prompts are cached, ignoring (uncached_prompts=[], is_validation=False, return_concat=False)
2024-03-28 23:10:46,748 [INFO] (DataBackendFactory) Completed loading text embed services.
2024-03-28 23:10:46,749 [INFO] (DataBackendFactory) Configuring data backend: pelolisu
2024-03-28 23:10:46,749 [INFO] (DataBackendFactory) Configured backend: {'id': 'pelolisu', 'config': {'vae_cache_clear_each_epoch': False, 'probability': 1.0, 'repeats': 5, 'crop': False, 'crop_aspect': 'square', 'crop_style': 'random', 'resolution': 768, 'resolution_type': 'pixel', 'caption_strategy': 'textfile', 'maximum_image_size': None, 'target_downsample_size': None}, 'dataset_type': 'image'}
2024-03-28 23:10:46,749 [INFO] (DataBackendFactory) (id=pelolisu) Loading bucket manager.
2024-03-28 23:10:46,754 [DEBUG] (LocalDataBackend) Checking if /home/master/wsl-ntfs/dataset/aspect_ratio_bucket_indices.json exists = False
2024-03-28 23:10:46,755 [INFO] (DataBackendFactory) (id=pelolisu) Refreshing aspect buckets on main process.
2024-03-28 23:10:46,755 [INFO] (BaseMetadataBackend) Discovering new files...
2024-03-28 23:10:46,755 [DEBUG] (LocalDataBackend) LocalDataBackend.list_files: str_pattern=*.[jJpP][pPnN][gG], instance_data_root=/home/master/wsl-ntfs/dataset
2024-03-28 23:10:46,762 [DEBUG] (StateTracker) set_image_files found 16 images.
2024-03-28 23:10:46,763 [DEBUG] (LocalDataBackend) Checking if /home/master/wsl-ntfs/dataset/aspect_ratio_bucket_metadata.json exists = False
2024-03-28 23:10:46,768 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (18).jpg.
2024-03-28 23:10:46,772 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (10).jpg.
2024-03-28 23:10:46,776 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (12).jpg.
2024-03-28 23:10:46,781 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (15).jpg.
2024-03-28 23:10:46,787 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (17).jpg.
2024-03-28 23:10:46,788 [DEBUG] (MultiaspectImage) Processing image filename: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD071C0>
2024-03-28 23:10:46,788 [DEBUG] (MultiaspectImage) Image size before EXIF transform: (768, 768)
2024-03-28 23:10:46,791 [DEBUG] (MultiaspectImage) Processing image filename: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD072E0>
2024-03-28 23:10:46,791 [DEBUG] (MultiaspectImage) Image size before EXIF transform: (768, 768)
2024-03-28 23:10:46,791 [DEBUG] (MultiaspectImage) Image size after EXIF transform: (768, 768)
2024-03-28 23:10:46,792 [DEBUG] (MultiaspectImage) Dataset: pelolisu, maximum_image_size: None, target_downsample_size: None
2024-03-28 23:10:46,792 [DEBUG] (MultiaspectImage) Received image for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD7E440>
2024-03-28 23:10:46,792 [DEBUG] (MultiaspectImage) Image size after EXIF transform: (768, 768)
2024-03-28 23:10:46,793 [DEBUG] (MultiaspectImage) Dataset: pelolisu, maximum_image_size: None, target_downsample_size: None
2024-03-28 23:10:46,793 [DEBUG] (MultiaspectImage) Converted image to RGB for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD07370>
2024-03-28 23:10:46,793 [DEBUG] (MultiaspectImage) Received image for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD07580>
2024-03-28 23:10:46,793 [DEBUG] (MultiaspectImage) Processing image filename: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD073A0>
2024-03-28 23:10:46,794 [DEBUG] (MultiaspectImage) Converted image to RGB for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD7E4D0>
2024-03-28 23:10:46,794 [DEBUG] (MultiaspectImage) Image size before EXIF transform: (768, 768)
2024-03-28 23:10:46,796 [DEBUG] (MultiaspectImage) Processing image filename: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD07DF0>
2024-03-28 23:10:46,797 [DEBUG] (MultiaspectImage) Image size before EXIF transform: (768, 768)
2024-03-28 23:10:46,798 [DEBUG] (MultiaspectImage) Image size after EXIF transform: (768, 768)
2024-03-28 23:10:46,800 [DEBUG] (MultiaspectImage) Dataset: pelolisu, maximum_image_size: None, target_downsample_size: None
2024-03-28 23:10:46,798 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (19).jpg.
2024-03-28 23:10:46,801 [DEBUG] (MultiaspectImage) Received image for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD7E590>
2024-03-28 23:10:46,800 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (2).jpg.
2024-03-28 23:10:46,802 [DEBUG] (MultiaspectImage) Converted image to RGB for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD073D0>
2024-03-28 23:10:46,805 [DEBUG] (MultiaspectImage) Processing image filename: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD07A60>
2024-03-28 23:10:46,805 [DEBUG] (MultiaspectImage) Image size after EXIF transform: (768, 768)
2024-03-28 23:10:46,806 [DEBUG] (MultiaspectImage) Image size before EXIF transform: (768, 768)
2024-03-28 23:10:46,806 [DEBUG] (MultiaspectImage) Dataset: pelolisu, maximum_image_size: None, target_downsample_size: None
2024-03-28 23:10:46,806 [DEBUG] (MultiaspectImage) Received image for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD7E5F0>
2024-03-28 23:10:46,808 [DEBUG] (MultiaspectImage) Converted image to RGB for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD07700>
2024-03-28 23:10:46,810 [DEBUG] (MultiaspectImage) Image size after EXIF transform: (768, 768)
2024-03-28 23:10:46,810 [DEBUG] (JsonMetadataBackend) Image /home/master/wsl-ntfs/dataset/ohwx (10).jpg has aspect ratio 1.0 and size (768, 768).
2024-03-28 23:10:46,811 [DEBUG] (JsonMetadataBackend) Image /home/master/wsl-ntfs/dataset/ohwx (18).jpg has aspect ratio 1.0 and size (768, 768).
2024-03-28 23:10:46,812 [DEBUG] (MultiaspectImage) Dataset: pelolisu, maximum_image_size: None, target_downsample_size: None
2024-03-28 23:10:46,813 [DEBUG] (MultiaspectImage) Received image for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD7A680>
2024-03-28 23:10:46,813 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 23:10:46,815 [DEBUG] (JsonMetadataBackend) Image /home/master/wsl-ntfs/dataset/ohwx (12).jpg has aspect ratio 1.0 and size (768, 768).
2024-03-28 23:10:46,815 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 23:10:46,815 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (3).jpg.
2024-03-28 23:10:46,820 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (4).jpg.
2024-03-28 23:10:46,825 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (5).jpg.
2024-03-28 23:10:46,813 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (20).jpg.
2024-03-28 23:10:46,814 [DEBUG] (MultiaspectImage) Converted image to RGB for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD07EE0>
2024-03-28 23:10:46,814 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 23:10:46,816 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (14).jpg.
2024-03-28 23:10:46,816 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (11).jpg.
2024-03-28 23:10:46,821 [DEBUG] (JsonMetadataBackend) Image /home/master/wsl-ntfs/dataset/ohwx (15).jpg has aspect ratio 1.0 and size (768, 768).
2024-03-28 23:10:46,826 [DEBUG] (MultiaspectImage) Processing image filename: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD07D30>

Generating aspect bucket cache:   0%|                                        | 0/16 [00:00<?, ?it/s]
2024-03-28 23:10:46,831 [DEBUG] (MultiaspectImage) Processing image filename: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD07C10>
2024-03-28 23:10:46,831 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (6).jpg.
2024-03-28 23:10:46,831 [DEBUG] (MultiaspectImage) Processing image filename: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD07E80>
2024-03-28 23:10:46,836 [DEBUG] (JsonMetadataBackend) Image /home/master/wsl-ntfs/dataset/ohwx (17).jpg has aspect ratio 1.0 and size (768, 768).
2024-03-28 23:10:46,848 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 23:10:46,848 [DEBUG] (MultiaspectImage) Image size before EXIF transform: (768, 768)
2024-03-28 23:10:46,838 [DEBUG] (MultiaspectImage) Processing image filename: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD07DC0>
2024-03-28 23:10:46,840 [DEBUG] (MultiaspectImage) Processing image filename: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD07FA0>
2024-03-28 23:10:46,842 [DEBUG] (MultiaspectImage) Processing image filename: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257ADC1510>
2024-03-28 23:10:46,842 [DEBUG] (MultiaspectImage) Image size before EXIF transform: (768, 768)
2024-03-28 23:10:46,847 [DEBUG] (MultiaspectImage) Image size before EXIF transform: (768, 768)
2024-03-28 23:10:46,848 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (1).jpg.
2024-03-28 23:10:46,849 [DEBUG] (BaseMetadataBackend) Processing file /home/master/wsl-ntfs/dataset/ohwx (16).jpg.
2024-03-28 23:10:46,850 [DEBUG] (MultiaspectImage) Processing image filename: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD07CA0>
2024-03-28 23:10:46,850 [DEBUG] (MultiaspectImage) Image size before EXIF transform: (768, 768)
2024-03-28 23:10:46,855 [DEBUG] (MultiaspectImage) Processing image filename: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD7E7A0>
2024-03-28 23:10:46,855 [DEBUG] (MultiaspectImage) Image size before EXIF transform: (768, 768)
2024-03-28 23:10:46,858 [DEBUG] (MultiaspectImage) Image size after EXIF transform: (768, 768)
2024-03-28 23:10:46,858 [DEBUG] (MultiaspectImage) Dataset: pelolisu, maximum_image_size: None, target_downsample_size: None
2024-03-28 23:10:46,859 [DEBUG] (MultiaspectImage) Received image for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD767A0>
2024-03-28 23:10:46,859 [DEBUG] (MultiaspectImage) Image size after EXIF transform: (768, 768)
2024-03-28 23:10:46,860 [DEBUG] (MultiaspectImage) Converted image to RGB for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD07E20>
2024-03-28 23:10:46,860 [DEBUG] (MultiaspectImage) Image size after EXIF transform: (768, 768)
2024-03-28 23:10:46,861 [DEBUG] (MultiaspectImage) Image size after EXIF transform: (768, 768)
2024-03-28 23:10:46,856 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 23:10:46,856 [DEBUG] (MultiaspectImage) Image size before EXIF transform: (768, 768)
2024-03-28 23:10:46,858 [DEBUG] (MultiaspectImage) Image size before EXIF transform: (768, 768)
2024-03-28 23:10:46,863 [DEBUG] (MultiaspectImage) Image size before EXIF transform: (768, 768)
2024-03-28 23:10:46,864 [DEBUG] (MultiaspectImage) Image size after EXIF transform: (768, 768)
2024-03-28 23:10:46,864 [DEBUG] (MultiaspectImage) Image size after EXIF transform: (768, 768)
2024-03-28 23:10:46,860 [DEBUG] (MultiaspectImage) Dataset: pelolisu, maximum_image_size: None, target_downsample_size: None
2024-03-28 23:10:46,864 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 23:10:46,868 [DEBUG] (MultiaspectImage) Processing image filename: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257ADC5570>
2024-03-28 23:10:46,863 [DEBUG] (MultiaspectImage) Image size after EXIF transform: (768, 768)
2024-03-28 23:10:46,863 [DEBUG] (MultiaspectImage) Dataset: pelolisu, maximum_image_size: None, target_downsample_size: None
2024-03-28 23:10:46,864 [DEBUG] (MultiaspectImage) Dataset: pelolisu, maximum_image_size: None, target_downsample_size: None
2024-03-28 23:10:46,864 [DEBUG] (MultiaspectImage) Received image for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD76950>
2024-03-28 23:10:46,864 [DEBUG] (MultiaspectImage) Dataset: pelolisu, maximum_image_size: None, target_downsample_size: None
2024-03-28 23:10:46,868 [DEBUG] (MultiaspectImage) Processing image filename: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257ADC13C0>
2024-03-28 23:10:46,870 [DEBUG] (MultiaspectImage) Image size before EXIF transform: (768, 768)
2024-03-28 23:10:46,871 [DEBUG] (MultiaspectImage) Image size after EXIF transform: (768, 768)
2024-03-28 23:10:46,868 [DEBUG] (MultiaspectImage) Dataset: pelolisu, maximum_image_size: None, target_downsample_size: None
2024-03-28 23:10:46,869 [DEBUG] (MultiaspectImage) Received image for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD769E0>
2024-03-28 23:10:46,869 [DEBUG] (MultiaspectImage) Received image for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257ADC15D0>
2024-03-28 23:10:46,869 [DEBUG] (JsonMetadataBackend) Image /home/master/wsl-ntfs/dataset/ohwx (2).jpg has aspect ratio 1.0 and size (768, 768).
2024-03-28 23:10:46,870 [DEBUG] (MultiaspectImage) Converted image to RGB for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD07BB0>
2024-03-28 23:10:46,870 [DEBUG] (MultiaspectImage) Received image for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD7E6B0>
2024-03-28 23:10:46,871 [DEBUG] (MultiaspectImage) Image size after EXIF transform: (768, 768)
2024-03-28 23:10:46,871 [DEBUG] (MultiaspectImage) Image size before EXIF transform: (768, 768)
2024-03-28 23:10:46,872 [DEBUG] (MultiaspectImage) Dataset: pelolisu, maximum_image_size: None, target_downsample_size: None
2024-03-28 23:10:46,873 [DEBUG] (MultiaspectImage) Converted image to RGB for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD07D60>
2024-03-28 23:10:46,872 [DEBUG] (MultiaspectImage) Dataset: pelolisu, maximum_image_size: None, target_downsample_size: None
2024-03-28 23:10:46,872 [DEBUG] (MultiaspectImage) Processing image filename: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD3CC70>
2024-03-28 23:10:46,872 [DEBUG] (MultiaspectImage) Dataset: pelolisu, maximum_image_size: None, target_downsample_size: None
2024-03-28 23:10:46,873 [DEBUG] (MultiaspectImage) Received image for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD76740>
2024-03-28 23:10:46,873 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 23:10:46,873 [DEBUG] (MultiaspectImage) Converted image to RGB for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257ADC1450>
2024-03-28 23:10:46,879 [DEBUG] (JsonMetadataBackend) Image /home/master/wsl-ntfs/dataset/ohwx (4).jpg has aspect ratio 1.0 and size (768, 768).
2024-03-28 23:10:46,880 [DEBUG] (MultiaspectImage) Converted image to RGB for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD7E800>
2024-03-28 23:10:46,881 [DEBUG] (MultiaspectImage) Image size before EXIF transform: (768, 768)
2024-03-28 23:10:46,882 [DEBUG] (MultiaspectImage) Converted image to RGB for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD068F0>
2024-03-28 23:10:46,880 [DEBUG] (MultiaspectImage) Received image for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD768C0>
2024-03-28 23:10:46,881 [DEBUG] (MultiaspectImage) Image size after EXIF transform: (768, 768)
2024-03-28 23:10:46,883 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 23:10:46,881 [DEBUG] (MultiaspectImage) Received image for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD76830>
2024-03-28 23:10:46,882 [DEBUG] (MultiaspectImage) Received image for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257ADC1480>
2024-03-28 23:10:46,886 [DEBUG] (MultiaspectImage) Converted image to RGB for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD07F10>
2024-03-28 23:10:46,883 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 23:10:46,887 [DEBUG] (MultiaspectImage) Dataset: pelolisu, maximum_image_size: None, target_downsample_size: None
2024-03-28 23:10:46,887 [DEBUG] (MultiaspectImage) Received image for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257ADC5630>
2024-03-28 23:10:46,888 [DEBUG] (JsonMetadataBackend) Image /home/master/wsl-ntfs/dataset/ohwx (14).jpg has aspect ratio 1.0 and size (768, 768).
2024-03-28 23:10:46,888 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 1, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 23:10:46,889 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 23:10:46,889 [DEBUG] (JsonMetadataBackend) Image /home/master/wsl-ntfs/dataset/ohwx (11).jpg has aspect ratio 1.0 and size (768, 768).
2024-03-28 23:10:46,890 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 1, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 23:10:46,891 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 23:10:46,891 [DEBUG] (JsonMetadataBackend) Image /home/master/wsl-ntfs/dataset/ohwx (5).jpg has aspect ratio 1.0 and size (768, 768).
2024-03-28 23:10:46,892 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 23:10:46,893 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 23:10:46,893 [DEBUG] (JsonMetadataBackend) Image /home/master/wsl-ntfs/dataset/ohwx (19).jpg has aspect ratio 1.0 and size (768, 768).
2024-03-28 23:10:46,894 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 23:10:46,895 [DEBUG] (MultiaspectImage) Image size after EXIF transform: (768, 768)
2024-03-28 23:10:46,884 [DEBUG] (MultiaspectImage) Converted image to RGB for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD07CD0>
2024-03-28 23:10:46,888 [DEBUG] (MultiaspectImage) Converted image to RGB for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257ADC54B0>
2024-03-28 23:10:46,897 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 23:10:46,900 [DEBUG] (JsonMetadataBackend) Image /home/master/wsl-ntfs/dataset/ohwx (16).jpg has aspect ratio 1.0 and size (768, 768).
2024-03-28 23:10:46,900 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 1, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 23:10:46,901 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 23:10:46,903 [DEBUG] (JsonMetadataBackend) Image /home/master/wsl-ntfs/dataset/ohwx (3).jpg has aspect ratio 1.0 and size (768, 768).
2024-03-28 23:10:46,896 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 23:10:46,896 [DEBUG] (MultiaspectImage) Converted image to RGB for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257ADC1300>
2024-03-28 23:10:46,908 [DEBUG] (JsonMetadataBackend) Image /home/master/wsl-ntfs/dataset/ohwx (1).jpg has aspect ratio 1.0 and size (768, 768).
2024-03-28 23:10:46,904 [DEBUG] (MultiaspectImage) Dataset: pelolisu, maximum_image_size: None, target_downsample_size: None
2024-03-28 23:10:46,908 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 1, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 23:10:46,908 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 23:10:46,909 [DEBUG] (MultiaspectImage) Received image for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD76A70>
2024-03-28 23:10:46,912 [DEBUG] (JsonMetadataBackend) Image /home/master/wsl-ntfs/dataset/ohwx (20).jpg has aspect ratio 1.0 and size (768, 768).
2024-03-28 23:10:46,915 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 23:10:46,915 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 23:10:46,915 [DEBUG] (MultiaspectImage) Converted image to RGB for processing: <PIL.Image.Image image mode=RGB size=768x768 at 0x7F257AD078E0>
2024-03-28 23:10:46,916 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 23:10:46,917 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.
2024-03-28 23:10:46,923 [DEBUG] (JsonMetadataBackend) Image /home/master/wsl-ntfs/dataset/ohwx (6).jpg has aspect ratio 1.0 and size (768, 768).
2024-03-28 23:10:46,924 [DEBUG] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 23:10:46,925 [DEBUG] (BaseMetadataBackend) Bucket worker completed processing. Returning to main thread.

2024-03-28 23:10:46,938 [INFO] (BaseMetadataBackend) Image processing statistics: {'total_processed': 0, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-03-28 23:10:46,939 [DEBUG] (JsonMetadataBackend) save_cache has config to write: {'vae_cache_clear_each_epoch': False, 'probability': 1.0, 'repeats': 5, 'crop': False, 'crop_aspect': 'square', 'crop_style': 'random', 'resolution': 768, 'resolution_type': 'pixel', 'caption_strategy': 'textfile', 'maximum_image_size': None, 'target_downsample_size': None}
2024-03-28 23:10:46,940 [INFO] (BaseMetadataBackend) Completed aspect bucket update.
2024-03-28 23:10:46,940 [DEBUG] (BaseMetadataBackend) Refreshing buckets for rank (Rank: 0)  via data_backend id pelolisu.
2024-03-28 23:10:46,940 [DEBUG] (BaseMetadataBackend) Before updating, in all buckets, we had 0.
2024-03-28 23:10:46,941 [DEBUG] (BaseMetadataBackend) After updating, in all buckets, we had 0.
2024-03-28 23:10:46,941 [DEBUG] (JsonMetadataBackend) save_cache has config to write: {'vae_cache_clear_each_epoch': False, 'probability': 1.0, 'repeats': 5, 'crop': False, 'crop_aspect': 'square', 'crop_style': 'random', 'resolution': 768, 'resolution_type': 'pixel', 'caption_strategy': 'textfile', 'maximum_image_size': None, 'target_downsample_size': None}
2024-03-28 23:10:46,942 [DEBUG] (BaseMetadataBackend) Count of items before split: 0
2024-03-28 23:10:46,943 [DEBUG] (BaseMetadataBackend) Count of items after split: 0
2024-03-28 23:10:46,943 [INFO] (DataBackendFactory) Configured backend: {'id': 'pelolisu', 'config': {'vae_cache_clear_each_epoch': False, 'probability': 1.0, 'repeats': 5, 'crop': False, 'crop_aspect': 'square', 'crop_style': 'random', 'resolution': 768, 'resolution_type': 'pixel', 'caption_strategy': 'textfile', 'maximum_image_size': None, 'target_downsample_size': None}, 'dataset_type': 'image', 'data_backend': <helpers.data_backend.local.LocalDataBackend object at 0x7f257ad07c70>, 'instance_data_root': '/home/master/wsl-ntfs/dataset', 'metadata_backend': <helpers.metadata.backends.json.JsonMetadataBackend object at 0x7f257ad07b50>}
(Rank: 0)  | Bucket     | Image Count 
------------------------------
2024-03-28 23:10:46,946 [ERROR] (__main__) No images were discovered by the bucket manager in the dataset: pelolisu., traceback: Traceback (most recent call last):
  File "/mnt/f/Projetos/SimpleTuner/train_sdxl.py", line 428, in main
    configure_multi_databackend(
  File "/mnt/f/Projetos/SimpleTuner/helpers/data_backend/factory.py", line 492, in configure_multi_databackend
    raise Exception(
Exception: No images were discovered by the bucket manager in the dataset: pelolisu.
elismasilva commented 7 months ago

Well, I changed my dataset from jpg to png and now everything is in the bucket. I thought changing 1024 to 768 would reduce GPU VRAM usage from 11 GB to 8 GB, but it didn't change. Meanwhile, DeepSpeed used my full 64 GB haha.

bghira commented 7 months ago

thanks, that's helpful

elismasilva commented 7 months ago

This error is very persistent. I removed the JSON files and the folder that was created in order to start a new training run, and it happened again. I think it is using some hidden cache, because when I changed .jpg to .png it worked; now I think I need to rename the files for it to work.

Exception: No images were discovered by the bucket manager in the dataset: pelolisu.

bghira commented 6 months ago

i think i've resolved it. what happened is there's a list of images inside the cache holding all of the images the trainer processed or attempted to process. this was done to prevent the trainer from walking the same images that can never be cached (e.g. there was some error).

however, DELETE_ERRORED_FILES accomplishes the same task but more fundamentally by deleting training samples that can't be loaded.

most of the time, scanning the samples repeatedly is more desirable than simply ignoring them forever.

so, that logic is now reworked so that the new files are the ones that do not exist in any aspect bucket, rather than the ones we have attempted to process.
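
A minimal sketch of the reworked rule described above; the data shapes here are illustrative, not SimpleTuner's actual structures.

# Sketch of the reworked rule: a file is "new" if it is absent from every
# aspect bucket, regardless of whether a previous run attempted to process it.
def find_new_files(all_image_paths, aspect_buckets):
    bucketed = {path for paths in aspect_buckets.values() for path in paths}
    return [p for p in all_image_paths if p not in bucketed]

# Old behaviour, for contrast: anything in the "attempted" list was skipped
# forever, even if it never landed in a bucket due to a transient error.
def find_new_files_old(all_image_paths, previously_attempted):
    return [p for p in all_image_paths if p not in previously_attempted]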

elismasilva commented 6 months ago

Thank you, I will test again later!

BuildBackBuehler commented 6 months ago

I'm also having difficulties with this. I haven't even found a workaround. @elismasilva when you say you changed the .jpgs to .pngs -- do you mean just changing the extension (filepath), or that you converted the actual images?

For me, it started with an error about my batch size vs. the number of images found (I have 10 images, but the metadata backend would find 5, other times 7, 8 or 9 with each trial. It was random; all 10 could be found, some more often than others). I very meticulously cropped my photos to 768x1344 and used HQ photos. By file size (MB), the PNGs were roughly equivalent to the megapixel value I used for my resolution (1.032). I fiddled around with that, but it didn't seem to matter much if I increased the config's value to 1.3 MP (just in case a high-megabyte file would be preemptively excluded). I also played around with setting minimum image size to 0.1, maximum image size to 1.5 and downsample to 1.2, and toggled crop=false to true. I deleted the cache .pt file and generally also deleted the .json indices/metadata files. There were still issues at times, but it at least rechecked images if I just changed the ID of the image data folder.

As a last resort I used the CLI args for --delete_unwanted_images and problematic images. I only tried that once, but it wasn't with a "full reset"; I will update my post if it works with one (delete the image folder's 2 .JSON files, delete the cached SDXL .pt file, and swap the backend ID name). I never tried changing the name of the directory I used for the instance, so I will also try that this time, and use a low batch size, just in case.

And I don't know if this was erroneous or just unrelated, but yesterday I was receiving errors from the conversions that the helper helpers/multiaspect/image.py does with resolution area:


File "/Users/zack/.home/gitrepos/ComfyUI/SimpleTuner/helpers/metadata/backends/json.py", line 222, in _process_for_bucket

MultiaspectImage.prepare_image(

File "/Users/zack/.home/gitrepos/ComfyUI/SimpleTuner/helpers/multiaspect/image.py", line 93, in prepare_image

MultiaspectImage.calculate_new_size_by_pixel_area(

File "/Users/zack/.home/gitrepos/ComfyUI/SimpleTuner/helpers/multiaspect/image.py", line 414, in calculate_new_size_by_pixel_area

total_pixels = max(megapixels * 1e6, 1e6)

I fixed this by updating the beginning of the following function:

def calculate_new_size_by_pixel_area(image_width, image_height, megapixels):
    if isinstance(megapixels, (list, tuple)):
        # If megapixels is a sequence, take the mean of the values
        megapixels = sum(megapixels) / len(megapixels)

    total_pixels = max(megapixels * 1e6, 1e6)

....
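
As a self-contained illustration of that guard (not SimpleTuner code): a (min, max) megapixel range is reduced to its mean before the pixel budget is computed, and the budget is floored at 1.0 megapixel, so a value like the 1.032 MP mentioned above maps to roughly a million pixels.

# Standalone illustration of the guard above; not SimpleTuner's code.
def normalize_megapixels(megapixels):
    if isinstance(megapixels, (list, tuple)):
        # A (min, max) range is reduced to its mean before use.
        megapixels = sum(megapixels) / len(megapixels)
    # Enforce a floor of 1.0 megapixel on the pixel budget.
    return max(megapixels * 1e6, 1e6)

print(normalize_megapixels(1.032))       # 1032000.0
print(normalize_megapixels((0.8, 1.3)))  # 1050000.0 (mean of the range is 1.05 MP)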

But then it happened again, earlier on in that script, so I just applied the same fix at the top of the script. Oddly enough, I updated this morning (AFAIK it overwrote my changes), but the error didn't reappear.

Edit: I would update the resolution type from area to pixel, but I'm unsure what it needs to be with non-square dimensions... the longest edge? (so for me, 1344px)

elismasilva commented 6 months ago

> @elismasilva when you say you changed the .jpgs to .pngs -- do you mean just changing the extension (filepath) or that you converted the actual images?

Just the file extension, but that didn't solve it; I got the error again afterwards. Since the last update from master, though, I haven't gotten any errors.