bghira / SimpleTuner

A general fine-tuning kit geared toward diffusion models.
GNU Affero General Public License v3.0

`AttributeError: 'Validation' object has no attribute 'validation_negative_prompt_mask'` #482

Closed MohamedAliRashad closed 5 months ago

MohamedAliRashad commented 5 months ago

This is the full error

2024-06-18 01:59:36,288 [ERROR] (helpers.training.validation) Error gathering text embed for validation prompt : 'Validation' object has no attribute 'validation_negative_prompt_mask', traceback: Traceback (most recent call last):
  File "/shared_volume/development/text_to_image/SimpleTuner/helpers/training/validation.py", line 674, in validate_prompt
    extra_validation_kwargs.update(self._gather_prompt_embeds(prompt))
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/helpers/training/validation.py", line 371, in _gather_prompt_embeds
    prompt_embeds["negative_mask"] = self.validation_negative_prompt_mask
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Validation' object has no attribute 'validation_negative_prompt_mask'

cannot unpack non-iterable NoneType object
Traceback (most recent call last):
  File "/shared_volume/development/text_to_image/SimpleTuner/train_sdxl.py", line 2190, in <module>
    main()
  File "/shared_volume/development/text_to_image/SimpleTuner/train_sdxl.py", line 1944, in main
    validation.run_validations(validation_type="intermediary", step=step)
  File "/shared_volume/development/text_to_image/SimpleTuner/helpers/training/validation.py", line 416, in run_validations
    self.process_prompts()
  File "/shared_volume/development/text_to_image/SimpleTuner/helpers/training/validation.py", line 600, in process_prompts
    self.validate_prompt(prompt, shortname, validation_input_image)
  File "/shared_volume/development/text_to_image/SimpleTuner/helpers/training/validation.py", line 654, in validate_prompt
    validation_resolution_width, validation_resolution_height = resolution
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: cannot unpack non-iterable NoneType object

Epoch 1/137, Steps:   0%|                               | 100/30000 [05:32<27:38:01,  3.33s/it, lr=6.61e-7, step_loss=0.0915]
bghira commented 5 months ago

what are you doing that led to this condition?

MohamedAliRashad commented 5 months ago

@bghira Nothing. This is my multidatabackend.json

[
    {
      "id": "pseudo-camera-10k-sd3",
      "type": "local",
      "crop": true,
      "crop_aspect": "square",
      "crop_style": "center",
      "resolution": 0.5,
      "minimum_image_size": 0.25,
      "maximum_image_size": 1.0,
      "target_downsample_size": 1.0,
      "resolution_type": "area",
      "cache_dir_vae": "cache/vae/sd3/pseudo-camera-10k",
      "instance_data_dir": "outputs/datasets/pseudo-camera-10k",
      "disabled": false,
      "skip_file_discovery": "",
      "caption_strategy": "filename",
      "metadata_backend": "json"
    },
    {
      "id": "text-embeds",
      "type": "local",
      "dataset_type": "text_embeds",
      "default": true,
      "cache_dir": "cache/text/sd3/pseudo-camera-10k",
      "disabled": false,
      "write_batch_size": 128
    }
  ]
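
As a quick sanity check of this dataloader config (a standalone sketch, not part of SimpleTuner; the filename is assumed to be multidatabackend.json in the working directory), it parses as two backends: one image dataset using area-based resolution and one default text-embed cache:

import json

with open("multidatabackend.json") as f:
    backends = json.load(f)

for backend in backends:
    kind = backend.get("dataset_type", "image")
    print(backend["id"], kind, backend.get("resolution"), backend.get("resolution_type"))
# pseudo-camera-10k-sd3 image 0.5 area
# text-embeds text_embeds None None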

And this is my sdxl-env.sh

# Configure these values.

# 'lora' or 'full'
# lora - train a small network for a character or style, or both. quite versatile.
# full - requires lots of vram, trains very slowly, needs a lot of data and concepts.
export MODEL_TYPE='full'

# Set this to 'true' if you are training a Stable Diffusion 3 checkpoint.
# Use MODEL_NAME="stabilityai/stable-diffusion-3-medium-diffusers"
export STABLE_DIFFUSION_3=true
# Similarly, this is to train PixArt Sigma (1K or 2K) models.
# Use MODEL_NAME="PixArt-alpha/PixArt-Sigma-XL-2-1024-MS"
export PIXART_SIGMA=false

# ControlNet model training is only supported when MODEL_TYPE='full'
# See this document for more information: https://github.com/bghira/SimpleTuner/blob/main/documentation/CONTROLNET.md
# DeepFloyd, PixArt, and SD3 do not currently support ControlNet model training.
export CONTROLNET=false

# DoRA enhances the training style of LoRA, but it will run more slowly at the same rank.
# See: https://arxiv.org/abs/2402.09353
# See: https://github.com/huggingface/peft/pull/1474
export USE_DORA=false

# BitFit freeze strategy for the u-net causes everything but the biases to be frozen.
# This may help retain the full model's underlying capabilities. LoRA is currently not tested/known to work.
if [[ "$MODEL_TYPE" == "full" ]]; then
    # When training a full model, we will rely on BitFit to keep the u-net intact.
    export USE_BITFIT=true
elif [[ "$MODEL_TYPE" == "lora" ]]; then
    # As of v0.9.2 of SimpleTuner, LoRA can not use BitFit.
    export USE_BITFIT=false
elif [[ "$MODEL_TYPE" == "deepfloyd-full" ]]; then
    export USE_BITFIT=true
fi

# Restart where we left off. Change this to "checkpoint-1234" to start from a specific checkpoint.
export RESUME_CHECKPOINT="latest"

# How often to checkpoint. Depending on your learning rate, you may wish to change this.
# For the default settings with 10 gradient accumulations, more frequent checkpoints might be preferable at first.
export CHECKPOINTING_STEPS=150
# This is how many checkpoints we will keep. Two is safe, but three is safer.
export CHECKPOINTING_LIMIT=2

# This is decided as a relatively conservative 'constant' learning rate.
# Adjust higher or lower depending on how burnt your model becomes.
export LEARNING_RATE=8e-7 #@param {type:"number"}

# Using a Huggingface Hub model:
export MODEL_NAME="stabilityai/stable-diffusion-3-medium-diffusers"
# Using a local path to a huggingface hub model or saved checkpoint:
#export MODEL_NAME="/datasets/models/pipeline"

# Make DEBUG_EXTRA_ARGS empty to disable wandb.
export DEBUG_EXTRA_ARGS="--report_to=tensorboard"
export TRACKER_PROJECT_NAME="sd3-dummy-training"
export TRACKER_RUN_NAME="simpletuner-sdxl"

# Max number of steps OR epochs can be used. Not both.
export MAX_NUM_STEPS=30000
# Will likely overtrain, but that's fine.
export NUM_EPOCHS=0

# A convenient prefix for all of your training paths.
export BASE_DIR="/shared_volume/development/text_to_image/SimpleTuner/outputs"
export DATALOADER_CONFIG="${BASE_DIR}/multidatabackend.json"
export OUTPUT_DIR="${BASE_DIR}/models"
# Set this to "true" to push your model to Hugging Face Hub.
export PUSH_TO_HUB="false"
# If PUSH_TO_HUB and PUSH_CHECKPOINTS are both enabled, every saved checkpoint will be pushed to Hugging Face Hub.
export PUSH_CHECKPOINTS="false"
# This will be the model name for your final hub upload, eg. "yourusername/yourmodelname"
# It defaults to the wandb project name, but you can override this here.
export HUB_MODEL_NAME=$TRACKER_PROJECT_NAME

# By default, images will be resized so their SMALLER EDGE is 1024 pixels, maintaining aspect ratio.
# Setting this value to 768px might result in more reasonable training data sizes for SDXL.
# export RESOLUTION=1024
# export RESOLUTION_TYPE="pixel"
# If you want to have the training data resized by pixel area (Megapixels) rather than edge length,
#  set this value to "area" instead of "pixel", and uncomment the next RESOLUTION declaration.
export RESOLUTION=1.0          # 1.0 Megapixel training sizes
export RESOLUTION_TYPE="area"
# If RESOLUTION_TYPE="pixel", the minimum resolution specifies the smaller edge length, measured in pixels. Recommended: 1024.
# If RESOLUTION_TYPE="area", the minimum resolution specifies the total image area, measured in megapixels. Recommended: 1.
export MINIMUM_RESOLUTION=$RESOLUTION

# How many decimals to round aspect buckets to.
#export ASPECT_BUCKET_ROUNDING=2

# Use this to append an instance prompt to each caption, used for adding trigger words.
# This has not been tested in SDXL.
#export INSTANCE_PROMPT="lotr style "
# If you also supply a user prompt library or `--use_prompt_library`, this will be added to those lists.
export VALIDATION_PROMPT="ethnographic photography of teddy bear at a picnic"
export VALIDATION_GUIDANCE=7.5
# You'll want to set this to 0.7 if you are training a terminal SNR model.
export VALIDATION_GUIDANCE_RESCALE=0.0
# How frequently we will save and run a pipeline for validations.
export VALIDATION_STEPS=100
export VALIDATION_NUM_INFERENCE_STEPS=30
export VALIDATION_NEGATIVE_PROMPT="blurry, cropped, ugly"
export VALIDATION_SEED=42
export VALIDATION_RESOLUTION=$RESOLUTION

# Adjust this for your GPU memory size. This, and resolution, are the biggest VRAM killers.
export TRAIN_BATCH_SIZE=8
# Accumulate your update gradient over many steps, to save VRAM while still having higher effective batch size:
# effective batch size = ($TRAIN_BATCH_SIZE * $GRADIENT_ACCUMULATION_STEPS).
export GRADIENT_ACCUMULATION_STEPS=4

# Use any standard scheduler type. constant, polynomial, constant_with_warmup
export LR_SCHEDULE="sine"
# A warmup period allows the model and the EMA weights more importantly to familiarise itself with the current quanta.
# For the cosine or sine type schedules, the warmup period defines the interval between peaks or valleys.
# Use a sine schedule to simulate a warmup period, or a Cosine period to simulate a polynomial start.
#export LR_WARMUP_STEPS=$((MAX_NUM_STEPS / 10))
export LR_WARMUP_STEPS=1000

# Caption dropout probability. Set to 0.1 for 10% of captions dropped out. Set to 0 to disable.
# You may wish to disable dropout if you want to limit your changes strictly to the prompts you show the model.
# You may wish to increase the rate of dropout if you want to more broadly adopt your changes across the model.
export CAPTION_DROPOUT_PROBABILITY=0.1

export METADATA_UPDATE_INTERVAL=65
export VAE_BATCH_SIZE=12

# If this is set, any images that fail to open will be DELETED to avoid re-checking them every time.
export DELETE_ERRORED_IMAGES=0
# If this is set, any images that are too small for the minimum resolution size will be DELETED.
export DELETE_SMALL_IMAGES=0

# Bytedance recommends these be set to "trailing" so that inference and training behave in a more congruent manner.
# To follow the original SDXL training strategy, use "leading" instead, though results are generally worse.
export TRAINING_SCHEDULER_TIMESTEP_SPACING="trailing"
export INFERENCE_SCHEDULER_TIMESTEP_SPACING="trailing"

# Removing this option or unsetting it uses vanilla training. Setting it reweights the loss by the position of the timestep in the noise schedule.
# A value "5" is recommended by the researchers. A value of "20" is the least impact, and "1" is the most impact.
export MIN_SNR_GAMMA=5

# Set this to an explicit value of "false" to disable Xformers. Probably required for AMD users.
export USE_XFORMERS=false

# There's basically no reason to unset this. However, to disable it, use an explicit value of "false".
# This will save a lot of memory consumption when enabled.
export USE_GRADIENT_CHECKPOINTING=false

##
# Options below here may require a bit more complicated configuration, so they are not simple variables.
##

# TF32 is great on Ampere or Ada, not sure about earlier generations.
export ALLOW_TF32=true
# AdamW 8Bit is a robust and lightweight choice. Adafactor might reduce memory consumption, and Dadaptation is slow and experimental.
# AdamW is the default optimizer, but it uses a lot of memory and is slower than AdamW8Bit or Adafactor.
# Choices: adamw, adamw8bit, adafactor, dadaptation
export OPTIMIZER="adamw_bf16"

# EMA is a strong regularisation method that uses a lot of extra VRAM to hold two copies of the weights.
# This is worthwhile on large training runs, but not so much for smaller training runs.
export USE_EMA=false
export EMA_DECAY=0.999

export TRAINER_EXTRA_ARGS=""
## For offset noise training:
# Not recommended for terminal SNR models.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --offset_noise --noise_offset=0.02"

## For terminal SNR training:
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --prediction_type=v_prediction --rescale_betas_zero_snr"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --training_scheduler_timestep_spacing=trailing --inference_scheduler_timestep_spacing=trailing"
## You may benefit from directing training toward a specific weighted subset of timesteps.
# In this example, we train the final 25% of the timestep schedule with a 3x bias.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --timestep_bias_strategy=later --timestep_bias_portion=0.25 --timestep_bias_multiplier=3"
# In this example, we train the earliest 25% of the timestep schedule with a 5x bias.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --timestep_bias_strategy=earlier --timestep_bias_portion=0.25 --timestep_bias_multiplier=5"
# Here, we designate that specifically, timesteps 200 to 500 should be prioritised.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --timestep_bias_strategy=range --timestep_bias_begin=200 --timestep_bias_end=500 --timestep_bias_multiplier=3"

## For experimental min-SNR weighted loss training (5 is suggested value by the original researchers):
# Not recommended for terminal SNR models.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --snr_gamma=5.0"

# For Wasabi S3 filesystem backend (experimental)
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --data_backend=aws --aws_bucket_name=test123"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --aws_endpoint_url=https://s3.wasabisys.com"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --aws_access_key=1234567890"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --aws_secret_access_key=0987654321"

# Reproducible training. Set to -1 to disable.
export TRAINING_SEED=42

# Mixed precision is the best. You honestly might need to YOLO it in fp16 mode for Google Colab type setups.
export MIXED_PRECISION="bf16"                # Might not be supported on all GPUs. fp32 will be needed for others.
export PURE_BF16=true

# This has to be changed if you're training with multiple GPUs.
export TRAINING_NUM_PROCESSES=2
export TRAINING_NUM_MACHINES=1
export ACCELERATE_EXTRA_ARGS="--multi_gpu"                         # --multi_gpu or other similar flags for huggingface accelerate

# With Pytorch 2.1, you might have pretty good luck here.
# If you're using aspect bucketing however, each resolution change will recompile. Seriously, just don't do it.
# Well, then again... Pytorch 2.2 has support for dynamic shapes. Why not?
export TRAINING_DYNAMO_BACKEND='no'                # or 'no' if you want to disable torch compile in case of performance issues or lack of support (eg. AMD)

export TOKENIZERS_PARALLELISM=false

Also, while you are here, I am getting this error when trying to use jsonl:

2024-06-18 04:03:44,825 [ERROR] (__main__) Could not open Parquet input source '<Buffer>': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file., traceback: Traceback (most recent call last):
  File "/shared_volume/development/text_to_image/SimpleTuner/train_sdxl.py", line 697, in main
    configure_multi_databackend(
  File "/shared_volume/development/text_to_image/SimpleTuner/helpers/data_backend/factory.py", line 629, in configure_multi_databackend
    configure_parquet_database(backend, args, init_backend["data_backend"])
  File "/shared_volume/development/text_to_image/SimpleTuner/helpers/data_backend/factory.py", line 195, in configure_parquet_database
    df = pd.read_parquet(pq)
         ^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/pandas/io/parquet.py", line 667, in read_parquet
    return impl.read(
           ^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/pandas/io/parquet.py", line 274, in read
    pa_table = self.api.parquet.read_table(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 1776, in read_table
    dataset = ParquetDataset(
              ^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 1343, in __init__
    [fragment], schema=schema or fragment.physical_schema,
                                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 1367, in pyarrow._dataset.Fragment.physical_schema.__get__
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not open Parquet input source '<Buffer>': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

If you can help me with it ^^
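
On the Parquet error: the traceback shows the metadata backend is opened with pandas.read_parquet, so it cannot ingest a .jsonl file directly. If the captions really are in JSON-Lines format, converting them first should sidestep that error; a minimal sketch, with illustrative file and column names:

import pandas as pd

# Read a JSON-Lines caption file and rewrite it as Parquet so that
# pd.read_parquet() (and therefore the parquet metadata backend) can open it.
df = pd.read_json("captions.jsonl", lines=True)
df.to_parquet("captions.parquet", index=False)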

MohamedAliRashad commented 5 months ago

@bghira I think I know where the problem is.

In helpers/training/validation.py you expect the validation resolution to be an integer (e.g. 1024, 768), not a float used with resolution_type="area" ... that is what caused the issue.

You may want to make this function support the area resolution_type: https://github.com/bghira/SimpleTuner/blob/239c1ce0da496126bfd56de6576103fc37022266/helpers/training/validation.py#L69
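
For illustration, a helper along these lines could cover both cases (a hypothetical sketch only, not the actual SimpleTuner code; it reuses the megapixel convention from the helpers/arguments.py snippet quoted further down):

def parse_validation_resolution(value, resolution_type="pixel"):
    """Return (width, height) from a validation resolution setting that may be
    an edge length, a WIDTHxHEIGHT string, or a megapixel area."""
    if resolution_type == "area":
        # 1 megapixel maps to a 1024px square; other values scale by 1e3
        # and are aligned to a multiple of 8.
        mp = float(value)
        edge = 1024 if mp == 1 else (int(mp * 1e3) // 8) * 8
        return edge, edge
    if isinstance(value, str) and "x" in value:
        width, height = value.split("x")
        return int(width), int(height)
    edge = int(value)
    return edge, edge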

MohamedAliRashad commented 5 months ago

@bghira Regarding the other problem (the title of the issue): https://github.com/bghira/SimpleTuner/blob/239c1ce0da496126bfd56de6576103fc37022266/helpers/training/validation.py#L371

Here, you use self.validation_negative_prompt_mask, and it only gets set here: https://github.com/bghira/SimpleTuner/blob/239c1ce0da496126bfd56de6576103fc37022266/helpers/training/validation.py#L309

which means it only gets a value if pixart_sigma is enabled and the inner if condition is met.
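
One defensive way to avoid that AttributeError would be to attach the mask only when it was actually precomputed, e.g. via getattr with a default (a sketch, not the committed fix):

# Inside _gather_prompt_embeds(): only forward the negative mask when it exists,
# since it is currently only produced on the PixArt Sigma path.
negative_mask = getattr(self, "validation_negative_prompt_mask", None)
if negative_mask is not None:
    prompt_embeds["negative_mask"] = negative_mask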

bghira commented 5 months ago

oh, you are saying that it returns no resolution when VALIDATION_RESOLUTION=1.0?

MohamedAliRashad commented 5 months ago

@bghira Yes

bghira commented 5 months ago

in helpers/arguments.py we see the following:

    if (
        args.validation_resolution.isdigit()
        and int(args.validation_resolution) < 128
        and "deepfloyd" not in args.model_type
    ):
        # Convert from megapixels to pixels:
        log_msg = f"It seems that --validation_resolution was given in megapixels ({args.validation_resolution}). Converting to pixel measurement:"
        if int(args.validation_resolution) == 1:
            args.validation_resolution = 1024
        else:
            args.validation_resolution = int(int(args.validation_resolution) * 1e3)
            # Make it divisible by 8:
            args.validation_resolution = int(int(args.validation_resolution) / 8) * 8
        logger.info(f"{log_msg} {int(args.validation_resolution)}px")

do you see that message in your console output at all? you can check debug.log which will contain the logs from your previous run
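
For reference, the reason that message never appears with VALIDATION_RESOLUTION=1.0 is that str.isdigit() rejects float strings, so the conversion branch above is skipped entirely:

>>> "1024".isdigit()
True
>>> "1.0".isdigit()   # float strings fail the check
False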

MohamedAliRashad commented 5 months ago

@bghira I can't find it. The trainer keeps continuing from my last checkpoint, and I don't think my change of resolution and resolution_type is making any difference.

bghira commented 5 months ago

i guess when i wrote that initially, i lazily assumed that floats would work with isdigit() - i've now added a correction so it detects either an integer or a float value and handles the conversion automatically. can you check that?
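
A minimal sketch of what such a check could look like (illustrative only, not necessarily the code that was pushed):

def is_resolution_in_megapixels(value) -> bool:
    """Accept "1", "1.0", "0.5", etc. as megapixel values, while large
    integer strings such as "1024" are still treated as pixel edges."""
    try:
        return float(value) < 128
    except (TypeError, ValueError):
        return False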

bghira commented 5 months ago

because the vae cache is such a volatile creature when you change fundamental things like resolution / resolution_type, these are prevented from being changed after the dataset's initial processing.

see the log output during startup where it mentions --override_dataset_config

of course if you change these you'll have to remove the vae cache objects, the aspect bucket json files from the dataset.

i am not sure if you'll have to remove the aspect ratio mapping files from the output dir; every time i think about it, i come to the same conclusion that it's a good thing to just leave that consistent between runs.
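
If you do change resolution / resolution_type, the cleanup roughly amounts to removing the stale VAE latents and the aspect bucket index. A rough sketch using the paths from this thread (illustrative, not an official SimpleTuner utility):

import shutil
from pathlib import Path

# Paths taken from the multidatabackend.json and startup logs in this thread; adjust to your setup.
vae_cache = Path("cache/vae/sd3/pseudo-camera-10k")
bucket_index = Path("outputs/datasets/pseudo-camera-10k/aspect_ratio_bucket_indices.json")

if vae_cache.exists():
    shutil.rmtree(vae_cache)   # latents were produced at the old resolution
if bucket_index.exists():
    bucket_index.unlink()      # force the aspect buckets to be rebuilt on the next run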

MohamedAliRashad commented 5 months ago

@bghira I agree with you on this.

I get this message when I continue from the last checkpoint after changing the resolution to 1.0: Validation resolution is not supported for this model type.

I will try your fix now

MohamedAliRashad commented 5 months ago

@bghira Your fix worked. What is left now is the validation_negative_prompt_mask bug.

bghira commented 5 months ago

i've pushed one for that too now

MohamedAliRashad commented 5 months ago

@bghira The old error disappeared, but a new error emerged:

TypeError: StableDiffusion3Pipeline.__call__() got an unexpected keyword argument 'prompt_mask'

bghira commented 5 months ago

ah. okay. some of the pipelines are so much more sensitive to extra kwargs 🤦

that is now fixed.
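
A common defensive pattern for this class of problem is to drop any kwargs the loaded pipeline does not accept before calling it (a sketch only, not necessarily how the pushed fix works; pipeline and pipeline_kwargs follow the names in the traceback above):

import inspect

# Keep only keyword arguments that this pipeline's __call__ actually accepts,
# so model-specific extras such as prompt_mask never reach StableDiffusion3Pipeline.
accepted = set(inspect.signature(pipeline.__call__).parameters)
pipeline_kwargs = {k: v for k, v in pipeline_kwargs.items() if k in accepted}
images = pipeline(**pipeline_kwargs).images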

MohamedAliRashad commented 5 months ago

@bghira New error:

  File "/shared_volume/development/text_to_image/SimpleTuner/helpers/training/validation.py", line 719, in validate_prompt
    validation_image_results = self.pipeline(**pipeline_kwargs).images
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/helpers/sd3/pipeline.py", line 928, in __call__
    noise_pred = self.transformer(
                 ^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 822, in forward
    return model_forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 810, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/diffusers/models/transformers/transformer_sd3.py", line 292, in forward
    hidden_states = self.pos_embed(hidden_states)  # takes care of adding positional embeddings too.
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/diffusers/models/embeddings.py", line 208, in forward
    latent = self.proj(latent)
             ^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Input type (CPUBFloat16Type) and weight type (CUDABFloat16Type) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
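
The traceback indicates the transformer weights live on CUDA while the tensor fed into its first conv is still on the CPU, so any precomputed tensors handed to the pipeline need to be moved first. A hedged sketch of that general pattern (variable names follow the traceback; this is not necessarily the eventual fix):

import torch

# The transformer is on the GPU, so every tensor passed into the pipeline
# (prompt embeds, masks, latents, ...) has to live on the same device.
device = next(pipeline.transformer.parameters()).device
pipeline_kwargs = {
    k: (v.to(device) if isinstance(v, torch.Tensor) else v)
    for k, v in pipeline_kwargs.items()
}
validation_image_results = pipeline(**pipeline_kwargs).images
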
bghira commented 5 months ago

do you have the debug logs for that section?

MohamedAliRashad commented 5 months ago

This is the full debug.log file

2024-06-18 17:38:02,148 [WARNING] (ArgsParser) Stable Diffusion 3 requires a pixel alignment interval of 64px. Updating value.
2024-06-18 17:38:02,149 [INFO] (ArgsParser) It seems that --validation_resolution was given in megapixels (1.0). Converting to pixel measurement: 1024px
2024-06-18 17:38:02,149 [WARNING] (ArgsParser) Disabling Compel long-prompt weighting for SD3 inference, as it does not support Stable Diffusion 3.
2024-06-18 17:38:02,186 [INFO] (__main__) Enabling tf32 precision boost for NVIDIA devices due to --allow_tf32.
2024-06-18 17:38:02,187 [INFO] (__main__) Load tokenizers
2024-06-18 17:38:03,277 [INFO] (__main__) Load OpenAI CLIP-L/14 text encoder..
2024-06-18 17:38:03,795 [INFO] (__main__) Loading T5-XXL v1.1 text encoder from stabilityai/stable-diffusion-3-medium-diffusers/text_encoder..
2024-06-18 17:38:04,362 [INFO] (__main__) Loading LAION OpenCLIP-G/14 text encoder..
2024-06-18 17:38:06,202 [INFO] (__main__) Loading T5-XXL v1.1 text encoder..
2024-06-18 17:38:16,776 [INFO] (__main__) Load VAE..
2024-06-18 17:38:17,204 [INFO] (__main__) Moving models to GPU. Almost there.
2024-06-18 17:38:27,819 [INFO] (__main__) Loading Stable Diffusion 3 diffusion transformer..
2024-06-18 17:38:33,875 [INFO] (__main__) Moving the diffusion transformer to GPU in torch.bfloat16 precision.
2024-06-18 17:38:37,026 [INFO] (__main__) Initialising VAE in bf16 precision, you may specify a different value if preferred: bf16, fp32, default
2024-06-18 17:38:37,114 [INFO] (__main__) Loaded VAE into VRAM.
2024-06-18 17:38:37,114 [INFO] (DataBackendFactory) Loading data backend config from /shared_volume/development/text_to_image/SimpleTuner/outputs/multidatabackend.json
2024-06-18 17:38:37,115 [INFO] (DataBackendFactory) Configuring text embed backend: text-embeds
2024-06-18 17:38:37,116 [INFO] (TextEmbeddingCache) (Rank: 0) (id=text-embeds) Listing all text embed cache entries
2024-06-18 17:38:37,366 [INFO] (TextEmbeddingCache) (Rank: 1) (id=text-embeds) Listing all text embed cache entries
2024-06-18 17:38:38,103 [INFO] (DataBackendFactory) Pre-computing null embedding
2024-06-18 17:38:43,456 [INFO] (DataBackendFactory) Completed loading text embed services.
2024-06-18 17:38:43,456 [INFO] (DataBackendFactory) Configuring data backend: pseudo-camera-10k-sd3
2024-06-18 17:38:43,457 [INFO] (DataBackendFactory) Configured backend: {'id': 'pseudo-camera-10k-sd3', 'config': {'crop': True, 'crop_aspect': 'square', 'crop_aspect_buckets': None, 'crop_style': 'center', 'disable_validation': False, 'resolution': 0.5, 'resolution_type': 'area', 'caption_strategy': 'filename', 'instance_data_root': 'outputs/datasets/pseudo-camera-10k', 'maximum_image_size': 1.0, 'target_downsample_size': 1.0}, 'dataset_type': 'image'}
2024-06-18 17:38:43,457 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-sd3) Loading bucket manager.
2024-06-18 17:38:43,459 [INFO] (JsonMetadataBackend) Checking for cache file: outputs/datasets/pseudo-camera-10k/aspect_ratio_bucket_indices.json
2024-06-18 17:38:43,459 [INFO] (JsonMetadataBackend) Checking for cache file: outputs/datasets/pseudo-camera-10k/aspect_ratio_bucket_indices.json
2024-06-18 17:38:43,460 [INFO] (JsonMetadataBackend) Pulling cache file from storage
2024-06-18 17:38:43,460 [INFO] (JsonMetadataBackend) Pulling cache file from storage
2024-06-18 17:38:43,477 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-sd3) Refreshing aspect buckets on main process.
2024-06-18 17:38:43,477 [INFO] (BaseMetadataBackend) Discovering new files...
2024-06-18 17:38:52,186 [INFO] (BaseMetadataBackend) Compressed 14102 existing files from 1.
2024-06-18 17:38:52,186 [INFO] (BaseMetadataBackend) No new files discovered. Doing nothing.
2024-06-18 17:38:52,186 [INFO] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 14102, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-06-18 17:38:52,238 [INFO] (JsonMetadataBackend) Checking for cache file: outputs/datasets/pseudo-camera-10k/aspect_ratio_bucket_indices.json
2024-06-18 17:38:52,238 [INFO] (JsonMetadataBackend) Pulling cache file from storage
2024-06-18 17:38:52,244 [INFO] (DataBackendFactory) Configured backend: {'id': 'pseudo-camera-10k-sd3', 'config': {'crop': True, 'crop_aspect': 'square', 'crop_aspect_buckets': None, 'crop_style': 'center', 'disable_validation': False, 'resolution': 0.5, 'resolution_type': 'area', 'caption_strategy': 'filename', 'instance_data_root': 'outputs/datasets/pseudo-camera-10k', 'maximum_image_size': 1.0, 'target_downsample_size': 1.0}, 'dataset_type': 'image', 'data_backend': <helpers.data_backend.local.LocalDataBackend object at 0x7f0d82e3a4d0>, 'instance_data_root': 'outputs/datasets/pseudo-camera-10k', 'metadata_backend': <helpers.metadata.backends.json.JsonMetadataBackend object at 0x7f0d82e3a610>}
2024-06-18 17:38:52,246 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-sd3) Collecting captions.
2024-06-18 17:38:52,313 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-sd3) Initialise text embed pre-computation using the filename caption strategy. We have 14102 captions to process.
2024-06-18 17:38:53,299 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-sd3) Completed processing 14102 captions.
2024-06-18 17:38:53,300 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-sd3) Creating VAE latent cache.
2024-06-18 17:38:54,369 [INFO] (validation) Precomputing the negative prompt embed for validations.
2024-06-18 17:38:55,208 [INFO] (validation) Precomputing the negative prompt embed for validations.
2024-06-18 17:38:55,356 [INFO] (__main__) Unloading text encoders, as they are not being trained.
2024-06-18 17:38:55,497 [INFO] (__main__) After nuking text encoders from orbit, we freed 0.0 GB of VRAM. The real memories were the friends we trained a model on along the way.
2024-06-18 17:38:55,498 [INFO] (__main__) Collected the following data backends: ['text-embeds', 'pseudo-camera-10k-sd3']
2024-06-18 17:38:55,498 [INFO] (__main__) Loading sine learning rate scheduler with 1000 warmup steps
2024-06-18 17:38:55,499 [WARNING] (__main__) Training Diffusion transformer models with BitFit is not yet tested, and unexpected results may occur.
2024-06-18 17:38:55,501 [INFO] (__main__) Learning rate: 8e-07
2024-06-18 17:38:55,501 [INFO] (__main__) Using bf16 AdamW optimizer with stochastic rounding.
2024-06-18 17:38:55,507 [INFO] (__main__) Optimizer arguments, weight_decay=0.01 eps=1e-08, extra_arguments={'weight_decay': 0.01, 'eps': 1e-08, 'betas': (0.9, 0.999), 'lr': 8e-07}
2024-06-18 17:38:55,509 [INFO] (__main__) Loading sine learning rate scheduler with 1000 warmup steps
2024-06-18 17:38:55,509 [INFO] (__main__) Using Sine learning rate scheduler.
2024-06-18 17:38:55,512 [INFO] (__main__) Loading our accelerator...
2024-06-18 17:38:55,564 [INFO] (__main__) After removing any undesired samples and updating cache entries, we have settled on 137 epochs and 220 steps per epoch.
2024-06-18 17:38:55,633 [INFO] (__main__) Resuming from checkpoint checkpoint-3000
2024-06-18 17:38:55,633 [INFO] (SDXLSaveHook) Unloading text encoders for full SD3 training without --train_text_encoder
2024-06-18 17:38:55,633 [INFO] (SDXLSaveHook) Unloading text encoders for full SD3 training without --train_text_encoder
2024-06-18 17:39:01,692 [INFO] (MultiAspectSampler-pseudo-camera-10k-sd3) Previous checkpoint had 0 exhausted buckets.
2024-06-18 17:39:01,693 [INFO] (MultiAspectSampler-pseudo-camera-10k-sd3) Previous checkpoint was on epoch 14.
2024-06-18 17:39:01,693 [INFO] (MultiAspectSampler-pseudo-camera-10k-sd3) Previous checkpoint had 4480 seen images.
2024-06-18 17:39:01,694 [INFO] (__main__) Resuming from global_step 3000.
2024-06-18 17:39:01,696 [INFO] (MultiAspectSampler-pseudo-camera-10k-sd3) 
(Rank: 0)     -> Number of seen images: 4480
(Rank: 0)     -> Number of unseen images: 2560
(Rank: 0)     -> Current Bucket: None
(Rank: 0)     -> 1 Buckets: ['1.0']
(Rank: 0)     -> 0 Exhausted Buckets: []
2024-06-18 17:39:01,912 [INFO] (__main__) 
***** Running training *****
-  Num batches = 880
-  Num Epochs = 137
  - Current Epoch = 14
-  Total train batch size (w. parallel, distributed & accumulation) = 64
  - Instantaneous batch size per device = 8
  - Gradient Accumulation steps = 4
-  Total optimization steps = 30000
  - Steps completed: 3000
-  Total optimization steps remaining = 27000
2024-06-18 17:44:26,140 [ERROR] (helpers.training.validation) Error generating validation image: Input type (CPUBFloat16Type) and weight type (CUDABFloat16Type) should be the same or input should be a MKLDNN tensor and weight is a dense tensor, Traceback (most recent call last):
  File "/shared_volume/development/text_to_image/SimpleTuner/helpers/training/validation.py", line 719, in validate_prompt
    validation_image_results = self.pipeline(**pipeline_kwargs).images
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/helpers/sd3/pipeline.py", line 928, in __call__
    noise_pred = self.transformer(
                 ^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 822, in forward
    return model_forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 810, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/diffusers/models/transformers/transformer_sd3.py", line 292, in forward
    hidden_states = self.pos_embed(hidden_states)  # takes care of adding positional embeddings too.
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/diffusers/models/embeddings.py", line 208, in forward
    latent = self.proj(latent)
             ^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Input type (CPUBFloat16Type) and weight type (CUDABFloat16Type) should be the same or input should be a MKLDNN tensor and weight is a dense tensor

2024-06-18 17:44:26,175 [ERROR] (helpers.training.validation) Error generating validation image: Input type (CPUBFloat16Type) and weight type (CUDABFloat16Type) should be the same or input should be a MKLDNN tensor and weight is a dense tensor, Traceback (most recent call last):
  File "/shared_volume/development/text_to_image/SimpleTuner/helpers/training/validation.py", line 719, in validate_prompt
    validation_image_results = self.pipeline(**pipeline_kwargs).images
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/helpers/sd3/pipeline.py", line 928, in __call__
    noise_pred = self.transformer(
                 ^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 822, in forward
    return model_forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 810, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/diffusers/models/transformers/transformer_sd3.py", line 292, in forward
    hidden_states = self.pos_embed(hidden_states)  # takes care of adding positional embeddings too.
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/diffusers/models/embeddings.py", line 208, in forward
    latent = self.proj(latent)
             ^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Input type (CPUBFloat16Type) and weight type (CUDABFloat16Type) should be the same or input should be a MKLDNN tensor and weight is a dense tensor

2024-06-18 17:51:24,680 [ERROR] (helpers.training.validation) Error generating validation image: Input type (CPUBFloat16Type) and weight type (CUDABFloat16Type) should be the same or input should be a MKLDNN tensor and weight is a dense tensor, Traceback (most recent call last):
  File "/shared_volume/development/text_to_image/SimpleTuner/helpers/training/validation.py", line 719, in validate_prompt
    validation_image_results = self.pipeline(**pipeline_kwargs).images
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/helpers/sd3/pipeline.py", line 928, in __call__
    noise_pred = self.transformer(
                 ^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 822, in forward
    return model_forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 810, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/diffusers/models/transformers/transformer_sd3.py", line 292, in forward
    hidden_states = self.pos_embed(hidden_states)  # takes care of adding positional embeddings too.
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/diffusers/models/embeddings.py", line 208, in forward
    latent = self.proj(latent)
             ^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Input type (CPUBFloat16Type) and weight type (CUDABFloat16Type) should be the same or input should be a MKLDNN tensor and weight is a dense tensor

2024-06-18 17:51:24,718 [ERROR] (helpers.training.validation) Error generating validation image: Input type (CPUBFloat16Type) and weight type (CUDABFloat16Type) should be the same or input should be a MKLDNN tensor and weight is a dense tensor, Traceback (most recent call last):
  File "/shared_volume/development/text_to_image/SimpleTuner/helpers/training/validation.py", line 719, in validate_prompt
    validation_image_results = self.pipeline(**pipeline_kwargs).images
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/helpers/sd3/pipeline.py", line 928, in __call__
    noise_pred = self.transformer(
                 ^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 822, in forward
    return model_forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 810, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/diffusers/models/transformers/transformer_sd3.py", line 292, in forward
    hidden_states = self.pos_embed(hidden_states)  # takes care of adding positional embeddings too.
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/diffusers/models/embeddings.py", line 208, in forward
    latent = self.proj(latent)
             ^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Input type (CPUBFloat16Type) and weight type (CUDABFloat16Type) should be the same or input should be a MKLDNN tensor and weight is a dense tensor

MohamedAliRashad commented 5 months ago

@bghira

bghira commented 5 months ago

if you set SIMPLETUNER_LOG_LEVEL=DEBUG in the env file we will see more

MohamedAliRashad commented 5 months ago

@bghira Added it and no further information was printed. This is the full sdxl-env.sh:

# Configure these values.

# 'lora' or 'full'
# lora - train a small network for a character or style, or both. quite versatile.
# full - requires lots of vram, trains very slowly, needs a lot of data and concepts.
export MODEL_TYPE='full'

# Set this to 'true' if you are training a Stable Diffusion 3 checkpoint.
# Use MODEL_NAME="stabilityai/stable-diffusion-3-medium-diffusers"
export STABLE_DIFFUSION_3=true
# Similarly, this is to train PixArt Sigma (1K or 2K) models.
# Use MODEL_NAME="PixArt-alpha/PixArt-Sigma-XL-2-1024-MS"
export PIXART_SIGMA=false

# ControlNet model training is only supported when MODEL_TYPE='full'
# See this document for more information: https://github.com/bghira/SimpleTuner/blob/main/documentation/CONTROLNET.md
# DeepFloyd, PixArt, and SD3 do not currently support ControlNet model training.
export CONTROLNET=false

# DoRA enhances the training style of LoRA, but it will run more slowly at the same rank.
# See: https://arxiv.org/abs/2402.09353
# See: https://github.com/huggingface/peft/pull/1474
export USE_DORA=false

# BitFit freeze strategy for the u-net causes everything but the biases to be frozen.
# This may help retain the full model's underlying capabilities. LoRA is currently not tested/known to work.
if [[ "$MODEL_TYPE" == "full" ]]; then
    # When training a full model, we will rely on BitFit to keep the u-net intact.
    export USE_BITFIT=true
elif [[ "$MODEL_TYPE" == "lora" ]]; then
    # As of v0.9.2 of SimpleTuner, LoRA can not use BitFit.
    export USE_BITFIT=false
elif [[ "$MODEL_TYPE" == "deepfloyd-full" ]]; then
    export USE_BITFIT=true
fi

# Restart where we left off. Change this to "checkpoint-1234" to start from a specific checkpoint.
export RESUME_CHECKPOINT="latest"

# How often to checkpoint. Depending on your learning rate, you may wish to change this.
# For the default settings with 10 gradient accumulations, more frequent checkpoints might be preferable at first.
export CHECKPOINTING_STEPS=150
# This is how many checkpoints we will keep. Two is safe, but three is safer.
export CHECKPOINTING_LIMIT=2

# This is decided as a relatively conservative 'constant' learning rate.
# Adjust higher or lower depending on how burnt your model becomes.
export LEARNING_RATE=8e-7 #@param {type:"number"}

# Using a Huggingface Hub model:
export MODEL_NAME="stabilityai/stable-diffusion-3-medium-diffusers"
# Using a local path to a huggingface hub model or saved checkpoint:
#export MODEL_NAME="/datasets/models/pipeline"

# Make DEBUG_EXTRA_ARGS empty to disable wandb.
export DEBUG_EXTRA_ARGS="--report_to=tensorboard"
export TRACKER_PROJECT_NAME="sd3-dummy-training"
export TRACKER_RUN_NAME="simpletuner-sdxl"

# Max number of steps OR epochs can be used. Not both.
export MAX_NUM_STEPS=30000
# Will likely overtrain, but that's fine.
export NUM_EPOCHS=0

# A convenient prefix for all of your training paths.
export BASE_DIR="/shared_volume/development/text_to_image/SimpleTuner/outputs"
export DATALOADER_CONFIG="${BASE_DIR}/multidatabackend.json"
export OUTPUT_DIR="${BASE_DIR}/models"
# Set this to "true" to push your model to Hugging Face Hub.
export PUSH_TO_HUB="false"
# If PUSH_TO_HUB and PUSH_CHECKPOINTS are both enabled, every saved checkpoint will be pushed to Hugging Face Hub.
export PUSH_CHECKPOINTS="false"
# This will be the model name for your final hub upload, eg. "yourusername/yourmodelname"
# It defaults to the wandb project name, but you can override this here.
export HUB_MODEL_NAME=$TRACKER_PROJECT_NAME

# By default, images will be resized so their SMALLER EDGE is 1024 pixels, maintaining aspect ratio.
# Setting this value to 768px might result in more reasonable training data sizes for SDXL.
# export RESOLUTION=1024
# export RESOLUTION_TYPE="pixel"
# If you want to have the training data resized by pixel area (Megapixels) rather than edge length,
#  set this value to "area" instead of "pixel", and uncomment the next RESOLUTION declaration.
export RESOLUTION=1.0          # 1.0 Megapixel training sizes
export RESOLUTION_TYPE="area"
# If RESOLUTION_TYPE="pixel", the minimum resolution specifies the smaller edge length, measured in pixels. Recommended: 1024.
# If RESOLUTION_TYPE="area", the minimum resolution specifies the total image area, measured in megapixels. Recommended: 1.
export MINIMUM_RESOLUTION=$RESOLUTION

# How many decimals to round aspect buckets to.
#export ASPECT_BUCKET_ROUNDING=2

# Use this to append an instance prompt to each caption, used for adding trigger words.
# This has not been tested in SDXL.
#export INSTANCE_PROMPT="lotr style "
# If you also supply a user prompt library or `--use_prompt_library`, this will be added to those lists.
export VALIDATION_PROMPT="ethnographic photography of teddy bear at a picnic"
export VALIDATION_GUIDANCE=7.5
# You'll want to set this to 0.7 if you are training a terminal SNR model.
export VALIDATION_GUIDANCE_RESCALE=0.0
# How frequently we will save and run a pipeline for validations.
export VALIDATION_STEPS=100
export VALIDATION_NUM_INFERENCE_STEPS=30
export VALIDATION_NEGATIVE_PROMPT="blurry, cropped, ugly"
export VALIDATION_SEED=42
export VALIDATION_RESOLUTION=$RESOLUTION

# Adjust this for your GPU memory size. This, and resolution, are the biggest VRAM killers.
export TRAIN_BATCH_SIZE=8
# Accumulate your update gradient over many steps, to save VRAM while still having higher effective batch size:
# effective batch size = ($TRAIN_BATCH_SIZE * $GRADIENT_ACCUMULATION_STEPS).
export GRADIENT_ACCUMULATION_STEPS=4
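# Worked example (illustrative): with the values above, the per-GPU effective batch size is
#   TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS = 8 * 4 = 32 samples per optimiser update.
# Across the two GPUs configured further below, that is a global effective batch size of 64.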

# Use any standard scheduler type: constant, constant_with_warmup, polynomial, cosine, or sine.
export LR_SCHEDULE="sine"
# A warmup period allows the model, and more importantly the EMA weights, to familiarise themselves with the current data.
# For the cosine or sine schedules, the warmup period defines the interval between peaks or valleys.
# Use a sine schedule to simulate a warmup period, or a cosine schedule to simulate a polynomial start.
#export LR_WARMUP_STEPS=$((MAX_NUM_STEPS / 10))
export LR_WARMUP_STEPS=1000
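# Worked example (illustrative): with MAX_NUM_STEPS=30000, the commented formula above would give
#   $((30000 / 10)) = 3000 warmup steps; the fixed value of 1000 used here warms up three times faster.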

# Caption dropout probability. Set to 0.1 for 10% of captions dropped out. Set to 0 to disable.
# Disable dropout if you want to limit your changes strictly to the prompts you show the model.
# Increase the dropout rate if you want your changes to generalise more broadly across the model.
export CAPTION_DROPOUT_PROBABILITY=0.1

export METADATA_UPDATE_INTERVAL=65
export VAE_BATCH_SIZE=12

# If this is set, any images that fail to open will be DELETED to avoid re-checking them every time.
export DELETE_ERRORED_IMAGES=0
# If this is set, any images that are too small for the minimum resolution size will be DELETED.
export DELETE_SMALL_IMAGES=0

# Bytedance recommends these be set to "trailing" so that inference and training behave in a more congruent manner.
# To follow the original SDXL training strategy, use "leading" instead, though results are generally worse.
export TRAINING_SCHEDULER_TIMESTEP_SPACING="trailing"
export INFERENCE_SCHEDULER_TIMESTEP_SPACING="trailing"

# Removing this option or unsetting it uses vanilla training. Setting it reweights the loss by the position of the timestep in the noise schedule.
# A value "5" is recommended by the researchers. A value of "20" is the least impact, and "1" is the most impact.
export MIN_SNR_GAMMA=5
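# Illustrative formula (epsilon-prediction form from the min-SNR paper, not SimpleTuner's exact code):
#   loss_weight(t) = min(SNR(t), MIN_SNR_GAMMA) / SNR(t)
# so gamma=5 caps the contribution of low-noise (high-SNR) timesteps, gamma=20 barely clips anything,
# and gamma=1 clips most aggressively.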

# Set this to an explicit value of "false" to disable Xformers. Disabling it is probably required for AMD users.
export USE_XFORMERS=false

# There's basically no reason to unset this. However, to disable it, use an explicit value of "false".
# This saves a lot of memory when enabled.
export USE_GRADIENT_CHECKPOINTING=false

##
# Options below here may require a bit more complicated configuration, so they are not simple variables.
##

# TF32 is great on Ampere or Ada, not sure about earlier generations.
export ALLOW_TF32=true
# AdamW 8Bit is a robust and lightweight choice. Adafactor might reduce memory consumption, and Dadaptation is slow and experimental.
# AdamW is the default optimizer, but it uses a lot of memory and is slower than AdamW8Bit or Adafactor.
# Choices: adamw, adamw8bit, adamw_bf16, adafactor, dadaptation
export OPTIMIZER="adamw_bf16"

# EMA is a strong regularisation method that uses a lot of extra VRAM to hold two copies of the weights.
# This is worthwhile on large training runs, but not so much for smaller training runs.
export USE_EMA=false
export EMA_DECAY=0.999
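# Illustrative update rule (standard EMA, not SimpleTuner's exact code):
#   ema_weight = EMA_DECAY * ema_weight + (1 - EMA_DECAY) * current_weight
# so a decay of 0.999 averages over roughly the last 1 / (1 - 0.999) = 1000 optimiser steps.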

export TRAINER_EXTRA_ARGS=""
## For offset noise training:
# Not recommended for terminal SNR models.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --offset_noise --noise_offset=0.02"

## For terminal SNR training:
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --prediction_type=v_prediction --rescale_betas_zero_snr"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --training_scheduler_timestep_spacing=trailing --inference_scheduler_timestep_spacing=trailing"
## You may benefit from directing training toward a specific weighted subset of timesteps.
# In this example, we train the final 25% of the timestep schedule with a 3x bias.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --timestep_bias_strategy=later --timestep_bias_portion=0.25 --timestep_bias_multiplier=3"
# In this example, we train the earliest 25% of the timestep schedule with a 5x bias.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --timestep_bias_strategy=earlier --timestep_bias_portion=0.25 --timestep_bias_multiplier=5"
# Here, we designate that specifically, timesteps 200 to 500 should be prioritised.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --timestep_bias_strategy=range --timestep_bias_begin=200 --timestep_bias_end=500 --timestep_bias_multiplier=3"

## For experimental min-SNR weighted loss training (5 is suggested value by the original researchers):
# Not recommended for terminal SNR models.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --snr_gamma=5.0"

# For Wasabi S3 filesystem backend (experimental)
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --data_backend=aws --aws_bucket_name=test123"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --aws_endpoint_url=https://s3.wasabisys.com"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --aws_access_key=1234567890"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --aws_secret_access_key=0987654321"

# Reproducible training. Set to -1 to disable.
export TRAINING_SEED=42

# Mixed precision is the best. You honestly might need to YOLO it in fp16 mode for Google Colab type setups.
export MIXED_PRECISION="bf16"                # Might not be supported on all GPUs. fp32 will be needed for others.
export PURE_BF16=true

# This has to be changed if you're training with multiple GPUs.
export TRAINING_NUM_PROCESSES=2
export TRAINING_NUM_MACHINES=1
export ACCELERATE_EXTRA_ARGS="--multi_gpu"                         # --multi_gpu or other similar flags for huggingface accelerate

# With PyTorch 2.1, you might have pretty good luck enabling torch compile here.
# If you're using aspect bucketing, however, each resolution change will trigger a recompile; PyTorch 2.2's dynamic-shape support makes this less painful.
export TRAINING_DYNAMO_BACKEND='no'                # 'no' disables torch compile; set a dynamo backend (eg. 'inductor') to enable it, if your platform supports it (AMD may not)

export TOKENIZERS_PARALLELISM=false

# For more debugging info
export SIMPLETUNER_LOG_LEVEL=DEBUG
bghira commented 5 months ago

a kind of hurricane in my area is disrupting comms. so i might be in-and-out.

but i've pushed an option to use GPU seeds by default with CPU seeds as opt-in. is it easy to run this test or does it take a while?

MohamedAliRashad commented 5 months ago

@bghira No, it's easy. I will test your latest commit

MohamedAliRashad commented 5 months ago

@bghira The new error:

2024-06-18 19:59:59,763 [ERROR] (helpers.training.validation) Error generating validation image: Cannot generate a cpu tensor from a generator of type cuda., Traceback (most recent call last):          | 0/2 [00:00<?, ?it/s]
  File "/shared_volume/development/text_to_image/SimpleTuner/helpers/training/validation.py", line 727, in validate_prompt
    validation_image_results = self.pipeline(**pipeline_kwargs).images
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/helpers/sd3/pipeline.py", line 902, in __call__
    latents = self.prepare_latents(
              ^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/helpers/sd3/pipeline.py", line 677, in prepare_latents
    latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared_volume/development/text_to_image/SimpleTuner/.venv/lib/python3.11/site-packages/diffusers/utils/torch_utils.py", line 67, in randn_tensor
    raise ValueError(f"Cannot generate a {device} tensor from a generator of type {gen_device_type}.")
ValueError: Cannot generate a cpu tensor from a generator of type cuda.

bghira commented 5 months ago

huh.. everything should already be on the GPU. it's frustrating that i cannot reproduce this issue locally, but i will take a look

bghira commented 5 months ago

so i think this is the same error manifested in MPS form:

    latents = torch.randn(shape, generator=generator, device=rand_device, dtype=dtype, layout=layout).to(device)
RuntimeError: Placeholder storage has not been allocated on MPS device!
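For reference, the device mismatch can be reproduced, and avoided, directly with diffusers' randn_tensor helper. This is a minimal sketch assuming a CUDA machine; the latent shape and seed are made up for illustration and are not the trainer's actual values:

import torch
from diffusers.utils.torch_utils import randn_tensor

shape = (1, 16, 128, 128)  # hypothetical latent shape, purely for illustration

# A CUDA generator combined with a CPU target device raises the ValueError seen above.
cuda_generator = torch.Generator(device="cuda").manual_seed(42)
try:
    randn_tensor(shape, generator=cuda_generator, device=torch.device("cpu"), dtype=torch.float32)
except ValueError as err:
    print(err)  # Cannot generate a cpu tensor from a generator of type cuda.

# Creating the generator on the same device the latents are prepared on avoids the error.
device = torch.device("cuda")
generator = torch.Generator(device=device).manual_seed(42)
latents = randn_tensor(shape, generator=generator, device=device, dtype=torch.float32)
print(latents.device)  # cuda:0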
bghira commented 5 months ago

well, i was able to reproduce the issue. and then, on the 2nd launch, it works! i made no changes in-between. i freshly recreated the text cache, and it doesn't bring the issue back.

MohamedAliRashad commented 5 months ago

I will delete the caches and try again

MohamedAliRashad commented 5 months ago

@bghira It worked.

Thank you