bghira / SimpleTuner

A general fine-tuning kit geared toward diffusion models.
GNU Affero General Public License v3.0

IndexError: too many indices for tensor of dimension 4 #355

Closed · komninoschatzipapas closed this issue 5 months ago

komninoschatzipapas commented 5 months ago

I'm getting an odd error when launching an SDXL fine-tune on a 1xA100 PCIe node. The training starts, but as it's trying to finish the first step, it errors out:

Epoch 1/1, Steps:   0%|                                       | 1/686 [01:52<12:09:46, 63.92s/it, lr=8e-10, step_loss=0.0885]ERROR RUNNING GUARDS __setitem__ /workspace/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/utils/outputs.py:120
lambda L, **___kwargs_ignored:
  ___guarded_code.valid and
  ___check_global_state() and
  ___check_type_id(L['key'], 94745927993280) and
  L['key'] == 'sample' and
  ___check_type_id(L['self'], 94746043850016) and
  L['self'].sample['sample'] is L['value'] and
  ___check_obj_id(L['__class__'], 94746018006992) and
  ___check_type_id(L['self'].sample, 94745928005056) and
  set(L['self'].sample.keys()) == {'sample'} and
  hasattr(L['self'].sample['sample'], '_dynamo_dynamic_indices') == False and
  utils_device.CURRENT_DEVICE == None and
  (___skip_backend_check() or ___current_backend() == ___lookup_backend(139714280842352)) and
  ___compile_config_hash() == '40021786b19e902157e2d1176772cf00' and
  not ___needs_nopython() and
  ___check_tensors(L['self'].sample['sample'], tensor_check_names=tensor_check_names)
Traceback (most recent call last):
  File "/workspace/SimpleTuner/train_sdxl.py", line 1633, in <module>
    main()
  File "/workspace/SimpleTuner/train_sdxl.py", line 1236, in main
    model_pred = unet(
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 687, in forward
    return model_forward(*args, **kwargs)
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 675, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/unets/unet_2d_condition.py", line 1312, in forward
    return UNet2DConditionOutput(sample=sample)
  File "<string>", line 3, in __init__
  File "<string>", line 4, in resume_in___init__
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/utils/outputs.py", line 93, in __post_init__
    self[field.name] = v
  File "<string>", line 16, in guard
IndexError: too many indices for tensor of dimension 4

I have tried changing my config, using another node with a different GPU, and running this on both Vast and Runpod, with no luck. This issue seems too low-level for me to debug further.

Here are my configs:

sdxl-env.sh

export SIMPLETUNER_LOG_LEVEL=DEBUG
export SIMPLETUNER_TRAINING_LOOP_LOG_LEVEL=DEBUG
export MODEL_TYPE='full'
export USE_BITFIT=false
export LEARNING_RATE=8e-7 #@param {type:"number"}
export MODEL_NAME="orionsoftware/juggernaut-XL-v9"
export DEBUG_EXTRA_ARGS="--report_to=wandb"
export TRACKER_PROJECT_NAME="sdxl-training"
export TRACKER_RUN_NAME="simpletuner-sdxl"
export NUM_EPOCHS=1
export MAX_NUM_STEPS=0
export DATALOADER_CONFIG="/workspace/multidatabackend.json"
export OUTPUT_DIR="/workspace/output"
export RESOLUTION=832
export RESOLUTION_TYPE="pixel"
export MINIMUM_RESOLUTION=$RESOLUTION
export TRAIN_BATCH_SIZE=1
export GRADIENT_ACCUMULATION_STEPS=1
export LR_SCHEDULE="polynomial"
export CAPTION_DROPOUT_PROBABILITY=0.03
export METADATA_UPDATE_INTERVAL=65
export VAE_BATCH_SIZE=4
export DELETE_ERRORED_IMAGES=0
export DELETE_SMALL_IMAGES=0
export TRAINING_SCHEDULER_TIMESTEP_SPACING="trailing"
export INFERENCE_SCHEDULER_TIMESTEP_SPACING="trailing"
export MIN_SNR_GAMMA=5
export USE_XFORMERS=true
export USE_GRADIENT_CHECKPOINTING=true
export ALLOW_TF32=true
export OPTIMIZER="adamw_bf16"
export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --prediction_type=v_prediction --rescale_betas_zero_snr"
export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --training_scheduler_timestep_spacing=trailing --inference_scheduler_timestep_spacing=trailing"
export TRAINING_SEED=-1
export MIXED_PRECISION="bf16"
export PURE_BF16=true
export TRAINING_DYNAMO_BACKEND='inductor'
multidatabackend.json

[
  {
    "id": "orion",
    "dataset_type": "image",
    "type": "local",
    "instance_data_dir": "/workspace/combined",
    "crop": false,
    "caption_strategy": "textfile",
    "text_embeds": "alt-embed-cache",
    "scan_for_errors": true,
    "repeats": 1,
    "minimum_image_size": 1.0,
    "resolution": 832
  },
  {
    "id": "alt-embed-cache",
    "dataset_type": "text_embeds",
    "default": true,
    "type": "local",
    "cache_dir": "/workspace/textembed_cache"
  }
]

And the logs I managed to retrieve:

stdout

2024-04-18 15:55:22,639 [INFO] (__main__) ***** Running training *****
2024-04-18 15:55:22,639 [INFO] (__main__) -> Num batches = 686
2024-04-18 15:55:22,639 [INFO] (__main__) -> Num Epochs = 1
2024-04-18 15:55:22,640 [INFO] (__main__) -> Current Epoch = 1
2024-04-18 15:55:22,640 [INFO] (__main__) -> Instantaneous batch size per device = 1
2024-04-18 15:55:22,640 [INFO] (__main__) -> Gradient Accumulation steps = 1
2024-04-18 15:55:22,640 [INFO] (__main__) -> Total train batch size (w. parallel, distributed & accumulation) = 1
2024-04-18 15:55:22,640 [INFO] (__main__) -> Total optimization steps = 686
2024-04-18 15:55:22,640 [INFO] (__main__) -> Total optimization steps remaining = 686
Epoch 1/1, Steps:   0%|                                       | 1/686 [01:52<12:09:46, 63.92s/it, lr=8e-10, step_loss=0.0885]ERROR RUNNING GUARDS __setitem__ /workspace/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/utils/outputs.py:120
lambda L, **___kwargs_ignored:
  ___guarded_code.valid and
  ___check_global_state() and
  ___check_type_id(L['key'], 94745927993280) and
  L['key'] == 'sample' and
  ___check_type_id(L['self'], 94746043850016) and
  L['self'].sample['sample'] is L['value'] and
  ___check_obj_id(L['__class__'], 94746018006992) and
  ___check_type_id(L['self'].sample, 94745928005056) and
  set(L['self'].sample.keys()) == {'sample'} and
  hasattr(L['self'].sample['sample'], '_dynamo_dynamic_indices') == False and
  utils_device.CURRENT_DEVICE == None and
  (___skip_backend_check() or ___current_backend() == ___lookup_backend(139714280842352)) and
  ___compile_config_hash() == '40021786b19e902157e2d1176772cf00' and
  not ___needs_nopython() and
  ___check_tensors(L['self'].sample['sample'], tensor_check_names=tensor_check_names)
Traceback (most recent call last):
  File "/workspace/SimpleTuner/train_sdxl.py", line 1633, in <module>
    main()
  File "/workspace/SimpleTuner/train_sdxl.py", line 1236, in main
    model_pred = unet(
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 687, in forward
    return model_forward(*args, **kwargs)
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 675, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/unets/unet_2d_condition.py", line 1312, in forward
    return UNet2DConditionOutput(sample=sample)
  File "<string>", line 3, in __init__
  File "<string>", line 4, in resume_in___init__
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/utils/outputs.py", line 93, in __post_init__
    self[field.name] = v
  File "<string>", line 16, in guard
IndexError: too many indices for tensor of dimension 4
wandb: | 0.038 MB of 0.038 MB uploaded
wandb: Run history:
wandb: epoch ▁
wandb: learning_rate ▁
wandb: optimization_loss ▁
wandb: train_loss ▁
wandb: train_luminance ▁
wandb:
wandb: Run summary:
wandb: epoch 1
wandb: learning_rate 0.0
wandb: optimization_loss 0.08845
wandb: train_loss 0.08845
wandb: train_luminance 137.7224
wandb:
wandb: 🚀 View run simpletuner-sdxl at: https://wandb.ai/komninos/sdxl-training/runs/fd42599c363378a02341dd1e12629cb0/workspace
wandb: Synced 5 W&B file(s), 1 media file(s), 1 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240418_155521-fd42599c363378a02341dd1e12629cb0/logs
Traceback (most recent call last):
  File "/workspace/SimpleTuner/.venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1023, in launch_command
    simple_launcher(args)
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 643, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/workspace/SimpleTuner/.venv/bin/python', 'train_sdxl.py', '--model_type=full', '--pretrained_model_name_or_path=orionsoftware/juggernaut-XL-v9', '--enable_xformers_memory_efficient_attention', '--gradient_checkpointing', '--set_grads_to_none', '--gradient_accumulation_steps=1', '--resume_from_checkpoint=latest', '--snr_gamma=5', '--data_backend_config=/workspace/multidatabackend.json', '--num_train_epochs=1', '--max_train_steps=0', '--metadata_update_interval=65', '--adam_bfloat16', '--learning_rate=8e-7', '--lr_scheduler=polynomial', '--seed', '-1', '--lr_warmup_steps=1000', '--output_dir=/workspace/output', '--inference_scheduler_timestep_spacing=trailing', '--training_scheduler_timestep_spacing=trailing', '--report_to=wandb', '--allow_tf32', '--mixed_precision=bf16', '--vae_dtype=bf16', '--prediction_type=v_prediction', '--rescale_betas_zero_snr', '--training_scheduler_timestep_spacing=trailing', '--inference_scheduler_timestep_spacing=trailing', '--train_batch=1', '--caption_dropout_probability=0.03', '--validation_prompt=ethnographic photography of teddy bear at a picnic', '--num_validation_images=1', '--validation_num_inference_steps=30', '--validation_seed=42', '--minimum_image_size=832', '--resolution=832', '--validation_resolution=1024', '--resolution_type=pixel', '--checkpointing_steps=150', '--checkpoints_total_limit=2', '--validation_steps=100', '--tracker_run_name=simpletuner-sdxl', '--tracker_project_name=sdxl-training', '--validation_guidance=7.5', '--validation_guidance_rescale=0.0', '--validation_negative_prompt=blurry, cropped, ugly']' returned non-zero exit status 1.
debug.log https://pastebin.com/HMKvpqqi

I'm happy to provide SSH access to a machine with the files already set up to reproduce this error.

bghira commented 5 months ago

That's just a torch inductor error, I believe. Set the dynamo backend to 'no'.
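
In practice that is a one-line edit to the environment file. A minimal sketch of the change, assuming the sdxl-env.sh shown above; setting the backend to 'no' disables torch.compile, so the dynamo guard machinery that appears in the traceback is bypassed:

# In sdxl-env.sh, replace the existing backend setting
# (previously: export TRAINING_DYNAMO_BACKEND='inductor')
export TRAINING_DYNAMO_BACKEND='no'

Assuming SimpleTuner forwards this value to Accelerate's dynamo backend selection, no other config changes should be needed before relaunching the training run.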