I'm getting an odd error when launching an SDXL fine-tune on a 1xA100 PCIe node. The training starts, but as it's trying to finish the first step it errors out:
Epoch 1/1, Steps: 0%| | 1/686 [01:52<12:09:46, 63.92s/it, lr=8e-10, step_loss=0.0885]ERROR RUNNING GUARDS __setitem__ /workspace/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/utils/outputs.py:120
lambda L, **___kwargs_ignored:
___guarded_code.valid and
___check_global_state() and
___check_type_id(L['key'], 94745927993280) and
L['key'] == 'sample' and
___check_type_id(L['self'], 94746043850016) and
L['self'].sample['sample'] is L['value'] and
___check_obj_id(L['__class__'], 94746018006992) and
___check_type_id(L['self'].sample, 94745928005056) and
set(L['self'].sample.keys()) == {'sample'} and
hasattr(L['self'].sample['sample'], '_dynamo_dynamic_indices') == False and
utils_device.CURRENT_DEVICE == None and
(___skip_backend_check() or ___current_backend() == ___lookup_backend(139714280842352)) and
___compile_config_hash() == '40021786b19e902157e2d1176772cf00' and
not ___needs_nopython() and
___check_tensors(L['self'].sample['sample'], tensor_check_names=tensor_check_names)
Traceback (most recent call last):
File "/workspace/SimpleTuner/train_sdxl.py", line 1633, in <module>
main()
File "/workspace/SimpleTuner/train_sdxl.py", line 1236, in main
model_pred = unet(
File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
return fn(*args, **kwargs)
File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 687, in forward
return model_forward(*args, **kwargs)
File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 675, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
return func(*args, **kwargs)
File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/unets/unet_2d_condition.py", line 1312, in forward
return UNet2DConditionOutput(sample=sample)
File "<string>", line 3, in __init__
File "<string>", line 4, in resume_in___init__
File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/utils/outputs.py", line 93, in __post_init__
self[field.name] = v
File "<string>", line 16, in guard
IndexError: too many indices for tensor of dimension 4
I have tried changing my config, using another node with a different GPU, and running this on both Vast & Runpod, with no luck. This issue seems too low-level for me to debug further.
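In case it helps with triage, here is a minimal sketch I put together (my own reduction, not code from SimpleTuner; the function name and tensor shapes are made up) that tries to hit the same guarded BaseOutput.__setitem__ from the traceback by constructing the diffusers output class inside torch.compile with the inductor backend:

import torch
from diffusers.models.unets.unet_2d_condition import UNet2DConditionOutput

# Compile a tiny function so dynamo installs guards around the diffusers
# output machinery, analogous to the compiled UNet forward in train_sdxl.py.
@torch.compile(backend="inductor")
def wrap_output(sample):
    # Mirrors the last line of UNet2DConditionModel.forward(): constructing
    # the dataclass runs BaseOutput.__post_init__, which does
    # self[field.name] = v -- the guarded __setitem__ in the error above.
    return UNet2DConditionOutput(sample=sample).sample

latents = torch.randn(1, 4, 104, 104)  # 4D latent-shaped tensor, like the UNet output
print(wrap_output(latents).shape)
# A second call with a different shape re-evaluates the guards and recompiles,
# exercising the same guard-evaluation path that fails in my run.
print(wrap_output(torch.randn(1, 4, 96, 96)).shape)

If that snippet reproduces the IndexError on the affected node, it would point at the torch/diffusers combination rather than anything in my training config.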
Here are my configs:
sdxl-env.sh
export SIMPLETUNER_LOG_LEVEL=DEBUG
export SIMPLETUNER_TRAINING_LOOP_LOG_LEVEL=DEBUG
export MODEL_TYPE='full'
export USE_BITFIT=false
export LEARNING_RATE=8e-7 #@param {type:"number"}
export MODEL_NAME="orionsoftware/juggernaut-XL-v9"
export DEBUG_EXTRA_ARGS="--report_to=wandb"
export TRACKER_PROJECT_NAME="sdxl-training"
export TRACKER_RUN_NAME="simpletuner-sdxl"
export NUM_EPOCHS=1
export MAX_NUM_STEPS=0
export DATALOADER_CONFIG="/workspace/multidatabackend.json"
export OUTPUT_DIR="/workspace/output"
export RESOLUTION=832
export RESOLUTION_TYPE="pixel"
export MINIMUM_RESOLUTION=$RESOLUTION
export TRAIN_BATCH_SIZE=1
export GRADIENT_ACCUMULATION_STEPS=1
export LR_SCHEDULE="polynomial"
export CAPTION_DROPOUT_PROBABILITY=0.03
export METADATA_UPDATE_INTERVAL=65
export VAE_BATCH_SIZE=4
export DELETE_ERRORED_IMAGES=0
export DELETE_SMALL_IMAGES=0
export TRAINING_SCHEDULER_TIMESTEP_SPACING="trailing"
export INFERENCE_SCHEDULER_TIMESTEP_SPACING="trailing"
export MIN_SNR_GAMMA=5
export USE_XFORMERS=true
export USE_GRADIENT_CHECKPOINTING=true
export ALLOW_TF32=true
export OPTIMIZER="adamw_bf16"
export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --prediction_type=v_prediction --rescale_betas_zero_snr"
export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --training_scheduler_timestep_spacing=trailing --inference_scheduler_timestep_spacing=trailing"
export TRAINING_SEED=-1
export MIXED_PRECISION="bf16"
export PURE_BF16=true
export TRAINING_DYNAMO_BACKEND='inductor'
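As an aside, my (unverified) understanding of that last setting is that TRAINING_DYNAMO_BACKEND='inductor' has Accelerate wrap the prepared UNet's forward in torch.compile, roughly like the toy stand-in below, which is why every training step re-evaluates a guard set like the one dumped in the error:

import torch

# Toy stand-in for the SDXL UNet; only the wrapping matters here, not the model.
model = torch.nn.Linear(8, 8)

# My assumption of what the env setting amounts to downstream: the model's
# forward runs under torch._dynamo with the inductor backend, and guards like
# the ones printed in the error are re-checked on every call.
compiled = torch.compile(model, backend="inductor")
out = compiled(torch.randn(2, 8))  # the first call builds the graph and its guards

If that picture is right, the failure sits in the compile/guard layer rather than anything dataset-specific, which would match config and node changes making no difference.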
[ { "id": "orion", "dataset_type": "image", "type": "local", "instance_data_dir": "/workspace/combined", "crop": false, "caption_strategy": "textfile", "text_embeds": "alt-embed-cache", "scan_for_errors": true, "repeats": 1, "minimum_image_size": 1.0, "resolution": 832 }, { "id": "alt-embed-cache", "dataset_type": "text_embeds", "default": true, "type": "local", "cache_dir": "/workspace/textembed_cache" } ]And the logs I managed to retrieve:
stdout
2024-04-18 15:55:22,639 [INFO] (__main__) ***** Running training *****
2024-04-18 15:55:22,639 [INFO] (__main__) -> Num batches = 686
2024-04-18 15:55:22,639 [INFO] (__main__) -> Num Epochs = 1
2024-04-18 15:55:22,640 [INFO] (__main__) -> Current Epoch = 1
2024-04-18 15:55:22,640 [INFO] (__main__) -> Instantaneous batch size per device = 1
2024-04-18 15:55:22,640 [INFO] (__main__) -> Gradient Accumulation steps = 1
2024-04-18 15:55:22,640 [INFO] (__main__) -> Total train batch size (w. parallel, distributed & accumulation) = 1
2024-04-18 15:55:22,640 [INFO] (__main__) -> Total optimization steps = 686
2024-04-18 15:55:22,640 [INFO] (__main__) -> Total optimization steps remaining = 686
Epoch 1/1, Steps: 0%| | 1/686 [01:52<12:09:46, 63.92s/it, lr=8e-10, step_loss=0.0885]ERROR RUNNING GUARDS __setitem__ /workspace/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/utils/outputs.py:120
lambda L, **___kwargs_ignored:
___guarded_code.valid and
___check_global_state() and
___check_type_id(L['key'], 94745927993280) and
L['key'] == 'sample' and
___check_type_id(L['self'], 94746043850016) and
L['self'].sample['sample'] is L['value'] and
___check_obj_id(L['__class__'], 94746018006992) and
___check_type_id(L['self'].sample, 94745928005056) and
set(L['self'].sample.keys()) == {'sample'} and
hasattr(L['self'].sample['sample'], '_dynamo_dynamic_indices') == False and
utils_device.CURRENT_DEVICE == None and
(___skip_backend_check() or ___current_backend() == ___lookup_backend(139714280842352)) and
___compile_config_hash() == '40021786b19e902157e2d1176772cf00' and
not ___needs_nopython() and
___check_tensors(L['self'].sample['sample'], tensor_check_names=tensor_check_names)
Traceback (most recent call last):
File "/workspace/SimpleTuner/train_sdxl.py", line 1633, in
debug.log
https://pastebin.com/HMKvpqqi
I'm happy to provide SSH access to a machine with the files already set up to reproduce this error.