Open GreenTeaBD opened 1 year ago
Can you test the same thing with cudatoolkit 10.2, using a venv created with conda or micromamba?

    conda create -p ./venv python==3.9 cudatoolkit==10.2 -c conda-forge -y && ./venv/bin/python launch.py
This mattered less before, but recent commits of the automatic1111 web-ui refuse to work with my depth models, at least the ones trained with the dreambooth depth model script, because they have a broken unet, CLIP, and vae. See this issue for the webui.
Non-depth models trained with the ShivamShrirao repo don't seem to have this problem.
Running the models through the Model Toolkit extension in the webui outputs this:
Architecture
Additional
Rejected
Unknown
Using the models gives a "modules.devices.NansException: A tensor with all NaNs was produced in Unet" error. Disabling the NaN check makes it "work", but it just outputs a black image.
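To narrow down which part of a model is broken, something like the following might help. It is only a rough sketch, not what the webui or Model Toolkit actually does: it assumes the weights have already been loaded and flattened into a mapping of tensor name to a flat sequence of floats (in PyTorch that could be `tensor.flatten().tolist()` for each entry of the state dict), and it assumes the component name is whatever comes before the first dot in the key. The function name `count_non_finite` and the toy key names are made up for illustration.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Mapping


def count_non_finite(weights: Mapping[str, Iterable[float]]) -> Dict[str, int]:
    """Count NaN/Inf values per top-level component of a flattened state dict.

    `weights` maps a tensor name (e.g. "unet.conv_in.weight") to its values as
    a flat sequence of floats. The component is taken to be the text before the
    first dot, which is an assumption about how the keys are named.
    """
    bad = Counter()
    for name, values in weights.items():
        component = name.split(".", 1)[0]
        n_bad = sum(1 for v in values if not math.isfinite(v))
        if n_bad:
            bad[component] += n_bad
    return dict(bad)


# Toy data standing in for real weights (hypothetical names and values).
example = {
    "unet.conv_in.weight": [0.1, float("nan"), 0.3],
    "unet.conv_in.bias": [float("inf")],
    "vae.decoder.weight": [0.5, -0.2],
    "text_encoder.embed.weight": [float("nan")],
}
report = count_non_finite(example)
print(report)
```

Components that do not appear in the result have only finite values. A real checkpoint may use different key prefixes, so the grouping rule would need adjusting.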
I trained a lot of depth models that error out in this way. Generally the training script looked like this:

    export MODEL_NAME="stabilityai/stable-diffusion-2-depth"
    export INSTANCE_DIR="training/skscodysmall"
    export CLASS_DIR="classes/man_unsplash"
    export OUTPUT_DIR="model"

    accelerate launch train_dreambooth.py \
      --pretrained_model_name_or_path=$MODEL_NAME \
      --pretrained_txt2img_model_name_or_path="stabilityai/stable-diffusion-2-1-base" \
      --instance_data_dir=$INSTANCE_DIR \
      --class_data_dir=$CLASS_DIR \
      --output_dir=$OUTPUT_DIR \
      --with_prior_preservation --prior_loss_weight=1.0 \
      --instance_prompt="skscody" \
      --class_prompt="man" \
      --seed=1337 \
      --resolution=512 \
      --train_batch_size=1 \
      --gradient_accumulation_steps=1 --gradient_checkpointing \
      --learning_rate=5e-6 \
      --lr_scheduler="constant" \
      --lr_warmup_steps=0 \
      --num_class_images=400 \
      --sample_batch_size=1 \
      --max_train_steps=3000 \
The accelerate config is:

    compute_environment: LOCAL_MACHINE
    deepspeed_config:
      gradient_accumulation_steps: 1
      offload_optimizer_device: cpu
      offload_param_device: cpu
      zero3_init_flag: false
      zero_stage: 2
    distributed_type: DEEPSPEED
    downcast_bf16: 'no'
    dynamo_backend: 'NO'
    fsdp_config: {}
    machine_rank: 0
    main_training_function: main
    megatron_lm_config: {}
    mixed_precision: 'no'
    num_machines: 1
    num_processes: 1
    rdzv_backend: static
    same_network: true
    use_cpu: false