Closed whulizheng closed 10 months ago
Thank you for your interest in our project! We'll try to reproduce your issue of NaN loss for ASPL with Stable Diffusion 1.4 later today. In the meantime, could you please provide some more details that would help us investigate:
- Did you modify the default script at all, or are you using it as-is?
- Which versions of PyTorch, Transformers, and Diffusers are you using?
- Does this happen right away in the first epoch, or after training for some amount of time?
Hi, thanks a lot. I only modified the model path in the script, and PyTorch, Transformers, and Diffusers are all the versions pinned in requirements.txt. Only the first epoch outputs a normal loss; after that, everything becomes "nan", like this:
Step #0, loss: 0.23414941132068634, prior_loss: 0.2278386950492859, instance_loss: 0.006310714408755302
Step #1, loss: nan, prior_loss: nan, instance_loss: nan
Step #2, loss: nan, prior_loss: nan, instance_loss: nan
PGD loss - step 0, loss: nan
PGD loss - step 1, loss: nan
PGD loss - step 2, loss: nan
Sorry, I can't reproduce your issue. I've tested bf16, fp16, and no by changing mixed_precision in attack_with_aspl.sh.
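(As a side note, here is a minimal sketch of that precision sweep, assuming mixed_precision is passed as the `accelerate launch` flag and using `train_aspl.py` as a placeholder for the training script that attack_with_aspl.sh actually invokes:)

```bash
# Sweep the three precision modes mentioned above; train_aspl.py and its
# arguments are placeholders for whatever attack_with_aspl.sh really runs.
for PRECISION in no fp16 bf16; do
  accelerate launch --mixed_precision=$PRECISION train_aspl.py  # ...other script arguments
done
```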
Below is the expected output in the terminal:
bash scripts/attack_with_aspl.sh
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_processes` was set to a value of `1`
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
11/09/2023 17:39:43 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: no
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'timestep_post_act', 'resnet_time_scale_shift', 'mid_block_type', 'time_embedding_act_fn', 'addition_time_embed_dim', 'addition_embed_type', 'class_embed_type', 'use_linear_projection', 'projection_class_embeddings_input_dim', 'transformer_layers_per_block', 'dual_cross_attention', 'resnet_skip_time_act', 'addition_embed_type_num_heads', 'cross_attention_norm', 'time_embedding_dim', 'encoder_hid_dim_type', 'mid_block_only_cross_attention', 'time_cond_proj_dim', 'attention_type', 'num_class_embeds', 'resnet_out_scale_factor', 'encoder_hid_dim', 'num_attention_heads', 'time_embedding_type', 'conv_out_kernel', 'upcast_attention', 'only_cross_attention', 'conv_in_kernel', 'class_embeddings_concat'} was not found in config. Values will be initialized to default values.
{'clip_sample_range', 'thresholding', 'prediction_type', 'timestep_spacing', 'sample_max_value', 'dynamic_thresholding_ratio', 'variance_type'} was not found in config. Values will be initialized to default values.
{'scaling_factor', 'force_upcast', 'norm_num_groups'} was not found in config. Values will be initialized to default values.
Step #0, loss: 0.752631425857544, prior_loss: 0.7393704652786255, instance_loss: 0.013260948471724987
Step #1, loss: 0.18438610434532166, prior_loss: 0.1214214637875557, instance_loss: 0.06296463310718536
Step #2, loss: 0.23523551225662231, prior_loss: 0.13061320781707764, instance_loss: 0.10462230443954468
PGD loss - step 0, loss: 0.06669247150421143
PGD loss - step 1, loss: 0.23700952529907227
PGD loss - step 2, loss: 0.17454129457473755
PGD loss - step 3, loss: 0.30680063366889954
PGD loss - step 4, loss: 0.2727632522583008
PGD loss - step 5, loss: 0.3792399764060974
Step #0, loss: 0.5648417472839355, prior_loss: 0.5227689146995544, instance_loss: 0.042072828859090805
Step #1, loss: 0.244808629155159, prior_loss: 0.2364426851272583, instance_loss: 0.008365947753190994
Step #2, loss: 0.31962481141090393, prior_loss: 0.0035879784263670444, instance_loss: 0.31603682041168213
Since I can't reproduce it, could you please double-check by re-running the default script attack_with_aspl.sh (changing the Stable Diffusion path to your correct SD path) and let me know your hardware specs? That would really help me understand the difference between our environments.
Hi, after double-checking and re-running, it still happens.
However, when I disabled the "--enable_xformers_memory_efficient_attention" flag, the loss went back to normal, like this:
11/09/2023 11:12:49 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: no
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'class_embed_type', 'resnet_time_scale_shift', 'projection_class_embeddings_input_dim', 'upcast_attention', 'dual_cross_attention', 'conv_out_kernel', 'use_linear_projection', 'timestep_post_act', 'only_cross_attention', 'num_class_embeds', 'mid_block_type', 'time_cond_proj_dim', 'time_embedding_type', 'conv_in_kernel'} was not found in config. Values will be initialized to default values.
{'variance_type', 'clip_sample_range', 'prediction_type'} was not found in config. Values will be initialized to default values.
{'norm_num_groups'} was not found in config. Values will be initialized to default values.
Step #0, loss: 0.11835033446550369, prior_loss: 0.06459389626979828, instance_loss: 0.053756438195705414
Step #1, loss: 0.47151514887809753, prior_loss: 0.0061028883792459965, instance_loss: 0.4654122591018677
Step #2, loss: 0.3747083842754364, prior_loss: 0.309255450963974, instance_loss: 0.0654529333114624
My xformers was installed via "pip install -r requirements.txt", and my GPU is an NVIDIA RTX A6000 with driver version 535.129.03. I guess it's a problem between xformers and my GPU driver, but it is still strange that SD 2.1 works with xformers while SD 1.4 fails in the same environment with the same config.
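(To narrow down whether the installed xformers build matches the local CUDA/driver stack, a quick diagnostic along these lines can help, assuming xformers' own info entry point is available in the environment:)

```bash
# Print the installed xformers build, the CUDA it was compiled against,
# and which memory-efficient attention kernels are actually usable.
python -m xformers.info

# Cross-check against the driver/runtime reported by the NVIDIA tools.
nvidia-smi
```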
Yeah, xformers is causing NaN losses in other repos as well; try a different xformers version to see if that fixes it.
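(A hedged sketch of that, assuming the environment is managed with pip; `<version>` is a placeholder for whichever xformers build matches the installed torch/CUDA combination:)

```bash
# See which xformers and torch versions are currently installed.
pip show xformers torch

# Install a different xformers build; --no-deps keeps pip from silently
# upgrading torch along with it. <version> is a placeholder.
pip install --no-deps xformers==<version>
```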
Hello, interesting work! I want to test SD 1.4 with the default script, but the loss becomes "nan" after the first epoch. I also tried fp32, but it's still the same. Could you please provide ASPL scripts for SD 1.4? Thanks a lot.