VinAIResearch / Anti-DreamBooth

Anti-DreamBooth: Protecting users from personalized text-to-image synthesis (ICCV 2023)
https://vinairesearch.github.io/Anti-DreamBooth/
GNU Affero General Public License v3.0

Failed to train and output noise-ckpt in Google Colab #3

Closed lbj96347 closed 1 year ago

lbj96347 commented 1 year ago

Background

I am trying to set up a workflow in Google Colab to try out Anti-DreamBooth. Here is my ipynb. However, it fails at Step 3 when running

!bash /content/Anti-DreamBooth/scripts/attack_with_aspl.sh
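For context, the surrounding Colab cells look roughly like the sketch below. The requirements.txt install step is my assumption about how the notebook sets up dependencies, not something documented in this issue:

# Rough sketch of the Colab setup cells (assumption: the repo's requirements.txt covers the Python dependencies)
!git clone https://github.com/VinAIResearch/Anti-DreamBooth.git /content/Anti-DreamBooth
%cd /content/Anti-DreamBooth
!pip install -r requirements.txt
!nvidia-smi   # confirm which GPU the Colab runtime provides
!bash /content/Anti-DreamBooth/scripts/attack_with_aspl.sh   # Step 3 from the notebook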

Google Colab Virtual ENV:


Running attack_with_aspl.sh invokes Anti-DreamBooth/attacks/aspl.py to run the perturbation training, which is expected to write noise-ckpt to the output directory. But it doesn't.

When Anti-DreamBooth/attacks/aspl.py reaches line 715, it throws an error.

/content/Anti-DreamBooth/scripts/attack_with_aspl.sh: line 36:  1381 Killed                  /usr/bin/python3 /content/Anti-DreamBooth/attacks/aspl.py 

--pretrained_model_name_or_path=$MODEL_PATH 
--enable_xformers_memory_efficient_attention 
--instance_data_dir_for_train=$CLEAN_TRAIN_DIR 
--instance_data_dir_for_adversarial=$CLEAN_ADV_DIR 
--instance_prompt="a photo of sks person" 
--class_data_dir=$CLASS_DIR 
--num_class_images=200 
--class_prompt="a photo of person" 
--output_dir=$OUTPUT_DIR --center_crop 
--with_prior_preservation 
--prior_loss_weight=1.0 
--resolution=512 
--train_text_encoder 
--train_batch_size=1 
--max_train_steps=10 
--max_f_train_steps=3 
--max_adv_train_steps=6 
--checkpointing_iterations=10 
--learning_rate=5e-7 
--pgd_alpha=5e-3 
--pgd_eps=5e-2

Even after changing --max_train_steps to 10, the execution was still killed. I guess the execution is being killed because GPU memory is exhausted?
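One way to check that guess from another Colab cell, right after the process dies, is to look at the kernel log and memory usage; these are standard Linux/Colab commands, not part of the Anti-DreamBooth scripts:

# A bare "Killed" from bash is usually the Linux OOM killer reclaiming system RAM;
# a GPU out-of-memory condition would normally surface as a CUDA OOM exception instead.
!dmesg | grep -i -E "killed process|out of memory" | tail -n 5
!free -h       # system RAM usage
!nvidia-smi    # GPU memory usage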

Because noise-ckpt was never written, the program finally crashed with this message:

ValueError: Instance outputs/ASPL/n000050_ADVERSARIAL/noise-ckpt/50 images root doesn't exists.
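The ValueError is just a downstream symptom: a later step looks for the noise-ckpt images that the killed attack never wrote. A quick sanity check before re-running that step, using the path from the error message, is:

# Verify the perturbation stage actually produced its checkpoint images
!ls outputs/ASPL/n000050_ADVERSARIAL/noise-ckpt/50 \
  || echo "noise-ckpt/50 is missing: the ASPL attack did not finish"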

Questions

  1. How much GPU memory or system RAM is required during perturbation training?
  2. Do you have any suggestions, such as parameter changes, that would let the perturbation training complete?
hao-pt commented 1 year ago

Hi, it seems the error you encountered is indeed due to insufficient GPU memory.

  1. To train a model of this size, you would need a GPU with at least 32GB of memory. Our experiments are typically conducted using a 40GB A100 GPU.
  2. Alternatively, you can lower --sample_batch_size (default: 8) to a smaller value (e.g., 2 or 4). You can also use the 8-bit optimizer to train on a 16GB GPU; please refer to this tutorial. A rough sketch combining both is shown after this list.
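--sample_batch_size is the flag mentioned above; --use_8bit_adam is how the tutorial's diffusers DreamBooth script enables the 8-bit optimizer, so please check that aspl.py accepts the same flag before relying on it:

# 8-bit Adam needs bitsandbytes installed first
pip install bitsandbytes

python3 attacks/aspl.py \
  --pretrained_model_name_or_path=$MODEL_PATH \
  --instance_data_dir_for_train=$CLEAN_TRAIN_DIR \
  --instance_data_dir_for_adversarial=$CLEAN_ADV_DIR \
  --output_dir=$OUTPUT_DIR \
  --sample_batch_size=2 \
  --use_8bit_adam   # assumed flag; keep the remaining flags from attack_with_aspl.sh unchanged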
lbj96347 commented 1 year ago

@hao-pt thanks for your reply. Yes, the GPU was the issue. After switching to an 80GB A100 GPU, everything worked fine.