grads is None after two iterations of data

bimsarapathiraja commented 1 month ago

I ran the provided code with the same accelerator config and base_config files, with a few minor changes. I created a folder, added 100 images to it, and used it as the data directory.

The only settings I modified in the file are:

train_data_dir, placeholder_token_count, tokens_per_iter and enable_xformers_memory_efficient_attention

My base_config file is as follows:

save_steps = 100 # How often to save
pretrained_model = "runwayml/stable-diffusion-v1-5" # Pretrained model signature or path
tokenizer_name = None # If using a different tokenizer
train_data_dir = "/scratch/bpathir1/NoiseCLR/data/afhq/noiseclr" # Path to training data
placeholder_token_count = 10 # Number of directions to learn
initializer_tokens = "" # If provided initializing tokens from these
repeats = 1 # How many times to repeat the training data
output_dir = "outputs/directions" # Output directory
seed = 0 # For reproducible experiments
center_crop = False
train_batch_size = 6 # Per GPU
tokens_per_iter = 2
num_train_iters = 1000
gradient_accumulation_steps = 4
gradient_checkpointing = True
scale_lr = True
lr_scheduler = "constant" # ["linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup"]
lr_warmup_steps = 500
resolution = 512
adam_beta1 = 0.9
adam_beta2 = 0.999
adam_weight_decay = 1e-2
adam_epsilon = 1e-08
learning_rate = 1e-3
temperature = 0.5
logging_dir = "logs"
mixed_precision = "no"
resume_from_checkpoint = None
resume_dir = None
normalize_word = False
enable_xformers_memory_efficient_attention = False
validation_step = 100
center_crop = False
subtract_uncond = True
checkpointing_steps = 100
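For reference, the batch and learning-rate settings above interact. The sketch below assumes the scaling convention used in the diffusers example training scripts (learning rate multiplied by batch size, accumulation steps, and process count when scale_lr is set) and assumes a single process; whether NoiseCLR's train.py applies exactly this rule is not confirmed here. The function name is hypothetical.

```python
def effective_settings(train_batch_size, gradient_accumulation_steps,
                       learning_rate, scale_lr, num_processes=1):
    """Return (effective batch size, effective learning rate).

    Assumption: scale_lr multiplies the base learning rate by the
    effective batch size, as in the diffusers example scripts.
    """
    effective_batch = train_batch_size * gradient_accumulation_steps * num_processes
    effective_lr = learning_rate * effective_batch if scale_lr else learning_rate
    return effective_batch, effective_lr


# Values taken from the base_config above; num_processes=1 is an assumption.
batch, lr = effective_settings(
    train_batch_size=6,
    gradient_accumulation_steps=4,
    learning_rate=1e-3,
    scale_lr=True,
)
print(batch, lr)
```

Under these assumptions, one optimizer step sees 24 images and the optimizer runs with a learning rate 24 times the configured base value.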

I get the following error in the third iteration of the first epoch.

   grads.data[index_no_updates, :] = grads.data[index_no_updates, :].fill_(0)
                                      ^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'data'

What could be the reason, and how can I avoid this error?
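As general PyTorch background, a parameter's .grad is None before the first backward pass and again after the gradients are cleared with set_to_none=True (the default in recent PyTorch releases). The thread does not establish that this is the cause here. The sketch below only reproduces the condition on a hypothetical stand-in embedding and shows a guard around the row-zeroing step; zero_frozen_rows is an illustrative helper, not a function from the repository.

```python
import torch

# Hypothetical stand-in for the text encoder's input embedding matrix.
embedding = torch.nn.Embedding(8, 4)
optimizer = torch.optim.AdamW(embedding.parameters(), lr=1e-3)

# True for rows that must not be updated; only the last two rows are trainable.
index_no_updates = torch.ones(8, dtype=torch.bool)
index_no_updates[-2:] = False


def zero_frozen_rows(weight, index_no_updates):
    """Zero the gradient rows of frozen tokens.

    Returns False (and does nothing) when no gradient has been populated.
    """
    grads = weight.grad
    if grads is None:
        return False
    with torch.no_grad():
        grads[index_no_updates, :] = 0
    return True


# 1) Before any backward pass, .grad is None.
before_backward = zero_frozen_rows(embedding.weight, index_no_updates)

# 2) After backward, .grad exists and the frozen rows can be zeroed.
loss = embedding(torch.tensor([6, 7, 1])).sum()
loss.backward()
after_backward = zero_frozen_rows(embedding.weight, index_no_updates)
frozen_grad_sum = embedding.weight.grad[index_no_updates].abs().sum().item()

# 3) Clearing gradients with set_to_none=True makes .grad None again.
optimizer.zero_grad(set_to_none=True)
after_zero_grad = zero_frozen_rows(embedding.weight, index_no_updates)
```

A guard like this avoids the AttributeError but can also hide the underlying problem, so it is better suited to diagnosis than as a fix.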

yusufdalva commented 1 month ago

Hi,

Thank you for your interest in our work. Because of how the contrastive loss is computed, having enough positive and negative pairs is a high priority. Given your changes to the config file, I suggest increasing "tokens_per_iter = 2", which is low for the loss function we designed. For the contrastive loss to be effective, the positive and negative sums need enough pairs to pass meaningful gradients. That said, I will investigate the issue in more detail and provide a fix if something else turns out to be the cause.
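To give a rough sense of how the pair counts grow with tokens_per_iter, here is a small counting sketch. It assumes a simplified pairing scheme that is not taken from the repository: every image in the batch is edited with every sampled token, edits sharing a token on different images count as positives, and edits with different tokens count as negatives. The function name is hypothetical.

```python
from math import comb


def count_pairs(batch_size, tokens_per_iter):
    """Count (positive, negative) pairs under the assumed pairing scheme."""
    # Positives: same token applied to two different images.
    positives = tokens_per_iter * comb(batch_size, 2)
    # All unordered pairs among batch_size * tokens_per_iter edits.
    total = comb(batch_size * tokens_per_iter, 2)
    # Negatives: every remaining pair uses two different tokens.
    negatives = total - positives
    return positives, negatives


# train_batch_size = 6 comes from the config in this thread.
for k in (2, 10):
    print(k, count_pairs(batch_size=6, tokens_per_iter=k))
```

Under this assumed scheme, the number of negative pairs grows roughly quadratically with tokens_per_iter, which is consistent with the advice that a value of 2 leaves few pairs to contrast.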

Thanks, Yusuf

bimsarapathiraja commented 1 month ago

Hey, I ran with the exact same configs and still hit the same problem, so I installed all the libraries in your .yaml file with the exact same versions. Previously I had used one of my existing envs, which could run train.py without any problem except for the issue mentioned above.

I was able to successfully run the code without any problem using the env you have provided.
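Since matching the library versions resolved the error, a quick way to spot version drift in an existing env is to compare installed versions against the pinned ones. The sketch below uses only the standard library; find_mismatches and the pins passed to it are hypothetical, and the real pins should be copied from the repository's environment file.

```python
from importlib import metadata


def find_mismatches(pins):
    """Map package name -> (pinned, installed) for every package whose
    installed version differs from the pin. installed is None if the
    package is not installed at all."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


# Placeholder pins for illustration only; these are NOT the repository's versions.
example_pins = {"torch": "0.0.0", "diffusers": "0.0.0"}
for name, (wanted, installed) in find_mismatches(example_pins).items():
    print(f"{name}: pinned {wanted}, installed {installed}")
```

An empty result means every listed package matches its pin exactly.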

gemlab-vt / NoiseCLR

grads is None after two iterations of data #4