Implementation of Dreambooth (https://arxiv.org/abs/2208.12242) by way of Textual Inversion (https://arxiv.org/abs/2208.01618) for Stable Diffusion (https://arxiv.org/abs/2112.10752). Tweaks focused on training faces, objects, and styles.
MIT License
support freezing text_encoder layers for OpenCLIP #189
At least for SD2.1, freezing the first n layers of the text encoder (typically 17) lets it learn more effectively without catastrophic forgetting or destructive updates to the base layers.
This helps the model retain concepts that aren't part of the training data but are co-located in the tensor subgroups being trained.
It can be done like so:
# `text_encoder` is the text encoder being fine-tuned (for SD2.1, the OpenCLIP-based model).
# Parameter names look like "text_model.encoder.layers.<N>.self_attn.q_proj.weight".
first_frozen_layer = 0
last_frozen_layer = 0
total_count = 0
for name, param in text_encoder.named_parameters():
    total_count += 1
    pieces = name.split(".")
    # Skip anything that isn't a transformer block (embeddings, final layer norm, etc.).
    if pieces[1] != "encoder" or pieces[2] != "layers":
        print(f"Ignoring non-encoder layer: {name}")
        continue
    print(f"Pieces: {pieces}")
    current_layer = int(pieces[3])
    # Freeze layers in [first_frozen_layer, 21); choose whatever range you like to freeze here.
    if first_frozen_layer <= current_layer < 21:
        last_frozen_layer = current_layer
        if hasattr(param, "requires_grad"):
            param.requires_grad = False
            print(f"Froze layer: {name}")
        else:
            print(f"Ignoring layer that does not mark as gradient capable: {name}")
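After freezing, only the parameters that remain trainable should be handed to the optimizer. Below is a minimal sketch, assuming the diffusers/transformers layout where the SD2.1 text encoder loads as a Hugging Face CLIPTextModel; the model id, learning rate, and print-out are illustrative placeholders, not part of this repo's training script:

import torch
from transformers import CLIPTextModel

# Illustrative: load the SD2.1 text encoder from the diffusers-format checkpoint.
text_encoder = CLIPTextModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="text_encoder"
)

# ... run the freezing loop above on `text_encoder` ...

# Build the optimizer over the parameters that are still trainable,
# so the frozen early layers keep their pretrained weights.
trainable_params = [p for p in text_encoder.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-6)

frozen_count = sum(1 for p in text_encoder.parameters() if not p.requires_grad)
print(f"Frozen tensors: {frozen_count}, trainable tensors: {len(trainable_params)}")

The exact layer cutoff (17, 21, or otherwise) is a tuning choice; the point is simply that only unfrozen parameters ever reach the optimizer.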