askerlee / AdaFace-dev

A Versatile Face Encoder for Zero-Shot Diffusion Model Personalization
MIT License

Alternative idea #1

Open bonlime opened 1 year ago

bonlime commented 1 year ago

Hey, I stumbled upon your repo, and judging from your commits you're working on layerwise textual inversion plus LoRA to compress the embedding size. I've been thinking about the same things and could offer you an alternative idea, if you're interested in pursuing it. Or we could at least try discussing it.

The problems with the approach you're pursuing are as follows:

  1. All the interaction happens inside the text encoder, so you have to backprop through it to optimize the embeddings, which makes optimization harder (the gradient path is longer).

  2. You end up with a separate embedding for every layer, which means you have to perform 12 forward passes through the text encoder at inference time, which isn't that efficient (a rough sketch of this cost is below).
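Here is a minimal sketch of the inference cost in point 2, assuming a layerwise scheme in which each UNet cross-attention layer gets its own learned token embedding; `encode_prompt_with_token_embedding` is a hypothetical helper, not an API of this repo or of diffusers:

```python
# Hypothetical illustration of the "12 forward passes" concern: with one
# learned token embedding per UNet layer, the text encoder must be run once
# per layer to produce that layer's prompt embedding.
def encode_per_layer(text_encoder, input_ids, layer_token_embeddings):
    prompt_embs_per_layer = []
    for token_emb in layer_token_embeddings:          # 12 learned embeddings, one per layer
        prompt_embs = encode_prompt_with_token_embedding(
            text_encoder, input_ids, token_emb)       # hypothetical helper
        prompt_embs_per_layer.append(prompt_embs)
    return prompt_embs_per_layer                      # 12 tensors of shape [77, 768]
```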

What I've been thinking of is training a single text embedding (with multiple vectors per token, of course) plus LoRA weights for the K, V attention projections, applied only to your trained token (12 sets of LoRA weights, one for each layer's input). This is different from how LoRA is implemented here: https://github.com/cloneofsimo/lora, because you don't train the attention for all tokens, only for yours, which means no prior preservation is needed, and that simplifies optimization. Also, since the thing we need to optimize sits at the start of the UNet, in theory it should be easier to optimize than the embedding.
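A hedged sketch of what this per-token LoRA on the K/V inputs could look like for one layer; the module name, rank, and `added_tokens_mask` are assumptions for illustration, not code from this repo:

```python
import torch
import torch.nn as nn

class PerTokenLoRAKV(nn.Module):
    """Low-rank K/V delta applied only at the positions of the newly added
    token(s); all other tokens keep their original embedding."""
    def __init__(self, emb_dim=768, rank=4):
        super().__init__()
        self.down_k = nn.Linear(emb_dim, rank, bias=False)
        self.up_k = nn.Linear(rank, emb_dim, bias=False)
        self.down_v = nn.Linear(emb_dim, rank, bias=False)
        self.up_v = nn.Linear(rank, emb_dim, bias=False)
        nn.init.zeros_(self.up_k.weight)  # start as a no-op
        nn.init.zeros_(self.up_v.weight)

    def forward(self, text_embs, added_tokens_mask):
        # text_embs: [n_tokens, emb_dim]; added_tokens_mask: [n_tokens] bool
        mask = added_tokens_mask.unsqueeze(-1)  # [n_tokens, 1]
        k_embs = torch.where(mask, text_embs + self.up_k(self.down_k(text_embs)), text_embs)
        v_embs = torch.where(mask, text_embs + self.up_v(self.down_v(text_embs)), text_embs)
        return k_embs, v_embs  # fed to that layer's K and V projections respectively
```

One such module per cross-attention layer would give the 12 sets of LoRA weights mentioned above.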

Let me know what you think. If you want to continue this discussion privately, you can email me at bonlimezak at gmail com

askerlee commented 1 year ago

Thank you bonlime! I'm happy that someone noticed my repo so quickly. First off, your idea sounds good and probably works. It would make for interesting work if implemented.

But I have two concerns about the line of methods that require tuning (part of) the UNet weights:

  1. Deployment is a little bit tricky. Yes, you could ship a checkpoint containing only the updated params, but it takes a script to do the patching. Dreambooth also suffers from this issue; custom models are released one after another, which is quite inconvenient for users.

  2. If you are trying to deal with multiple custom concepts, can they be trained together, or do they need to be trained separately? Dreambooth can be trained on multiple concepts, but adding new concepts without disrupting old ones is not trivial.

On the other hand, my approach seems to avoid both issues:

  1. Deployment is very lightweight, and the embedding files are very small.

  2. Adding new concepts is easy and won't disrupt old concepts, since the model weights are not updated.

There is one disadvantage, though: the images it generates are still of worse quality than Dreambooth's. I'm working to improve that, and I believe the gap is not so wide now :)

Anyway, it's great to have this opportunity to discuss with you. Let's keep the discussion open in the future :D

bonlime commented 1 year ago
  1. If we only train the K, V attention matrices with LoRA, it comes out to ~3 MB per learned subject/object, which doesn't seem that bad (a rough size estimate is sketched after this list). Applying them would require a single pass through the text encoder, plus passing the new tokens through the new weights, which gives us 12 vectors to put into the UNet. It's pretty straightforward to patch any implementation to take list[Tensor] instead of Tensor for the text embedding, so it doesn't seem too bad; definitely better than shipping a 3.5 GB checkpoint.

  2. I know it's possible to train on multiple subjects at once with DreamBooth, but I'm interested in combining subjects that were trained separately.
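As a rough size check on the ~3 MB figure in point 1: the numbers below (rank, layer count, fp32 storage) are my assumptions, not from this thread, but they land in the same ballpark:

```python
EMB_DIM = 768    # CLIP text embedding dim in SD 1.x
N_LAYERS = 12    # cross-attention layers, as assumed in this thread
RANK = 16        # assumed LoRA rank
BYTES = 4        # fp32

# Each adapted matrix (K or V) stores a down [768, RANK] and an up [RANK, 768] factor.
params_per_matrix = 2 * RANK * EMB_DIM
lora_params = N_LAYERS * 2 * params_per_matrix    # K and V per layer
embedding_params = 6 * EMB_DIM                     # 6 vectors per subject, as mentioned below

size_mb = (lora_params + embedding_params) * BYTES / 1e6
print(f"{size_mb:.2f} MB")  # ~2.4 MB, roughly the "~3 MB per subject" quoted above
```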

The reason I'm leaning more towards single embedding -> K, V training is that, in my experiments, TI can be very, very good, reaching 70-80% of DreamBooth quality. The research is private, but I can share an image with results of pure TI with 6 vectors per subject; rows are different prompts, columns are different seeds. You can see how the identity is preserved across the images. It only needs to be slightly better to be of production quality, which is why I'm researching different ways to improve the results.

[image: test_image-grid-polina — grid of pure-TI results; rows are prompts, columns are seeds]

P.S. I would highly recommend building on top of 🤗 diffusers, because it's much easier to extend than the original implementation.

askerlee commented 1 year ago

Thanks for sharing! This looks impressive. Btw I'm pretty new to the generative AI domain, and don't know how to evaluate different methods. Could you share the training images of a subject, if they are not sensitive? Or are you aware of any standard evaluation benchmark? Thank you so much.

bonlime commented 1 year ago

Evaluation is hard 😢 Right now it's based only on my personal judgment, without any metrics. Potentially, using a face recognition model to compare "likeness" could work, but I haven't implemented that.
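A hedged sketch of how that face-recognition check could look; `embed_face` is a hypothetical placeholder for any pretrained face-embedding model (e.g. an ArcFace-style network), not an actual API:

```python
import torch
import torch.nn.functional as F

def embed_face(image) -> torch.Tensor:
    """Hypothetical: detect/crop the face and return an identity embedding."""
    raise NotImplementedError

def identity_similarity(reference_images, generated_images) -> float:
    """Mean cosine similarity between the reference identity and the generations."""
    ref = torch.stack([embed_face(img) for img in reference_images]).mean(dim=0)
    gen = torch.stack([embed_face(img) for img in generated_images])
    return F.cosine_similarity(gen, ref.unsqueeze(0), dim=-1).mean().item()
```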

I can't share the images themselves, but I can give you a screenshot of them to compare against the results.

[image: screenshot of the training images]
askerlee commented 1 year ago

Haha I see. Is this lady your girlfriend? I'm also using my wife's photos for evaluation. Yes using face recognition would be great. Maybe someone has trained models for celebrities (but it's a bit difficult to make sure they don't appear in SD's training data).

askerlee commented 1 year ago

@bonlime I came across this paper today: https://twitter.com/_akhaliq/status/1601030086148464640 (https://arxiv.org/abs/2212.04488). It seems very similar to your idea?

bonlime commented 1 year ago

@askerlee It's still different. Their main problem is that they fine-tune the whole K, V matrices in the UNet, which affects all tokens; in the paper they show the effect of "forgetting" previous concepts when they don't use an extra loss. In my idea you only tune the projections for your new words, which shouldn't affect generation quality for other concepts.

askerlee commented 1 year ago

Do you mean the K, V matrices in the text encoder? But aren't they shared across all tokens? Does your model have to switch back to the old K, V for other tokens?

bonlime commented 1 year ago

No, I mean the K, V matrices in the UNet. You have cross-attention layers, where "cross" means that you mix image information and text information. The way it's implemented in SD is that the image features go through the Q projection, and the text-encoder embeddings go through the K and V projections.

You are right that these projections are shared by default. In my idea you could do something like this (pseudo-code):

# text_embs: Tensor of shape [n_tokens, emb_dim], the text-encoder output (in SD it's [77, 768])
# new_k, new_v: Tensors of shape [emb_dim, emb_dim], learned projections for the added token(s)
# added_tokens_mask: bool Tensor of shape [n_tokens], True at the added token positions
mask = added_tokens_mask.unsqueeze(-1)                   # [77, 1], broadcasts over emb_dim
text_embs_k = text_embs.where(~mask, text_embs @ new_k)  # [77, 768] @ [768, 768] -> [77, 768]
text_embs_v = text_embs.where(~mask, text_embs @ new_v)  # ordinary tokens keep their original rows

You would need to do this for each layer and pass the results to the UNet. So instead of passing a single text embedding, the model now takes 24 text embeddings (given that the SD UNet has 12 cross-attention layers and we need K and V for each).

This could be implemented as an extra module between the text encoder and the UNet, and it only requires a slight modification of the UNet to support passing a different embedding to each stage.
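A hedged sketch of such an adapter module; the layer count, names, and identity initialization are my assumptions for illustration, not an implementation from this repo or diffusers:

```python
import torch
import torch.nn as nn

class PerLayerKVAdapter(nn.Module):
    """Turns one text embedding into per-layer (K, V) embeddings, modifying
    only the rows of the newly added token(s)."""
    def __init__(self, emb_dim=768, n_layers=12):
        super().__init__()
        # one learned (new_k, new_v) matrix pair per cross-attention layer,
        # initialized to identity so training starts as a no-op
        self.new_k = nn.Parameter(torch.eye(emb_dim).repeat(n_layers, 1, 1))
        self.new_v = nn.Parameter(torch.eye(emb_dim).repeat(n_layers, 1, 1))

    def forward(self, text_embs, added_tokens_mask):
        # text_embs: [77, 768]; added_tokens_mask: [77] bool
        mask = added_tokens_mask.unsqueeze(-1)
        per_layer = []
        for k_w, v_w in zip(self.new_k, self.new_v):
            k_embs = torch.where(mask, text_embs @ k_w, text_embs)
            v_embs = torch.where(mask, text_embs @ v_w, text_embs)
            per_layer.append((k_embs, v_embs))
        return per_layer  # 12 (K, V) pairs -> the 24 embeddings mentioned above
```

Cross-attention layer i would then read per_layer[i][0] for its K projection and per_layer[i][1] for its V projection instead of the shared text embedding.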