ZhexinLiang / CLIP-LIT

[ICCV 2023, Oral] Iterative Prompt Learning for Unsupervised Backlit Image Enhancement
https://zhexinliang.github.io/CLIP_LIT_page/

Question about the prompt tuning #15

Open TomTomTommi opened 10 months ago

TomTomTommi commented 10 months ago

Hi,

Very interesting work! In your code, I wonder why the text encoder takes the composition of learnable and non-learnable embeddings of [X, X, X...] as the input. Is this the conventional setting for CLIP prompt tuning in low-level vision tasks? Besides, I would like to ask why the exact category (i.e., positive/negative prompt) is not part of the initialized prompts. It seems that we only use the positive/negative labels when computing the cross-entropy loss.

Thank you so much.

ZhexinLiang commented 10 months ago

Hi @TomTomTommi, thanks for your interest in our work.

  1. All the prompts are learnable in our settings. X means initializing with a meaningless word.

  2. As shown in Section A.2 of our supplementary material, using certain words/sentences for initialization can accelerate convergence but does not influence the final performance.

  3. Why not add and freeze part of the text prompts as a category label during training? Because it is hard to find an accurate word/sentence to describe a very complex illuminance distribution, we learn all the prompts during training.
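The setup described above can be sketched as follows. This is an illustrative toy, not the repo's exact code: the `Prompts` name mirrors `train.py`, but the dimensions and the stand-in text encoder are made up. It shows the two points from the answer: every prompt token embedding is a learnable parameter (the "X" initialization is just a meaningless starting point), and the positive/negative labels enter only through the cross-entropy loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 32   # toy value; CLIP ViT-B/32 uses 512
PROMPT_LEN = 16  # plays the role of config.length_prompt

class Prompts(nn.Module):
    """Two fully learnable prompts: index 0 = backlit, index 1 = well-lit."""
    def __init__(self, n_prompts=2, length=PROMPT_LEN, dim=EMBED_DIM):
        super().__init__()
        # Every token embedding is learnable; no frozen category word.
        self.embedding = nn.Parameter(torch.randn(n_prompts, length, dim) * 0.02)

    def forward(self, text_encoder, image_features):
        # Encode each prompt into a single text feature, then score
        # each image feature against both prompts.
        text_features = F.normalize(text_encoder(self.embedding), dim=-1)
        image_features = F.normalize(image_features, dim=-1)
        return image_features @ text_features.t()  # (B, 2) logits

# Toy stand-in for CLIP's transformer text encoder: mean-pool + linear.
proj = nn.Linear(EMBED_DIM, EMBED_DIM)
def text_encoder(tokens):
    return proj(tokens.mean(dim=1))

prompts = Prompts()
images = torch.randn(4, EMBED_DIM)       # pretend image features
logits = prompts(text_encoder, images)
# The 0/1 labels (backlit/well-lit) appear ONLY here, in the loss --
# exactly as the question observes; they are not part of the prompt text.
labels = torch.tensor([0, 0, 1, 1])
loss = F.cross_entropy(logits, labels)
loss.backward()                          # gradients reach the prompt embeddings
```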

Feel free to discuss with me if you have any other questions.

TomTomTommi commented 10 months ago

Thanks so much for your response. That helps me a lot.

Just to double-check the detailed setting. For Figure 15, according to Eq. 2 and Eq. 3, should the code in train.py be modified as follows?

line 141:

```python
# current
learn_prompt = Prompts([" ".join(["X"]*(config.length_prompt)),
                        " ".join(["X"]*(config.length_prompt))]).cuda()
# proposed
learn_prompt = Prompts(["XXX ... X backlit", "XXX ... X well-lit"]).cuda()
```

line 92:

```python
# current
tokenized_prompts = torch.cat([clip.tokenize(p) for p in [" ".join(["X"]*config.length_prompt)]])
# proposed
tokenized_prompts = torch.cat([clip.tokenize("XXX ... X well-lit")])
```

line 220:

```python
# current
tokenized_prompts = torch.cat([clip.tokenize(p) for p in [" ".join(["X"]*config.length_prompt)]])
# proposed
tokenized_prompts = torch.cat([clip.tokenize("XXX ... X backlit")])
```

Is it correct?

ZhexinLiang commented 10 months ago

Hi @TomTomTommi, glad to see my answer was helpful.

If you want to change the initialization from [X X X ... X X] to [X X X ... X backlit/well-lit], you only need to modify line 141:

```python
# current
learn_prompt = Prompts([" ".join(["X"]*(config.length_prompt)),
                        " ".join(["X"]*(config.length_prompt))]).cuda()
# changed
learn_prompt = Prompts([" ".join(["X"]*(config.length_prompt-1)) + " backlit",
                        " ".join(["X"]*(config.length_prompt-1)) + " well-lit"]).cuda()
```

Lines 92 and 220 can remain unchanged, because they are only used by tokenized_prompts.argmax(dim=-1) in line 64 to get the representation at the EOT position. As long as the length of the prompt is unchanged, whether you change these two lines does not matter.
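The reason the argmax trick works can be shown in a few lines. This is a hedged sketch with made-up token ids except for the two real CLIP special tokens (SOT = 49406, EOT = 49407): because EOT has the largest id in CLIP's vocabulary, `argmax(dim=-1)` locates its position, and that position only depends on the prompt length.

```python
import torch

SOT, EOT = 49406, 49407  # CLIP's start/end-of-text token ids
# A tokenized prompt "X X X X X" (id 343 is illustrative), bracketed by
# SOT and EOT, then zero-padded to the context length.
tokenized = torch.tensor([[SOT, 343, 343, 343, 343, 343, EOT, 0, 0]])

seq_len, dim = tokenized.shape[1], 4
hidden = torch.randn(1, seq_len, dim)  # pretend transformer output

# EOT has the largest id, so argmax over the token ids finds its position.
eot_pos = tokenized.argmax(dim=-1)
text_feature = hidden[torch.arange(1), eot_pos]  # (1, dim) pooled feature

# Swapping the last "X" for "backlit"/"well-lit" keeps the prompt length
# the same, so eot_pos -- and therefore this lookup -- is unchanged.
```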

TomTomTommi commented 9 months ago

Thanks so much for your response.

Following the training command in the README, I got the training log below. Why do the CLIP loss and reconstruction loss oscillate constantly during the 10k-20k iterations, with the CLIP loss then suddenly dropping? Also, during the 25k-40k iterations, the prompt loss oscillates without converging. Is this the normal training process?

[screenshots of the training log]

ZhexinLiang commented 9 months ago

Hi @TomTomTommi,

Yes, you are right. I've noticed this phenomenon too. It sometimes occurs when training from scratch. I think this is because our network is relatively simple and has a hard time finding a convergent path at first. I tried to fix it, but obviously I failed (haha). Anyway, this won't influence the final performance if you train the network for 50k iterations or more.

I reran the experiments both from scratch and with the default settings using the two provided initializations of the prompt and enhancement models.

Here are the screenshots of the training logs for the default setting, training from scratch: [screenshots of the training log]

Here are the screenshots of the training logs for the default setting, training from the initialization models: [screenshots of the training log]

Based on my experiments, the performance of both retrained models at around 50k iterations is competitive with the results presented in our paper.

TomTomTommi commented 9 months ago

Thanks so much for the response.

I am also curious about how to get the results of Figure 2. Is there a script to get the score?
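As far as this thread shows, no dedicated script for the Figure 2 scores is referenced, but a CLIP-style score can be sketched as below. This is a hypothetical illustration with toy random features standing in for real CLIP encoder outputs: score an image feature against the learned backlit/well-lit prompt features and report the softmax probability on "well-lit" (the 100.0 factor mimics CLIP's learned logit scale).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins: in practice these would come from CLIP's image encoder
# and from the learned prompt pair passed through the text encoder.
image_feat = F.normalize(torch.randn(1, 512), dim=-1)
prompt_feats = F.normalize(torch.randn(2, 512), dim=-1)  # [backlit, well-lit]

# Cosine similarities scaled like CLIP's logits, then softmax over the
# two prompts gives a probability-style score per image.
logits = 100.0 * image_feat @ prompt_feats.t()  # (1, 2)
probs = logits.softmax(dim=-1)
well_lit_score = probs[0, 1].item()  # mass assigned to "well-lit"
```

A higher `well_lit_score` would indicate the image is closer to the well-lit prompt than to the backlit one.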