ZhexinLiang / CLIP-LIT

[ICCV 2023, Oral] Iterative Prompt Learning for Unsupervised Backlit Image Enhancement
https://zhexinliang.github.io/CLIP_LIT_page/

Questions about the training mechanism and loss design #18

Open JingyiXu404 opened 7 months ago

JingyiXu404 commented 7 months ago

Hi,

Very interesting work! I have a few questions.

  1. The similarity of image features is constrained by L_identity. I wonder why CLIP image features are used instead of the backlit and enhanced images themselves? Is it because this gives better results? (See the first sketch after this list.)
  2. In your paper, the prompt pair and the enhancement network are trained alternately. When studying the code, I was a little confused: could you tell me how many iterations each step is trained for? Also, why is an alternating training strategy used instead of end-to-end training (for example, Loss = L_prompt + L_enhance)? (See the second sketch.)
  3. In prompt initialization, a cross-entropy loss is used with the designed labels, but in the prompt fine-tuning stage the loss is changed. Could you tell me why? (The third sketch shows my reading of the initialization loss.)
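
To make question 1 concrete, here is a minimal sketch of the two options I have in mind, using OpenAI's clip package and the final image embedding (I realize the paper may actually compare intermediate encoder feature maps with per-layer weights, but the question is the same):

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def identity_loss_features(backlit, enhanced):
    # Feature-space version: compare CLIP image embeddings.
    # Both inputs are assumed already resized/normalized for CLIP.
    f_in = model.encode_image(backlit)
    f_out = model.encode_image(enhanced)
    return F.mse_loss(f_out, f_in)

def identity_loss_pixels(backlit, enhanced):
    # Pixel-space alternative I am asking about: compare the images directly.
    return F.mse_loss(enhanced, backlit)
```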
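
For question 2, this is the alternating schedule I think I see, written with stand-in modules and losses just so the structure is runnable (prompt_learner, enhance_net, the two loss functions, and the iteration counts are all placeholders, not the repo's actual names or values):

```python
import torch
import torch.nn as nn

prompt_learner = nn.Embedding(2 * 16, 512)   # stand-in learnable prompt pair
enhance_net = nn.Conv2d(3, 3, 3, padding=1)  # stand-in enhancement network
opt_prompt = torch.optim.Adam(prompt_learner.parameters(), lr=1e-4)
opt_enh = torch.optim.Adam(enhance_net.parameters(), lr=1e-4)

def prompt_loss():   # placeholder for L_prompt
    return prompt_learner.weight.pow(2).mean()

def enhance_loss():  # placeholder for L_enhance
    return enhance_net(torch.rand(1, 3, 64, 64)).abs().mean()

PROMPT_ITERS, ENHANCE_ITERS = 100, 100  # <- these counts are what I am asking about

for _ in range(3):                      # alternate between the two updates
    for _ in range(PROMPT_ITERS):       # step A: update only the prompts
        opt_prompt.zero_grad()
        prompt_loss().backward()
        opt_prompt.step()
    for _ in range(ENHANCE_ITERS):      # step B: update only the enhancer
        opt_enh.zero_grad()
        enhance_loss().backward()
        opt_enh.step()
```

The end-to-end variant I asked about would instead compute loss = prompt_loss() + enhance_loss() and update both modules with a single optimizer step.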
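
And for question 3, this is my reading of the initialization objective: a binary cross-entropy over the cosine similarities between the image embedding and the two prompt embeddings (the 100.0 logit scale and the 0/1 label convention are my assumptions):

```python
import torch
import torch.nn.functional as F

def prompt_init_loss(image_feat, prompt_feats, label):
    # image_feat: (D,) CLIP embedding of one image.
    # prompt_feats: (2, D) embeddings of the (negative, positive) prompts.
    # label: 0 for a backlit image, 1 for a well-lit one.
    image_feat = F.normalize(image_feat, dim=-1)
    prompt_feats = F.normalize(prompt_feats, dim=-1)
    logits = 100.0 * image_feat @ prompt_feats.t()  # CLIP-style scaled cosine
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([label]))

# Example with random features:
loss = prompt_init_loss(torch.randn(512), torch.randn(2, 512), label=0)
```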

Thank you so much.

WangYuHzzz commented 5 months ago

I have similar questions. Do you have an answer? Thank you!