Closed ucasyjz closed 3 months ago
Thanks for your interest! We have not yet explored whether the tuned CLIP's text encoder captures textual details more effectively, but since DIVA is purely image-driven, we speculate that the tuned CLIP's text encoder has not substantially enhanced its ability to capture textual details. This work is just the beginning in this direction, and we look forward to more scholars engaging in research in this field!
In fact, interestingly, while preparing to upgrade DIVA, we discovered that even though the CLIP text encoder's weights were unfrozen during generative tuning, the text encoder did not actually receive updates under our approach. Consequently, to ensure rigor, we refined the corresponding illustrations in our paper.
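To verify a claim like this (an unfrozen module whose weights nonetheless did not change after tuning), one can snapshot the module's parameters before training and diff them afterwards. Below is a minimal, framework-agnostic sketch of that check; the helper name `changed_params` and the toy state dicts are hypothetical illustrations, not DIVA's actual code. With PyTorch you would capture `{k: v.clone() for k, v in model.text_encoder.state_dict().items()}` before training and compare with `torch.equal` afterwards.

```python
# Sketch: detect which parameters changed between two snapshots.
# `before`/`after` map parameter name -> flat list of values
# (stand-ins for real tensors in a state_dict).
import math

def changed_params(before, after, tol=1e-8):
    """Return names of parameters whose values differ beyond `tol`."""
    changed = []
    for name, old_vals in before.items():
        new_vals = after[name]
        if any(not math.isclose(a, b, abs_tol=tol)
               for a, b in zip(old_vals, new_vals)):
            changed.append(name)
    return changed

# Toy example: the visual projection moved, the text encoder did not.
before = {"text_encoder.layer0.weight": [0.1, 0.2], "visual.proj": [0.5, 0.5]}
after  = {"text_encoder.layer0.weight": [0.1, 0.2], "visual.proj": [0.7, 0.4]}
print(changed_params(before, after))  # -> ['visual.proj']
```

An empty result for all `text_encoder.*` entries would confirm that, despite being unfrozen, the text encoder received no effective updates.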
Very happy to see such an interesting work! I would like to ask whether the tuned CLIP's text encoder becomes stronger at encoding textual details such as orientation, color, and structure.