Closed ucasyjz closed 3 months ago
Thanks for your interest! We have not yet explored whether the tuned CLIP's text encoder captures textual details more effectively, but since DIVA is purely image-driven, we speculate that the tuned CLIP's text encoder has not substantially enhanced its ability to capture textual details. This work is just the beginning in this direction, and we look forward to more scholars engaging in research in this field!
In fact, interestingly, while preparing to upgrade DIVA, we discovered that even though the CLIP text encoder's weights were unfrozen during generative tuning, the text encoder did not actually receive updates under our approach. Consequently, to ensure rigor, we refined the corresponding illustrations in our paper.
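To verify a claim like this (an unfrozen module whose weights nonetheless did not change after tuning), one can snapshot the module's parameters before training and diff them afterwards. Below is a minimal, framework-agnostic sketch of that check; the helper name `changed_params` and the toy state dicts are hypothetical illustrations, not DIVA's actual code. With PyTorch you would capture `{k: v.clone() for k, v in model.text_encoder.state_dict().items()}` before training and compare with `torch.equal` afterwards.

```python
# Sketch: detect which parameters changed between two snapshots.
# `before`/`after` map parameter name -> flat list of values
# (stand-ins for real tensors in a state_dict).
import math

def changed_params(before, after, tol=1e-8):
    """Return names of parameters whose values differ beyond `tol`."""
    changed = []
    for name, old_vals in before.items():
        new_vals = after[name]
        if any(not math.isclose(a, b, abs_tol=tol)
               for a, b in zip(old_vals, new_vals)):
            changed.append(name)
    return changed

# Toy example: the visual projection moved, the text encoder did not.
before = {"text_encoder.layer0.weight": [0.1, 0.2], "visual.proj": [0.5, 0.5]}
after  = {"text_encoder.layer0.weight": [0.1, 0.2], "visual.proj": [0.7, 0.4]}
print(changed_params(before, after))  # -> ['visual.proj']
```

An empty result for all `text_encoder.*` entries would confirm that, despite being unfrozen, the text encoder received no effective updates.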
Very happy to see such an interesting work! I would like to ask whether the tuned CLIP's text encoder becomes stronger at encoding textual details such as orientation, color, and structure.