Open RuiTianHIT opened 3 months ago
@RuiTianHIT, the main idea of EVP was to address the fact that the UNet requires a text input, while the depth datasets provide no text. So I don't know of any other way to solve this problem, other than generating the embeddings from text or directly from the image (using the CLIP image encoder)...
Dear author, we are very interested in your excellent work. We want to explore how well the model estimates depth when no text information is introduced. However, when we forced self.conditioning_key to be None, an error occurred. Do you have a good solution for this? Thank you very much; I look forward to your reply!