How to avoid introducing text information in denoising UNet?

Lavreniuk / EVP

[ECCV 2024] EVP model for metric depth estimation from a single image and referring segmentation

https://lavreniuk.github.io/EVP/

MIT License


How to avoid introducing text information in denoising UNet? #17

Open RuiTianHIT opened 3 months ago

RuiTianHIT commented 3 months ago

Dear author! We are interested in your excellent work and want to explore how well depth estimation performs when the model receives no text information. However, when we forced self.conditioning_key to None, an error occurred. Do you have a suggested solution? Thank you very much, I look forward to your reply!

Lavreniuk commented 3 months ago

@RuiTianHIT, the main idea of EVP was to address the fact that the UNet requires a text input, while depth datasets provide no text. I don't know of any other way to solve this, other than generating the text embeddings from text or directly from the image (using the CLIP image encoder)...
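For readers who want to try the image-derived route mentioned above, here is a minimal PyTorch sketch. It rests on an assumption that is not confirmed in this thread: that the UNet's cross-attention layers project their context with a fixed input width, so that removing the context makes the layer fall back to the feature map itself and fail on a shape mismatch. That may or may not be the error reported in the question. Everything here (`ToyCrossAttention`, `ToyImageConditioner`, the toy sizes) is hypothetical and is not EVP's or CLIP's actual code; random features stand in for the output of a real image encoder.

```python
import torch
import torch.nn as nn


class ToyCrossAttention(nn.Module):
    """Single-head cross-attention (hypothetical toy, not EVP's layer)."""

    def __init__(self, query_dim: int, context_dim: int, inner_dim: int = 64):
        super().__init__()
        self.scale = inner_dim ** -0.5
        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
        # Keys and values are projected from the context, so their input
        # width is fixed to context_dim when the layer is built.
        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_out = nn.Linear(inner_dim, query_dim)

    def forward(self, x, context=None):
        # Assumed behaviour: with no context, attend over x itself.
        context = x if context is None else context
        q = self.to_q(x)
        k = self.to_k(context)
        v = self.to_v(context)
        attn = torch.softmax(q @ k.transpose(-1, -2) * self.scale, dim=-1)
        return self.to_out(attn @ v)


class ToyImageConditioner(nn.Module):
    """Maps pooled image features to a few context tokens (hypothetical)."""

    def __init__(self, image_feat_dim: int, context_dim: int, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.context_dim = context_dim
        self.proj = nn.Linear(image_feat_dim, num_tokens * context_dim)

    def forward(self, image_feats):
        batch = image_feats.shape[0]
        tokens = self.proj(image_feats)
        return tokens.view(batch, self.num_tokens, self.context_dim)


torch.manual_seed(0)
QUERY_DIM, CONTEXT_DIM, IMAGE_FEAT_DIM = 32, 48, 16  # toy sizes, not EVP's

attn = ToyCrossAttention(QUERY_DIM, CONTEXT_DIM)
x = torch.randn(2, 10, QUERY_DIM)  # (batch, spatial tokens, channels)

# Case 1: no context. x (width 32) reaches to_k, which expects width 48.
try:
    attn(x)
    no_context_failed = False
except RuntimeError:
    no_context_failed = True

# Case 2: context built from image features instead of text.
image_feats = torch.randn(2, IMAGE_FEAT_DIM)  # stand-in for encoder output
context = ToyImageConditioner(IMAGE_FEAT_DIM, CONTEXT_DIM)(image_feats)
out = attn(x, context)
```

In this toy setup the layer never sees text: it only needs a tensor of shape (batch, tokens, context_dim), so any encoder that produces embeddings of that width could in principle supply the conditioning. Whether a pretrained UNet behaves well with such embeddings is a separate question that the sketch does not answer.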