Closed: fcakyon closed this issue 2 years ago.
@fcakyon thanks for your feedback!
So, we don't use transformers.CLIPFeatureExtractor because it isn't a differentiable operation (it uses PIL.Image transforms for resizing, etc.). As you can see from the VQGAN-CLIP sample, we backpropagate all the way to the input image. To do that, we use our own differentiable implementation of the mapping from the input image to a valid input tensor for the visual feature extractor.
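Roughly, the idea is to keep every preprocessing step as a tensor operation so the computation graph is never broken. Here is a minimal sketch of such a differentiable mapping, assuming an input image tensor in [-1, 1] (the function name and the exact rescaling are illustrative, not the repo's code; the mean/std are the standard CLIP normalization constants):

```python
import torch
import torch.nn.functional as F

# Standard normalization constants used by OpenAI CLIP's preprocessing.
CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

def differentiable_clip_preprocess(img: torch.Tensor, size: int = 224) -> torch.Tensor:
    """Map a (B, 3, H, W) image tensor in [-1, 1] to a valid CLIP input without leaving the graph."""
    img = (img + 1.0) / 2.0                                  # rescale to [0, 1] with tensor ops only
    img = F.interpolate(img, size=(size, size),              # differentiable resize instead of PIL
                        mode="bicubic", align_corners=False)
    return (img - CLIP_MEAN.to(img)) / CLIP_STD.to(img)      # CLIP-style normalization

# Usage: gradients flow from the CLIP input back to the image tensor,
# which is exactly what PIL-based preprocessing cannot provide.
image = torch.zeros(1, 3, 256, 256, requires_grad=True)
differentiable_clip_preprocess(image).sum().backward()
print(image.grad is not None)  # True
```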
@bes-dev makes perfect sense! Then this implementation is necessary for the vqgan example and not essential for the cliprcnn example, right?
@fcakyon yes, in ClipRCNN we use the CLIP guided loss only at inference time for ranking, without a backward pass.
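Since that path never needs gradients, the standard Hugging Face preprocessing is fine there. A small illustrative sketch of ranking candidate crops with CLIP under `torch.no_grad()` (the checkpoint name, file names, and prompt are placeholders, not ClipRCNN's actual code):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

crops = [Image.open("crop_0.png"), Image.open("crop_1.png")]  # hypothetical region crops
with torch.no_grad():  # inference only, so PIL-based preprocessing is acceptable here
    inputs = processor(text=["a photo of a cat"], images=crops,
                       return_tensors="pt", padding=True)
    outputs = model(**inputs)
    scores = outputs.logits_per_text[0]        # similarity of the query to each crop
    ranking = scores.argsort(descending=True)  # best-matching crop first
```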
@bes-dev thanks for the awesome work! I have one question: why do you manually map the image pixels to the [-1, 1] range instead of directly using transformers.CLIPFeatureExtractor (https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPFeatureExtractor)?