Thank you for open-sourcing this work. I verified that the released models do achieve the performance claimed in the paper. However, I have serious doubts about whether the text truly helps with compression, for two main reasons:
First, random cropping during training can cause the text and the image content to diverge. For example, given an image captioned "a boat in the water next to a pier," a random crop may exclude the boat region entirely, so the network receives text describing content it never sees; with typical crop sizes this is likely quite common. A minimal sketch of this mismatch follows.
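Here is a small, self-contained illustration, assuming a standard torchvision-style pipeline; the image size, crop size, and filename are illustrative, not taken from your repo:

```python
# Sketch of the caption/crop mismatch: the caption is fixed per image,
# but the crop location is drawn independently, so the described object
# can fall outside the patch the network actually sees.
from PIL import Image
from torchvision import transforms

caption = "a boat in the water next to a pier"   # full-image caption
image = Image.new("RGB", (768, 512))             # stand-in for a training image

crop = transforms.RandomCrop(256)                # illustrative patch size
patch = crop(image)                              # crop location is random...
# ...but the caption above is unchanged, so if the boat occupies, say,
# the left third of the image, many sampled patches contain no boat
# while the conditioning text still says "a boat".
print(caption, patch.size)
```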
Second, I ran your released weights with the input text for every image fixed to "NA", and the results showed no obvious difference. This suggests that the text does not aid encoding; rather, the loss functions and the extra capacity introduced by the text adapter are what actually benefit encoding. A sketch of the ablation follows this paragraph.
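For reproducibility, here is a hedged sketch of the comparison I ran. Everything repo-specific is stubbed out: `compress` is a placeholder that returns dummy numbers and stands in for your actual encode/decode call, and the filename and caption are illustrative:

```python
# Ablation sketch: encode each image once with its real caption and once
# with the constant string "NA", then compare rate and distortion.
from typing import Tuple

def compress(image_path: str, caption: str) -> Tuple[float, float]:
    # Placeholder for the repo's actual inference call: encode `image_path`
    # conditioned on `caption` and return (bits-per-pixel, PSNR in dB).
    return 0.25, 32.0  # dummy values, not real measurements

eval_set = [("kodim01.png", "a boat in the water next to a pier")]  # illustrative

for path, caption in eval_set:
    bpp_text, psnr_text = compress(path, caption)  # true caption
    bpp_na, psnr_na = compress(path, "NA")         # caption fixed to "NA"
    print(f'{path}: caption {bpp_text:.3f} bpp / {psnr_text:.2f} dB '
          f'vs "NA" {bpp_na:.3f} bpp / {psnr_na:.2f} dB')
```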
For the reasons above, I believe this paper does not deliver the claimed text-guided encoding, which may mislead other readers.