xiankgx opened 7 months ago
I also wonder what happens if you augment the labels during training, e.g., the text label for an AI-generated image could be randomly sampled from a pool of synonymous phrasings. Perhaps something like this would make use of the text modality a little more and boost performance? A minimal sketch of what I mean is below.
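To be concrete, here is a minimal sketch of that label-augmentation idea. The phrasings in `LABEL_POOLS` and the helper `sample_label` are made up for illustration; they are not the prompts used in LASTED:

```python
import random

# Hypothetical pools of synonymous phrasings per class; the exact wording
# here is invented for illustration, not taken from the LASTED paper.
LABEL_POOLS = {
    "real": [
        "a real photo",
        "an authentic photograph",
        "a picture taken by a camera",
    ],
    "ai": [
        "an AI-generated image",
        "a synthetic image",
        "an image created by a generative model",
    ],
}

def sample_label(class_name: str) -> str:
    """Draw a fresh phrasing for this class at every training step."""
    return random.choice(LABEL_POOLS[class_name])

# e.g., each training batch pairs an image with a freshly sampled caption:
caption = sample_label("ai")
```

The hope would be that varying the text seen for each class acts like data augmentation on the text side, so the model relies on the semantics of the label rather than memorizing four fixed sentences.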
It seems like you are using CLIP with 4 possible textual descriptions and then using cosine similarity for classification, just like CLIP. However, unlike CLIP, where the cardinality of the labels (i.e., the number of possible text sentences) is practically unlimited, at least during training, in LASTED it is only 4. I wonder how much of an uplift there is if we train the same CLIP image encoder both ways: LASTED versus something like just adding a classification head on top of the CLIP image encoder, trained with a standard multi-class categorical cross-entropy loss. A rough sketch of both setups follows.
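To make the comparison concrete, here is a rough sketch of the two setups using the Hugging Face CLIP implementation. The four prompts are placeholders, not LASTED's actual descriptions, and this is my guess at the baseline, not the repo's training code:

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder prompts; LASTED's actual four descriptions may differ.
PROMPTS = ["real photo", "synthetic photo", "real painting", "synthetic painting"]

# Setup 1: prompt-based classification via cosine similarity (CLIP/LASTED-style).
@torch.no_grad()
def classify_with_prompts(images):
    inputs = processor(text=PROMPTS, images=images, return_tensors="pt", padding=True)
    out = model(**inputs)
    # logits_per_image = image-text cosine similarities scaled by CLIP's temperature
    return out.logits_per_image.softmax(dim=-1)

# Setup 2: a plain classification head on top of the (here frozen) image encoder,
# trained with standard categorical cross-entropy.
class LinearHeadBaseline(nn.Module):
    def __init__(self, clip_model: CLIPModel, num_classes: int = 4):
        super().__init__()
        self.clip = clip_model
        for p in self.clip.parameters():
            p.requires_grad = False  # freeze; or unfreeze to match LASTED's training budget
        self.head = nn.Linear(clip_model.config.projection_dim, num_classes)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        feats = self.clip.get_image_features(pixel_values=pixel_values)
        return self.head(feats)  # feed the logits to nn.CrossEntropyLoss
```

The interesting question is whether the language-guided contrastive supervision in setup 1 actually generalizes better to unseen generators than the plain cross-entropy head in setup 2, given the same image encoder and training data.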