Closed by xyhanHIT 4 months ago
After text feature extraction, each sentence has shape (1, 20, 768), where the first of the 20 tokens is the cls_token. This token is exactly the same for all sentences, so it does not need to be decoded. In Semantic_feature_decoding.py we only decode the last 19 tokens, and then merge the cls_token with the decoded tokens during the image reconstruction phase. You can extract and save the cls_token yourself after the text feature extraction.
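For anyone following along, here is a minimal sketch of that split and merge, assuming the features are held as a NumPy array. The function names `split_cls` and `merge_cls` and the random stand-in data are made up for illustration; they are not taken from the repository.

```python
import numpy as np


def split_cls(text_features: np.ndarray):
    """Split (1, 20, 768) features into the cls_token and the 19 tokens to decode."""
    cls_token = text_features[:, :1, :]   # shape (1, 1, 768), same for every sentence
    to_decode = text_features[:, 1:, :]   # shape (1, 19, 768), the decoding targets
    return cls_token, to_decode


def merge_cls(cls_token: np.ndarray, decoded_tokens: np.ndarray) -> np.ndarray:
    """Prepend the saved cls_token to the decoded tokens along the token axis."""
    return np.concatenate([cls_token, decoded_tokens], axis=1)


# Hypothetical stand-in for the real extracted features of one sentence.
rng = np.random.default_rng(0)
text_features = rng.standard_normal((1, 20, 768)).astype(np.float32)

cls_token, to_decode = split_cls(text_features)

# Placeholder: in practice these would be the tokens predicted by the decoder.
decoded_tokens = to_decode.copy()

merged = merge_cls(cls_token, decoded_tokens)
```

Because the cls_token is the same for every sentence, it only needs to be stored once (for example with `np.save`) and can be reloaded before the merge step.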
Thanks for your reply! I have another problem when I run "Reconstruction.py". On line 383, the parameter "loss_CLIP_weight" does not seem to be defined anywhere earlier. What value did you set it to during the iteration?
Sorry, I forgot to change it. This is the weight applied to the value of the last layer of CLIP, and it should be set to 0.
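As a rough illustration of what setting the weight to 0 means, assuming the weight multiplies a CLIP last-layer term that is added to the other losses (the actual expression on line 383 of Reconstruction.py may differ, and `total_loss`, `base_loss`, and `clip_last_layer_loss` are hypothetical names):

```python
def total_loss(base_loss: float, clip_last_layer_loss: float,
               loss_CLIP_weight: float = 0.0) -> float:
    """Combine losses; the CLIP last-layer term is scaled by loss_CLIP_weight."""
    # Assumption: the weight is a plain multiplier on the CLIP last-layer term.
    return base_loss + loss_CLIP_weight * clip_last_layer_loss


# With the weight at 0, the CLIP last-layer term has no effect on the total.
loss_value = total_loss(base_loss=1.5, clip_last_layer_loss=10.0, loss_CLIP_weight=0.0)
```

Under that assumption, a weight of 0 simply switches the term off.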
Hello, when I run the code "Semantic_feature_decoding.py", I find the following issues: