ReedOnePeck / MindDiffuser

MIT License

Some questions about the semantic feature decoding #9

Closed: xyhanHIT closed this issue 4 months ago

xyhanHIT commented 5 months ago

Hello, when I run the code "", I find the following issues:

  1. In Line 44, "X.shape[1]" raises a dimension error. X seems to have only one dimension.

  2. I tried changing "X.shape[1]" to "X.shape[0]"; the regression result displayed in Line 84 is then 0.53. I want to know whether 0.53 is a normal result.

  3. In Line 150, "decode_LDM_text_feature" needs a parameter "args.cls_token_path". I don't know how to get this cls file, and the "Namespace" doesn't have this attribute.


ReedOnePeck commented 5 months ago
  1. On the regression result: 0.53 seems a bit high. The mean I reported in the paper is around 0.26. Check that X.shape[0] really is the number of data points in your dataset.
  2. On cls_token: after text feature extraction, each sentence has shape (1, 20, 768), and the first of the 20 tokens is the cls_token. This token is exactly the same for all sentences, so it does not need to be decoded. In the file, we only decode the remaining 19 tokens, and then merge the cls_token with the decoded tokens during the image reconstruction phase. You can extract and save the cls_token yourself after text feature extraction.
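
The two points above can be sketched roughly as follows. This is only an illustration with NumPy on synthetic data, not the repository's code: the helper names `check_sample_axis`, `split_cls_token`, and `merge_cls_token` are hypothetical, and it assumes X is meant to be a 2-D (samples, features) array and that text features are stacked as (N, 20, 768).

```python
import numpy as np


def check_sample_axis(X, expected_samples):
    """Verify that axis 0 of X is the sample axis (assumed 2-D layout)."""
    X = np.asarray(X)
    if X.ndim != 2:
        raise ValueError(f"expected (samples, features), got shape {X.shape}")
    if X.shape[0] != expected_samples:
        raise ValueError(
            f"X.shape[0] is {X.shape[0]}, expected {expected_samples} samples"
        )
    return X.shape


def split_cls_token(text_features):
    """Split (N, 20, 768) features into the shared cls_token and the rest."""
    if text_features.ndim != 3:
        raise ValueError(f"expected (N, tokens, dim), got {text_features.shape}")
    cls_token = text_features[0, :1, :]  # shape (1, 768)
    # The thread states this token is identical for every sentence.
    if not np.allclose(text_features[:, :1, :], cls_token):
        raise ValueError("first token differs across sentences")
    content = text_features[:, 1:, :]  # shape (N, 19, 768), the part to decode
    return cls_token, content


def merge_cls_token(cls_token, decoded):
    """Prepend the saved cls_token: (N, 19, 768) -> (N, 20, 768)."""
    n, _, dim = decoded.shape
    cls = np.broadcast_to(cls_token[None, :, :], (n, 1, dim))
    return np.concatenate([cls, decoded], axis=1)


# Synthetic stand-in for extracted text features (not real data).
rng = np.random.default_rng(0)
features = rng.standard_normal((4, 20, 768)).astype(np.float32)
features[:, 0, :] = features[0, 0, :]  # make the first token shared

cls_token, content = split_cls_token(features)
restored = merge_cls_token(cls_token, content)
print(cls_token.shape, content.shape, restored.shape)
```

The extracted `cls_token` could then be written to disk, for example with `np.save`, and its path supplied as `args.cls_token_path`. The file format the script actually expects is not stated in this thread, so check the loading code before relying on that.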

xyhanHIT commented 5 months ago

Thanks for your reply! I have another problem when I run the file "". In Line 383, the parameter "loss_CLIP_weight" does not seem to be defined anywhere earlier. What value did you set it to during the iteration?

ReedOnePeck commented 5 months ago

Sorry, I forgot to change that. This weight applies to the term computed from the last layer of CLIP, and it should be set to 0.
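
To illustrate what a weight of 0 means in practice, here is a minimal sketch that assumes the objective is a weighted sum of loss terms. How the repository actually combines its terms is not shown in this thread; the function `total_loss`, the term names, and the numeric values are all made up for illustration.

```python
def total_loss(loss_terms, weights):
    """Weighted sum of named loss terms (assumed form of the objective)."""
    return sum(weights[name] * value for name, value in loss_terms.items())


# Made-up loss values for one optimization step.
loss_terms = {"semantic": 0.8, "clip_last_layer": 2.5}

# loss_CLIP_weight = 0 corresponds to the 0.0 entry below.
weights = {"semantic": 1.0, "clip_last_layer": 0.0}

combined = total_loss(loss_terms, weights)
print(combined)
```

Under this assumption, the last-layer CLIP term contributes nothing to the total, so the optimization is driven entirely by the remaining terms.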