KamitaniLab / GenericObjectDecoding

Demo code for Horikawa and Kamitani (2017), "Generic decoding of seen and imagined objects using hierarchical visual features," Nat Commun. https://www.nature.com/articles/ncomms15037

four models #2

Closed Yingying-H closed 6 years ago

Yingying-H commented 6 years ago

Excuse me, I don't understand why you used four types of computational models. The CNN alone seems sufficient for these experiments, and the results also showed that the CNN was more effective than the other models.

horikawa-t commented 6 years ago

Hi, HYYJane,

We used the three computational models other than the CNN model because they had been frequently used in previous studies (e.g., Khaligh-Razavi & Kriegeskorte, 2014), and we wanted to compare their results with those obtained with the CNN model. The comparisons between the CNN model and the other models demonstrate the effectiveness of the CNN model.

Best, Tomoyasu

Yingying-H commented 6 years ago

Hi, Tomoyasu,

Thanks for your reply, I got the point. But I now have some other questions.

The first question is how the feature values were extracted for all 13 visual feature types/layers. From the code you provided, we can see that the CNN layer features are extracted in series, but how were the features of the other 5 layers extracted? Did they run before or after the CNN layers, and in series or in parallel? Also, in Figure 10, identification accuracy is reported for each feature layer and for all of them combined, so how were they combined?

The second question is about the number of units in each feature type/layer. On page 2 of the article (RESULTS section) you mention "~1000 units for each feature type/layer"; on page 9, "all 13 feature types/layers (13024 units)"; but on page 12 (CNN part), "we randomly selected 1000 units in each of the first to seventh layers and used all 1000 units in the eighth layer." So how should the number of units in each type/layer be understood?

The third question is how to understand the similar or related categories on page 7. In Figure 9a, I looked up pictures of the category names ranked in the grid, and apart from the one shown in red they look very different from each other. Your analysis shows that categories ranked in higher positions tend to have shorter semantic distances to the target category, but I don't think the semantic distance between "fire extinguisher" and "sagebrush lizard" is particularly short. How should this part be understood?

I am very interested in this research and I hope you don't mind my asking. Could we discuss further details by e-mail? (My e-mail: snowyy151314@163.com) Looking forward to your reply!

Best, HYYJane

horikawa-t commented 6 years ago

Hi, HYYJane,

To the first question: Actually, we have provided code only for extracting the CNN features. For the other 5 visual feature types from the 3 other models (GIST, SIFT, HMAX1-3), you can use the code provided on the websites of the original studies:

GIST: Oliva, A. & Torralba, A. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV 42, 145–175 (2001).
SIFT: Vedaldi, A. & Fulkerson, B. VLFeat: An open and portable library of computer vision algorithms (version 0.9.9) (2008).
HMAX: Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M. & Poggio, T. Robust object recognition with cortex-like mechanisms. IEEE Trans. Pattern Anal. Mach. Intell. 29, 411–426 (2007).
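As a purely illustrative sketch (not the pipeline used in the paper, which relied on the MATLAB toolboxes above), a bag-of-visual-words SIFT representation with 1000 visual words could be built in Python with OpenCV and scikit-learn roughly like this; the image paths are hypothetical placeholders:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

N_WORDS = 1000  # number of visual words, matching the SIFT setting described below

# Hypothetical training images used to learn the visual-word codebook
train_image_paths = ['img001.jpg', 'img002.jpg']

def sift_descriptors(image_path):
    """Return the (n_keypoints, 128) SIFT descriptor matrix of one grayscale image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.SIFT_create().detectAndCompute(img, None)
    return desc

# 1. Learn a codebook of 1000 visual words by clustering descriptors from training images
train_desc = np.vstack([sift_descriptors(p) for p in train_image_paths])
codebook = KMeans(n_clusters=N_WORDS, random_state=0, n_init=10).fit(train_desc)

# 2. Represent each image as a normalized 1000-bin histogram of visual-word assignments
def bovw_features(image_path):
    words = codebook.predict(sift_descriptors(image_path))
    hist = np.bincount(words, minlength=N_WORDS).astype(float)
    return hist / hist.sum()
```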

Regarding the last point of the first question, we simply concatenated all of the visual features to construct a vector with 13024 feature units (see below for the number of units in each visual feature type/layer).
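As a minimal sketch of that concatenation with dummy data (not the repository's actual code), assuming the per-layer unit counts listed in the next answer:

```python
import numpy as np

n_images = 10  # dummy number of stimulus images

# Per-layer unit counts after unit selection: CNN1-8, HMAX1-3, GIST, SIFT (see below)
unit_counts = [('CNN%d' % i, 1000) for i in range(1, 9)]
unit_counts += [('HMAX1', 1000), ('HMAX2', 1000), ('HMAX3', 1000),
                ('GIST', 1024), ('SIFT', 1000)]

# Dummy (n_images x n_units) feature matrices, one per feature type/layer
features = {name: np.random.rand(n_images, n) for name, n in unit_counts}

# The combined ("All") feature space is the column-wise concatenation of the 13 types/layers
all_features = np.hstack([features[name] for name, _ in unit_counts])
print(all_features.shape)  # -> (10, 13024)
```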

To the second question: This information is described in the Methods section of our paper. For the CNN features, we randomly selected 1000 units from each of CNN1-7 and used all 1000 units of CNN8. For SIFT, we set the number of visual words to 1000. For GIST, we used 1024 features calculated with our GIST parameter settings. For HMAX, we randomly selected 1000 units from each of the S1 and S2&C2 features, and used all of the C3 features. The units from all of these feature types/layers add up to 13024 in total.
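For illustration only (again, not the actual analysis code), random unit selection amounts to drawing one fixed random subset of unit indices per layer and applying it to every image; the 290400 below is just an example value (the conv1 output size of an AlexNet-style network, 96 maps x 55 x 55):

```python
import numpy as np

rng = np.random.RandomState(0)

n_images = 10
n_total_units = 290400   # example: full unit count of one CNN layer
n_keep = 1000            # number of randomly selected units per layer

cnn1_full = rng.rand(n_images, n_total_units)  # dummy full feature matrix

# Pick the same 1000 unit indices for every image, then subsample the feature matrix
keep_idx = rng.choice(n_total_units, size=n_keep, replace=False)
cnn1_selected = cnn1_full[:, keep_idx]
print(cnn1_selected.shape)  # -> (10, 1000)
```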

To the third question: As shown in Figure 9, while we found a tendency for highly ranked categories to be semantically similar to the target categories, the effect was moderate (correlation coefficients were around 0.2-0.4). Therefore, some highly ranked categories can be semantically dissimilar to the target categories. Still, the tendency was robustly observed, especially with the higher CNN layers.
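If it helps to build intuition for the "semantic distance" between category names, one rough way to compute a WordNet-based distance in Python is with NLTK, as sketched below; this is only an illustration, and the exact distance definition used in the paper may differ:

```python
from nltk.corpus import wordnet as wn  # requires nltk and the 'wordnet' corpus download

def semantic_distance(name_a, name_b):
    """Shortest-path distance between the first noun synsets of two category names."""
    syn_a = wn.synsets(name_a, pos=wn.NOUN)[0]
    syn_b = wn.synsets(name_b, pos=wn.NOUN)[0]
    return syn_a.shortest_path_distance(syn_b)

# The pair raised in the question vs. a presumably more closely related pair, for comparison
print(semantic_distance('fire_extinguisher', 'sagebrush_lizard'))
print(semantic_distance('fire_extinguisher', 'fire_alarm'))
```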

If you want to ask more, feel free to contact me by e-mail (horikawa-t@atr.jp).

Best, Tomoyasu