wlc1256630 opened 2 weeks ago
The work you have done in this article is solid and makes a great contribution to the further development of this field, but I have a question. I saw in the NICE article that the image encoder used is the same pre-trained CLIP model. How can we ensure that the image information extracted by the image encoder is close to what the human eye actually attends to while the EEG signals are being collected? After all, since the EEG signals are collected with the RSVP paradigm, the images in the sequence flash by very quickly. Or does this have no effect on the image-EEG feature alignment? I have some doubts in this regard.

Hello, @wlc1256630, sorry for the late reply. 1) It's a fascinating question. It's more of a results-driven conclusion that CLIP can be used to obtain image features that are consistent with our visual system. 2) An SOA of 200 ms is enough for visual processing up through object recognition; see https://www.youtube.com/watch?v=JhpvpHlfPlE&t=46s. In practice, it's a balance between per-stimulus time and data scale. Another issue is that the pre- and post-stimulus responses would interfere with each other.
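To make point 1) concrete, here is a minimal sketch (not the code from this repository) of the setup under discussion: a frozen, pre-trained CLIP image encoder supplies the image features, and only an EEG encoder is trained to align with them through a CLIP-style contrastive loss. It assumes OpenAI's `clip` package; `image_features`, `clip_style_loss`, and `eeg_emb` are hypothetical names, with `eeg_emb` standing in for the output of whatever EEG encoder is being trained.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI's CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
clip_model.eval()  # the image encoder stays frozen; only the EEG encoder trains

def image_features(pil_images):
    """Encode a batch of PIL images with the frozen CLIP image encoder."""
    batch = torch.stack([preprocess(im) for im in pil_images]).to(device)
    with torch.no_grad():
        feats = clip_model.encode_image(batch).float()
    return F.normalize(feats, dim=-1)

def clip_style_loss(eeg_emb, img_emb, temperature=0.07):
    """Symmetric InfoNCE loss: matching EEG/image pairs sit on the diagonal."""
    eeg_emb = F.normalize(eeg_emb, dim=-1)
    logits = eeg_emb @ img_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

Because the image encoder is frozen, the alignment can only pull EEG embeddings toward whatever features CLIP already encodes; their consistency with human vision is, as noted above, a results-driven observation rather than something enforced by the training objective.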
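And for point 2), a back-of-the-envelope sketch of the two constraints mentioned; the epoch window length below is illustrative, not taken from the paper:

```python
# Illustrative numbers only: why the SOA is a balance between per-stimulus
# time and data scale, and why neighboring stimuli interfere within any
# analysis window longer than the SOA.
soa = 0.200              # stimulus onset asynchrony: one image every 200 ms
tmin, tmax = -0.2, 0.8   # a hypothetical epoch window around each onset

# Onsets of *other* images that fall inside the target image's epoch.
neighbors = [round(k * soa, 3) for k in range(-5, 6)
             if k != 0 and tmin < k * soa < tmax]
print(f"Neighboring onsets inside the {tmax - tmin:.1f} s epoch: {neighbors}")
# -> Neighboring onsets inside the 1.0 s epoch: [0.2, 0.4, 0.6]

# The flip side of a fast RSVP rate: many more trials per unit recording time.
print(f"Images presented per minute at this SOA: {60 / soa:.0f}")
# -> Images presented per minute at this SOA: 300
```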