DylanLIiii opened this issue 6 months ago
I played with this architecture as part of an engineering project on generating art from neural data. While I did not apply the scientific method rigorously, I can say that there are ways to penalise the model so that it generalises better: you can add penalty terms to the loss to curb overfitting, apply penalties at the right moments in training based on human feedback, and unfreeze different layers of the diffusion model.
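A minimal PyTorch sketch of the two ideas above, assuming a diffusers-style U-Net backbone and a scalar `feedback_penalty` derived from human ratings; the names and weights are illustrative, not from this repo:

```python
import torch

def regularized_loss(denoise_loss, model, feedback_penalty,
                     l2_weight=1e-5, fb_weight=0.1):
    """Base diffusion loss plus two penalty terms that discourage overfitting."""
    # L2 penalty over the trainable parameters only
    l2 = sum(p.pow(2).sum() for p in model.parameters() if p.requires_grad)
    # feedback_penalty: a scalar in [0, 1], e.g. 1 - mean human rating for the batch
    return denoise_loss + l2_weight * l2 + fb_weight * feedback_penalty

def unfreeze_last_blocks(unet, n_blocks=2):
    """Freeze everything, then unfreeze only the last few up-blocks."""
    for p in unet.parameters():
        p.requires_grad = False
    for block in list(unet.up_blocks)[-n_blocks:]:  # assumes a diffusers-style U-Net
        for p in block.parameters():
            p.requires_grad = True
```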
@DylanLIiii Hello, I would like to ask for your advice. Have you run into this question: is there any inherent connection between the image features extracted by the pre-trained CLIP model and the EEG features extracted by the EEG encoder? I have read several articles on EEG-image matching, and the EEG data they use are signal changes evoked by visual stimuli collected under the RSVP paradigm. What puzzles me is that during the brief image presentation our eyes should attend mostly to salient information such as color and shape, while the image features CLIP extracts are not necessarily that kind of information. So how is the correspondence between the two established during feature alignment? Similarity calculation is used for matching, but it feels like a kind of hard alignment to me; there is no inherent connection between the two modalities.
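For concreteness, the similarity matching in these papers usually amounts to a CLIP-style symmetric contrastive loss between the two embedding spaces. A minimal sketch (encoder outputs and dimensions are hypothetical):

```python
import torch
import torch.nn.functional as F

def clip_style_alignment_loss(eeg_emb, img_emb, temperature=0.07):
    """Symmetric InfoNCE: pull matched (EEG, image) pairs together,
    push apart all other pairs in the batch."""
    eeg_emb = F.normalize(eeg_emb, dim=-1)   # (B, D) EEG encoder outputs
    img_emb = F.normalize(img_emb, dim=-1)   # (B, D) frozen CLIP image features
    logits = eeg_emb @ img_emb.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(eeg_emb.size(0), device=eeg_emb.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

Note that nothing in this objective requires the EEG features to encode color or shape specifically; it only forces whatever the EEG encoder can extract to be predictive of which image in the batch was shown, which is why it can feel like hard alignment.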
Actually, I'm also a newcomer to neuroscience, and I've been thinking about the issue you mention. After reading some papers on generating (or reconstructing) images from EEG, I noticed that some claim only EEG signals within a certain time window yield good results. This window is generally longer than the RSVP presentation, which could be seen as a kind of delay (I'm not sure my understanding is correct). In any case, I think that although what people notice first is color and the like, after a few hundred milliseconds the nervous system has processed the information of the entire image. I've been doing some validation and modeling of the EEG/MEG generative work lately, so if you're interested in talking about it, get in touch.
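As a concrete illustration of the window idea: cropping each stimulus-locked epoch to a later post-stimulus window is a simple slicing operation. The sampling rate and window bounds below are illustrative, not taken from any particular paper:

```python
import numpy as np

def crop_epoch(epoch, sfreq, tmin=0.1, tmax=0.6):
    """epoch: (n_channels, n_samples) array, time-locked to stimulus onset at t=0.
    Keeps only the samples between tmin and tmax seconds after onset."""
    start, stop = int(tmin * sfreq), int(tmax * sfreq)
    return epoch[:, start:stop]

# e.g. keep 100-600 ms of a 1 s epoch sampled at 1000 Hz
window = crop_epoch(np.zeros((64, 1000)), sfreq=1000)
print(window.shape)  # (64, 500)
```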
> I played with this architecture as part of an engineering project on generating art from neural data... there are ways to penalise the model so that it generalises better: add penalty terms to the loss to curb overfitting, apply penalties based on human feedback, and unfreeze different layers of the diffusion model.
I understand these practices. I believe generative models have great potential to do amazing things. However, the original goal here is to reconstruct accurate images, not creative generation. I think the generalization you mention comes largely from the strong prior knowledge baked into the pretrained diffusion model.
The pre-training results of the paper are completely irreproducible. The paper omits even basic preprocessing steps for the data, and the datasets used in pre-training exhibit large magnitude differences with no preprocessing at all! The seemingly good results come solely from the forced alignment to the CLIP encoder in the final step. While this approach undoubtedly yields positive numbers, it prevents the model from generalizing beyond the local dataset. I am uncertain of the value of such work. If one were to treat EEG as a time series for training, I would recommend adopting methods from works like TiMAE instead.
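As a baseline for the missing preprocessing: even per-channel z-scoring within each recording would remove the magnitude differences between datasets. A minimal numpy sketch (not the authors' pipeline):

```python
import numpy as np

def zscore_per_channel(eeg, eps=1e-8):
    """eeg: (n_channels, n_samples). Standardize each channel within the recording
    so data recorded at different scales become comparable before pre-training."""
    mean = eeg.mean(axis=1, keepdims=True)
    std = eeg.std(axis=1, keepdims=True)
    return (eeg - mean) / (std + eps)
```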