MedARC-AI / fMRI-reconstruction-NSD

fMRI-to-image reconstruction on the NSD dataset.
MIT License

Data preprocessing #48

Closed nihaoxiaoli closed 3 months ago

nihaoxiaoli commented 3 months ago

Hello, thank you for your excellent work. Could you provide the code that processes the raw NSD data into 'webdataset_avg_split'? I know how to convert from volume to voxels, but I would like to know how the training, validation, and test splits are made, and how each sample is named (e.g. 'sample00000300'). Looking forward to your reply.

PaulScotti commented 3 months ago

Have you seen our latest work MindEye2? We share the dataset creation script for it here: https://github.com/MedARC-AI/MindEyeV2/blob/main/src/dataset_creation.ipynb

Generally speaking, though, the test set is all the shared1000 samples and train/val are the non-shared1000 samples. The train/val split was random (roughly 10% of the train samples became the val set). All samples were shuffled after being allocated to train/val/test, so the sample numbering is arbitrary.
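
If it helps, the logic is roughly this. A minimal sketch, not the released script: field names (masterordering, subjectim, sharedix) follow nsd_expdesign.mat as described in the NSD data manual, with MATLAB's 1-indexing converted to 0-indexing.

```python
import numpy as np
from scipy.io import loadmat

exp = loadmat("nsddata/experiments/nsd/nsd_expdesign.mat")
ordering = exp["masterordering"].squeeze() - 1  # trial -> per-subject image slot
subjectim = exp["subjectim"] - 1                # per-subject 73k image IDs
sharedix = exp["sharedix"].squeeze() - 1        # 73k IDs of the shared1000 images

subj = 0                                        # subject 1
image_ids = subjectim[subj, ordering]           # 73k image ID shown on each trial

# test = all shared1000 trials; train/val = everything else
is_shared = np.isin(image_ids, sharedix)
test_idx = np.where(is_shared)[0]
trainval_idx = np.where(~is_shared)[0]

# roughly 10% of the non-shared trials become the val set
rng = np.random.default_rng(0)
rng.shuffle(trainval_idx)
n_val = len(trainval_idx) // 10
val_idx, train_idx = trainval_idx[:n_val], trainval_idx[n_val:]

# samples were shuffled after allocation, so names like sample00000300
# carry no ordering information
rng.shuffle(train_idx)
```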

Hope that helps!

nihaoxiaoli commented 3 months ago

> Have you seen our latest work MindEye2? We share the dataset creation script for it here: https://github.com/MedARC-AI/MindEyeV2/blob/main/src/dataset_creation.ipynb
>
> Generally speaking, though, the test set is all the shared1000 samples and train/val are the non-shared1000 samples. The train/val split was random (roughly 10% of the train samples became the val set). All samples were shuffled after being allocated to train/val/test, so the sample numbering is arbitrary.
>
> Hope that helps!

Thank you for your assistance in answering my question!

Boltzmachine commented 2 months ago

Could you share the preprocessing script for MindEye1? I cannot reproduce the results by modifying the one for MindEye2.

PaulScotti commented 2 months ago

That code is not in a state that would be usefully sharable, but in theory all you need is to download the NSD dataset and use the functions in this repo (https://github.com/tknapen/nsd_access) to extract the betas from the nsdgeneral ROI. The webdataset format used for MindEye1 just shuffles those betas + images into tar files split by train/val/test, with voxelwise z-scoring based on the train split.
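
A sketch of that pipeline, under assumptions: standard NSD directory layout, the nsd_access package, and the split indices (train_idx) from the earlier sketch. Exact paths, session counts, and array orientations may differ on your setup.

```python
import numpy as np
import nibabel as nib
from nsd_access import NSDAccess

nsd_root = "/path/to/NSD"  # hypothetical local path to the downloaded dataset
nsda = NSDAccess(nsd_root)

# nsdgeneral ROI mask in 1.8mm functional space (1 = inside the ROI)
mask = nib.load(
    f"{nsd_root}/nsddata/ppdata/subj01/func1pt8mm/roi/nsdgeneral.nii.gz"
).get_fdata()
roi = mask == 1

# stack betas across sessions, keeping only nsdgeneral voxels
# (37 of subj01's 40 sessions; the last 3 were initially held out)
betas = []
for sess in range(1, 38):
    b = nsda.read_betas("subj01", sess,
                        data_type="betas_fithrf_GLMdenoise_RR",
                        data_format="func1pt8mm")
    # assumed (x, y, z, 750); transpose first if your nsd_access
    # version returns trials as the leading axis
    betas.append(b[roi].T.astype(np.float32))  # -> (750, num_voxels)
betas = np.concatenate(betas)                  # (n_trials, num_voxels)

# voxelwise z-scoring based on the train split only
train = train_idx[train_idx < len(betas)]      # drop unscanned trials
mu, sd = betas[train].mean(0), betas[train].std(0)
betas = (betas - mu) / (sd + 1e-8)
# from here: pair each trial's betas with its image and shard the shuffled
# train/val/test samples into webdataset-style tar files
```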

Boltzmachine commented 2 months ago

I see. In the script https://github.com/MedARC-AI/MindEyeV2/blob/main/src/dataset_creation.ipynb, you have behav, past_behav, old_behav, future_behav. The voxel inputs for MindEye1 are of size (batch_size, 3, num_voxels); may I know which one I should choose for MindEye1's inputs, and what the "3" corresponds to?

PaulScotti commented 2 months ago

the 3 corresponds to the 3 image repeats -- subjects saw each image 3 times across the scans

you'd want to use behav to get the intended voxel indices if you were adopting the MindEye2 dataloading approach; note that MindEye1 and MindEye2 use different webdatasets and different data loading approaches
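
A sketch of one common way to consume the repeats axis (pick a random repeat while training, average the repeats at eval), not necessarily the exact repo code:

```python
import torch

def select_voxels(voxels: torch.Tensor, training: bool) -> torch.Tensor:
    """voxels: (batch_size, 3, num_voxels) -- one row per image repeat."""
    if training:
        rep = torch.randint(0, 3, (1,)).item()  # random repeat each step
        return voxels[:, rep]
    return voxels.mean(dim=1)                   # average repeats for eval
```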

Boltzmachine commented 2 months ago

Thanks so much for the reply!! I am using MindEye1's way of loading the dataset, but the MindEye1 training pipeline accepts voxels of shape (batch_size, 3, num_voxels), while behav generated by MindEye2's script is of size (batch_size, num_voxels), so I am very confused.

PaulScotti commented 2 months ago

yeah we grouped the repeats together for ME1 but not for ME2

Why not just download the ME1 webdataset we provided if you are running ME1? The ME1 and ME2 datasets were not meant to be used interchangeably across the two papers; you'd need to make some code changes that I can't commit time to help with if you want to modify the ME1 code to work with ME2 data.

Boltzmachine commented 2 months ago

It's just that I need to use all the voxels of the fMRI data, so I need to rerun the preprocessing script. So the only thing I need to do now is group behav by image_idx?

PaulScotti commented 2 months ago

if you need to use all the voxels then you can do one of these:

- For ME1, if you don't mind the fact that we don't include the last 3 sessions (held out for Algonauts), we already provide the whole-brain data in the tar files in the webdataset_avg_split folder (that's why those tar files are larger than webdataset_avg_new): https://huggingface.co/datasets/pscotti/naturalscenesdataset/tree/main/webdataset_avg_split/train

- For ME2, create a new betas_all_subj01_fp32_renorm.hdf5 file by concatenating across voxels pulled from nsd_access (https://github.com/tknapen/nsd_access)

- Or yes, you could manually implement code to group the same repeats from behav, but note you'd first need to preload all the samples in order to find the repeats, since behav defaults to just the batch rather than the full dataset -- see the sketch below
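
A sketch of that last option, assuming you have already preloaded, per trial: image_ids (n_trials,) and betas (n_trials, num_voxels), e.g. by iterating the whole dataset once, since one batch won't contain all repeats of an image.

```python
import numpy as np
from collections import defaultdict

# collect the trial indices at which each image was shown
groups = defaultdict(list)
for trial, img in enumerate(image_ids):
    groups[int(img)].append(trial)

# keep images with all 3 repeats; stack to (n_images, 3, num_voxels)
complete = sorted(k for k, v in groups.items() if len(v) >= 3)
voxels_3rep = np.stack([betas[groups[k][:3]] for k in complete])
grouped_image_ids = np.asarray(complete)
```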

Boltzmachine commented 2 months ago

Oh, I have been using webdataset_avg_split, but why do different subjects have different numbers of voxels? My impression was that the number of voxels should be the same if it is whole-brain data.

nihaoxiaoli commented 2 months ago

> Oh, I have been using webdataset_avg_split, but why do different subjects have different numbers of voxels? My impression was that the number of voxels should be the same if it is whole-brain data.

Hello, I am the person who raised this issue. You are probably wondering how the images are matched with the fMRI signals, right? That mapping is in the original data: nsddata/experiments/nsd/nsd_expdesign.mat.

You should first check the official NSD data manual: https://cvnlab.slite.page/p/CT9Fwl4_hc/NSD-Data-Manual
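
For example (a sketch reusing subjectim/ordering from the expdesign file, as in the split sketch earlier in this thread, plus nsd_access for the stimuli):

```python
# match one trial to its stimulus image
trial = 300                                   # arbitrary trial index for subj01
img73k = int(subjectim[0, ordering[trial]])   # 0-indexed 73k image ID
image = nsda.read_images([img73k])            # RGB array, one 425x425 image
```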

I hope this can answer your question. If you have any more questions, please create a new issue for inquiry.