On the dataset configuration

beotborry commented 11 months ago

First of all, thank you for your wonderful project, and congratulations on your acceptance!

I have a few questions about the NSD dataset.

Your paper states that there are 24,980 training examples and 2,770 test samples. Does this mean there are 24,980 training pairs of (voxel, img) and 2,770 test pairs per subject? I ask because your code sets num_train = 8557 + 300 and num_val = 982, and the numbers do not match: 8857 * 3 = 26571 and 982 * 3 = 2946. Could you clarify where these numbers come from?
After scrutinizing the training data of subject 1, I found duplicate images. For example, the images for the 0th tuple and the 400th tuple are the same. I checked all 982 pairs in the test set, and they all have distinct images, but the training set does not. So I wonder whether these duplicates in the training dataset are okay. (I found 5970 unique tuples in the training dataset.)
In addition, your code seems to have an error when building the validation loader. If 982 % batch_size != 0, the last incomplete batch seems to be dropped. I think this should be fixed, and I wonder whether it could change the results reported in the paper. https://github.com/MedARC-AI/fMRI-reconstruction-NSD/blob/4d02ab3b63e45bb4e35c15f5d433a1ad0569ee77/src/utils.py#L336

Thank you.
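
To make the validation-loader concern concrete, here is a minimal sketch of the arithmetic only. It does not call the repository's loader; the function names and the batch sizes tried are hypothetical, chosen for illustration.

```python
# Sketch of the arithmetic behind the dropped-batch concern.
# NOTE: function names and batch sizes are illustrative assumptions,
# not taken from the repository.

def dropped_samples(num_samples: int, batch_size: int) -> int:
    """Samples lost when the final incomplete batch is discarded."""
    return num_samples % batch_size


def num_batches(num_samples: int, batch_size: int, keep_partial: bool) -> int:
    """Batches produced, with or without the trailing partial batch."""
    full, rest = divmod(num_samples, batch_size)
    return full + (1 if keep_partial and rest else 0)


if __name__ == "__main__":
    num_val = 982  # validation size quoted in this thread
    for batch_size in (2, 32, 300):  # hypothetical batch sizes
        print(
            batch_size,
            num_batches(num_val, batch_size, keep_partial=False),
            num_batches(num_val, batch_size, keep_partial=True),
            dropped_samples(num_val, batch_size),
        )
```

Since 982 = 2 * 491 and 491 is prime, only batch sizes of 1, 2, 491, or 982 divide the set evenly; any other choice leaves a remainder that is discarded when partial batches are not kept.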

XuZhang2 commented 10 months ago

@beotborry Hi, I encountered the same issue while training. The actual number of training steps is four times the expected number (1104 vs. 276). How did you solve this problem? Do I need to change the values of the num_train and num_test variables? I appreciate your help. Besides, I find that 8859+300 seems to be the right number. Maybe the problem is in the get_dataloaders function.

PaulScotti commented 10 months ago

Sorry for the delay. All images in the Natural Scenes Dataset were seen up to 3 times by the subjects. During training we train the model on every sample (which explains the duplicates, up to 3 per image), but during testing we averaged across the same-image repeats (which explains the lack of duplicates in the test set). Given that the validation set is the test set, there are 982 unique test images, so there shouldn't be an incomplete batch (if there is one, partial should probably be set to True to allow incomplete batches to go through). num_train should actually be multiplied by 3 because of how we process the samples that are not averaged across repeats, so that's a mistake on our part (and it was used for the results in the paper). I don't know how much that would tangibly affect results, probably not much.
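
As an illustration of the difference between the two views described above, here is a minimal sketch on toy data. It is not the repository's pipeline; the function name, image ids, and voxel values are all made up for the example.

```python
# Toy illustration: training keeps every repeat, testing averages repeats
# of the same image. All names and values here are hypothetical.
from collections import defaultdict


def average_repeats(samples):
    """Average voxel vectors that share an image id.

    samples: list of (image_id, voxels), where voxels is a list of floats.
    Returns a dict mapping image_id to the element-wise mean vector.
    """
    grouped = defaultdict(list)
    for image_id, voxels in samples:
        grouped[image_id].append(voxels)
    return {
        image_id: [sum(column) / len(column) for column in zip(*repeats)]
        for image_id, repeats in grouped.items()
    }


toy_samples = [
    ("img_a", [1.0, 2.0]),
    ("img_b", [0.0, 4.0]),
    ("img_a", [3.0, 4.0]),
    ("img_a", [5.0, 6.0]),
]

train_view = toy_samples                  # every repeat is its own sample
test_view = average_repeats(toy_samples)  # one entry per unique image

if __name__ == "__main__":
    print(len(train_view), len(test_view))
    print(test_view)
```

In this toy case the training view has four samples with a duplicated image, while the averaged view has one entry per unique image, which matches the pattern of duplicates in training and distinct images at test time.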

song-wensong commented 9 months ago

In the provided JSON file (metadata_subj01.json), the values for train, val, and test are as follows:

"train": 8559,
"val": 300,
"test": 982

However, in the initial code snippet, the default values for num_train and num_val are specified as follows:

"num_train": 8859,
"num_val": 982,

I have the following questions regarding the differences:

Why is there a discrepancy between the values of train in the JSON file and num_train in the default code?

Additionally, why is num_val equal to test in the JSON file?

PaulScotti commented 9 months ago

Because when we were developing the model, we used the validation set as our test set. For the final models used in the paper, we consolidated the validation set into the training set and used the test set as our test set: 8559 + 300 = 8859.
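
The bookkeeping described above can be sketched as follows. The metadata values are the ones quoted from metadata_subj01.json earlier in this thread; the function name and its `final` flag are hypothetical and do not come from the repository.

```python
# Sketch of how the default num_train / num_val follow from the metadata.
# The metadata values are quoted in this thread; split_sizes() and its
# `final` flag are illustrative assumptions.

metadata = {"train": 8559, "val": 300, "test": 982}


def split_sizes(meta: dict, final: bool) -> dict:
    """Return unique-image counts for the training and evaluation splits."""
    if final:
        # Final models: validation folded into training, evaluated on test.
        return {"num_train": meta["train"] + meta["val"], "num_val": meta["test"]}
    # Development: validation set played the role of the test set.
    return {"num_train": meta["train"], "num_val": meta["val"]}


if __name__ == "__main__":
    print(split_sizes(metadata, final=False))
    print(split_sizes(metadata, final=True))
    # Upper bound on training samples if every image had all 3 repeats.
    print(split_sizes(metadata, final=True)["num_train"] * 3)
```

The last line is only an upper bound: images were seen up to 3 times, so the actual number of non-averaged training samples may be lower.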

MedARC-AI / fMRI-reconstruction-NSD

On the dataset configuration #38