MedARC-AI / MindEyeV2


Unable to reproduce the data preprocessing #35

Closed Boltzmachine closed 2 months ago

Boltzmachine commented 2 months ago

I tried to modify dataset_creation.ipynb to reproduce the data preprocessing. Everything else (changemind, isold, iscorrect, etc.) matches the dataset you provided. However, the betas_all I obtain is different from the file you provided, i.e. betas_all_subj01_fp32_renorm.hdf5.

Below is the procedure my script uses to compute betas_all:

import time 
from subprocess import call
import numpy as np
import matplotlib.pyplot as plt
import h5py
import os
import copy
import pandas as pd

from tqdm import tqdm
from PIL import Image
from sklearn.preprocessing import StandardScaler

import webdataset as wds
import sys

# nsd_access is from this repo: https://github.com/tknapen/nsd_access
# also see https://cvnlab.slite.page/p/dC~rBTjqjb/How-to-get-the-data for how to download the NSD data!
from nsd_access import NSDAccess
nsd_path = '/gpfs/radev/pi/ying_rex/wq44/natural-scenes-dataset'
nsda = NSDAccess(nsd_path)

import nibabel as nib

import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("device:",device)

shared1000 = np.load("shared1000.npy")
for sub in [0]: #,1,2,3,4,5,6,7]:
    subject=f'subj0{sub+1}'
    subj=subject
    print(subject)

    abs_cnt = -1
    abs_notshared1000_cnt = -1
    abs_shared1000_cnt = -1

    # load coco 73k indices
    indices_path = "./data/COCO_73k_subj_indices.hdf5"
    hdf5_file = h5py.File(indices_path, "r")
    indices = hdf5_file[f"{subj}"][:]

    nsessions_allsubj=np.array([40, 40, 32, 30, 40, 32, 40, 30]) 
    nsessions = nsessions_allsubj[sub]
    ntrials = 750*nsessions

    file = f"./data/natural-scenes-dataset/nsddata/ppdata/{subject}/func1pt8mm/roi/nsdgeneral.nii.gz"
    nifti = nib.load(file) 
    mask = nifti.get_fdata()  # get_data() was removed in newer nibabel; get_fdata() is the replacement
    mask[mask<1] = 0 
    nsdgeneral_mask = mask

    for tar in tqdm(range(nsessions)):
        sess=tar+1

        behav = nsda.read_behavior(subject=subject, 
                    session_index=sess, 
                    trial_index=[]) 

        # pull single-trial betas and mask them
        betas = nsda.read_betas(subject=subject, 
                            session_index=sess, 
                            trial_index=[], # empty list as index means get all for this session
                            data_type='betas_fithrf_GLMdenoise_RR', # GLMSingle beta2
                            data_format='func1pt8mm') 

        # betas = betas[mask]
        betas = np.moveaxis(betas,-1,0)

        vox_include = copy.deepcopy(nsdgeneral_mask)
        # ncsnr = nib.load(f"/home/wq44/data/natural-scenes-dataset/nsddata_betas/ppdata/{subject}/func1pt8mm/betas_fithrf_GLMdenoise_RR/ncsnr.nii.gz").get_fdata()
        # ncsnr[ncsnr<.15] = np.nan 
        # if tar==0: print("voxels left:", len(vox_include[vox_include>0]))
        # vox_include[np.isnan(ncsnr)] -= 1 # keep all nsdgeneral voxels even if they are below the threshold
        # vox_include[vox_include<0] = 0
        # if tar==0: print("voxels left after ncsnr thresholding:", len(vox_include[vox_include>0])) # subj01 = 49329

        betas = betas.reshape(len(betas),-1)
        betas = betas[:,vox_include.flatten().astype(bool)]
        shape = betas.shape

        # trials of this session whose image is NOT in shared1000 (the held-out test images)
        train_index = []
        for i in range(len(behav)):
            coco73 = int(behav.iloc[i]['73KID']) - 1
            if shared1000[coco73]:
                continue
            train_index.append(i)
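        # NOTE: the mean/std below are fit per session, using only this session's non-shared1000 trials;
        # as discussed later in this thread, the provided betas use a single mean/std over ALL sessions.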
        scaler = StandardScaler(with_mean=True, with_std=True).fit(betas[train_index]) # YOU SHOULD EXCLUDE SHARED1000 FROM THIS (NOT DONE HERE BUT DONE IN ACTUAL MINDEYE2 PAPER)
        betas_mean = scaler.mean_
        betas_std = scaler.scale_
        betas = (betas - betas_mean) / betas_std
        betas = betas.reshape(shape).astype('float16') # (750, 15724) for subj01

        globals()[f'betas_ses{sess}'] = betas
        globals()[f'behav_ses{sess}'] = behav
        print(betas.shape)

    for tar in range(nsessions):
        sess=tar+1

        if sess==1:
            betas_all = globals()[f'betas_ses{sess}']
        else:
            betas_all = np.vstack((betas_all,globals()[f'betas_ses{sess}']))

Boltzmachine commented 2 months ago

Hi, just a follow-up on this issue. I'd really appreciate it if you could help with this!

PaulScotti commented 2 months ago

Hi, this code isn't actually what we used to produce the betas files in the Hugging Face repo. The betas we provide were created by coauthor Reese Kneeland, as described in the MindEye2 paper. This script was what we were initially using before Reese provided updated betas, which explains the commented-out note in the code saying "YOU SHOULD EXCLUDE SHARED1000 FROM THIS"; that limitation is why we switched to Reese's betas. Note that the betas obtained from this script and the ones provided by Reese led to nearly the same results, though. I'd recommend you just take the data directly from the Natural Scenes Dataset: https://natural-scenes-dataset.s3.amazonaws.com/index.html
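For example, something like this should pull individual files straight from the public NSD bucket with anonymous (unsigned) requests. This is just an untested sketch; the key shown is simply the nsdgeneral ROI path from your script, and the bucket index linked above lists everything else:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# anonymous client for the public natural-scenes-dataset bucket
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
bucket = "natural-scenes-dataset"

# example key: the nsdgeneral ROI mask used in the script above
key = "nsddata/ppdata/subj01/func1pt8mm/roi/nsdgeneral.nii.gz"
s3.download_file(bucket, key, "nsdgeneral.nii.gz")

# list what's available under a prefix, e.g. the 1.8mm betas for subj01
resp = s3.list_objects_v2(Bucket=bucket, Prefix="nsddata_betas/ppdata/subj01/func1pt8mm/")
for obj in resp.get("Contents", []):
    print(obj["Key"])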

beotborry commented 2 months ago

@PaulScotti Actually, I have the same issue. Could you elaborate a bit more on the data processing procedure?

I used betas_fithrf_GLMdenoise_RR with func1pt8mm and made sure the shared1000 data is not included in the calculation of each session's mean and std, but I still failed to reproduce the betas you provided.

I have succeeded in reproducing MindEye1's data, but the same approach does not seem to work for MindEye2.

beotborry commented 2 months ago

Update: I resolved the problem. In MindEye2, it seems that the mean and std are calculated over the entire training dataset (not session-wise).
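In case it helps anyone else, here is a minimal sketch of what I mean (my own reconstruction, not the repo's actual code): stack the masked, un-normalized per-session betas, fit a single mean/std on all non-shared1000 trials across sessions, and apply it to every trial.

import numpy as np
from sklearn.preprocessing import StandardScaler

def renorm_whole_dataset(betas_per_session, behav_per_session, shared1000):
    # betas_per_session: list of (750, n_voxels) arrays per session (masked, NOT yet z-scored)
    # behav_per_session: list of the matching behavior DataFrames from nsda.read_behavior
    betas_all = np.vstack(betas_per_session)                   # (n_trials_total, n_voxels)
    coco73 = np.concatenate([b['73KID'].to_numpy().astype(int) - 1
                             for b in behav_per_session])      # 73k image index of each trial
    train_mask = ~shared1000[coco73]                           # drop shared1000 (test) trials from the fit
    scaler = StandardScaler().fit(betas_all[train_mask])       # ONE mean/std for the whole subject
    return ((betas_all - scaler.mean_) / scaler.scale_).astype('float16')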

PaulScotti commented 2 months ago

Ah, yes, that would explain it. Sorry, I forgot about that difference!