Analyses of fMRI datasets

jeremymanning commented 7 years ago

Generate figures showing decoding accuracy as a function of "level" for several fMRI datasets. Suggestions:

Pie Man story listening dataset. Separately analyze three conditions (intact, paragraph, word, and rest) and make decoding-by-level figures for each condition. In our prior work we've found that decoding accuracy decreases as you move through those four conditions-- intact > paragraph > word > rest. (Details and intuitions are in the linked-to-paper.)
Sherlock movie watching dataset. Show decoding accuracy by level.
Raiders of the Lost Ark movie watching dataset. Show decoding accuracy by level.

Here's what I'm thinking for the figures 1) Start with the raw data and split the data (or for Pie Man, the data in each condition) into two groups. Compute the average within groups, and then use correlation-based decoding (like in timepoint_decoder) to compute decoding accuracy. Repeat 100 times (with a different random group assignment each time) to get a distribution of decoding accuracies. (This will be the "level 0" decoding accuracy.) 2) Now split the data into 2 groups and compute the "across" timecorr for each group. Use one group to decode the other (using correlation-based decoding) and compute the decoding accuracy. Repeat 100 times (with different group assignments each time) to get a distribution of decoding accuracies. (This will be the "level 1" decoding accuracy.) 3) Also compute the "within" timecorr for each individual subject. Now use hyp.tools.reduce to reduce the "level 1" data down to v voxels (where v is a parameter we will probably want to explore...but I'd start with something like 1000). Then split the subjects into two groups and use "across" timecorr on the level 1 data to get "level 2" decoding accuracy (again, use 100 random group assignments to get a distribution of accuracies). 4) For each subsequent level, L you can keep track of the "within" timecorr values for level L-1, and then to get the level L decoding accuracies use "across" timecorr for two randomly assigned groups. To level up, compute the "within" timecorr values for level L (using the L-1 "within" timecorr values), and then reduce to v features using hyp.tools.reduce). 5) I'd like to see decoding accuracy (with error bars) as a function of level, for levels 0 through 10.

Some key questions:

How does decoding accuracy change with level? Does this depend on our choice of v?
Is the pattern we observe consistent across these different datasets?
Using the Pie Man dataset to explore, does the pattern we observe change with the level of cognitive salience (i.e. the richness of the input stimulus-- intact > paragraph > word > rest)?

jeremymanning commented 7 years ago

I'm also flagging an alternative version of this analysis here for possible exploration, as a different way of exploring decoding accuracy by level. It could be that, even if data at higher levels didn't by itself lead to better decoding accuracy, mixing in information at higher levels could still improve the decoding. In other words, a mix of level 0 and level 1 might be better than level 0 alone, even if level 0 out-performs level 1.

To test this more formally, I suggest the following:

1) split the dataset into 2 parts, A and B. further subdivide each of those into two-- so we have A1, A2, B1, and B2. compute the level 0, 1, 2, ..., 10 representations of the data for each of those 4 subset of the data (using the "across" version). So for each subset, for each level, we should have one timepoints-by-features matrix. 2) using A1 and A2, find the vector of mixing proportions that maximizes decoding accuracy (in other words, find the mix of level 0, level 1, level 2, etc. produces the best decoding accuracy). 3) using this mixing proportion from A1 and A2, compute decoding accuracy for B1/B2. 4) repeat the above 100 times (with different random assignments of A1/A2/B1/B2 each time) to get a distribution of decoding accuracies.

Then plot:

the average optimal mixing proportions across those 100 runs (as a bar graph or violin plot)
the average decoding accuracy for B1/B2 across those 100 runs, plotted as a bar graph or violin plot. Also compare the decoding accuracy to the decoding accuracy obtained using only level 0 data.

jeremymanning commented 7 years ago

If you download and install the brain-dynamics-model package, you can use the following commands to load in a .nii file:

from loadnii import loadnii as ln data = ln(fname)

now data.Y is a number-of-timepoints by number-of-voxels matrix and data.R is a number-of-voxels by 3 matrix of voxel locations.

to verify that you’ve done things correctly you can plot the data with hypertools using:

import hypertools as hyp hyp.plot(data.Y) <— this should look like a scribble in 3d hyp.plot(data.R, 'k.') <— this should look like a brain (a bunch of black dots in 3d

To facilitate these "real data" analyses, I'd recommend that you: 1) download each dataset into a separate folder on discovery and organizing the data so that there's one .nii or .nii.gz file per subject (for the "Pie Man" dataset, you'll want a separate folder for each of the 4 conditions: pieman-intact, pieman-paragraph, pieman-word, and pieman-rest) 2) write a load_fmri_data function that takes in a folder of .nii or .nii.gz files and returns a list of numpy arrays with the data-- e.g. data = load_fmri_data(folder). 3) also write a decoding analysis function that takes in a folder, an ndims argument, an nlevels argument, and an ngroups argument and then:

loads in the data using load_fmri_data
reduces the data to ndims dimensions using hyp.tools.reduce
performs the timepoint decoding analysis (using ngroups cross validation folds; default: 2) on each of the nlevels levels (where level 0 is the "raw" reduced data, level 1 is levelup(data), etc.). 4) write a cluster script for discovery or ndoli that takes in a folder and then submits the decoding_analysis job 100 times (with new random group assignments each time) and saves out each set of results 5) write a function to collect all of the results and generate plots

Then you'll have your pipeline for quickly analyzing all of the fMRI datasets mentioned above.

TomHaoChang commented 7 years ago

To clarify, should I use ndoli or discovery for this analysis?

In addition, I think there's a memory quota on my account and I was not able to fully download the new_pieman dataset onto my account. Is it possible to organise these folders in another location on the cluster for easy access?

Are the three datasets you mentioned above sufficient? Or do we need to run more datasets?

Regarding the alternative analysis, could you clarify what you mean by "mixing"? Are we appending the information from one level onto another? In addition, I did not quite understand the reasoning behind dividing the dataset into four groups. To get a distribution of decoding accuracies with respect to level mixture, isn't it sufficient to test on two groups? Also, this idea of mixing reminds me of the skip connection method used in this Deep Residual Network paper: https://arxiv.org/abs/1512.03385. If that's the case, will mixture of 2 levels be sufficient? Or should we mix more than 2 levels? Or, to put it to the extreme, a mixture of all the levels? I feel like this idea needs to be explored on the synthetic dataset first to determine the actual benefits before we move onto real datasets, or else it's really hard to justify the results.

In addition, another question that has been bugging me is: is higher decoding accuracy always better? For synthetic datasets with no noise, higher decoding accuracy is indeed better, but what about datasets with lots of noise? I thought that timecorr was initially invented to differentiate timepoints with little noise from timepoint with a lot of noise so that we are able to better identify data that are relevant to the study at hand. We were able to establish that timecorr has the ability to identify datasets with low noise from datasets with high noise from our levelup tests. However, the problem with decoding analysis is that it looks at the dataset as a whole but does not focus on individual timepoints. So when moving on to real datasets to achieve actual results, should we be focusing on analysing the decoding accuracy of each dataset as a whole and make an argument on the quality of the datasets? Or should we try to find out which of the timepoints in the dataset are more task-relevant?

TomHaoChang commented 7 years ago

Also, I think it might be a good idea to separate load_fmri_data and decoding analysis into two separate steps. Loading in the data from nii format takes up a lot of time and memory, doing that simutaneously from 100 processes will be a nightmare...

Would it be better to load all the nii files, reduce into a list of numpy arrays and then store into a file to be read by decoding analysis? I think this would speed up the process by a lot...Or was the original intention to make this process available as a tool to the user too?

In addition, it would also save a lot of time if we do levelup through the levels once and save the activations at all the levels. Decoding analysis can then read in the activations and do random grouping and decoding. This would also speed up the process tremendously

TomHaoChang commented 7 years ago

Another problem is the file format to which to save the list of numpy matrices. What would be the best format to do this?

In addition, is it essential to maintain the list of numpy matrices format or could I convert the data to a 3-dimensional numpy array as it would make the entire process a lot easier...

TomHaoChang commented 7 years ago

I just found another problem with the reduce method in hypertools: repeated reduction of the same matrix does not yield consistent results. Using the following code:

import numpy as np import hypertools as hyp data = np.random.normal(0,1,[300,1000]) t1 = hyp.tools.reduce(data,100) t2 = hyp.tools.reduce(data,100) print(np.sum(np.absolute(t1-t2))

The output gives a very large number when it should be 0

I finished coding up the decoding process and tested on four subjects with identical nii files, however the decoding accuracy quickly deteriorated to 0 as the level increased, which is not consistent with our synthetic dataset testing results. After debugging, I discovered that this discrepancy came from the reduce method in hypertools. Every time reduce is called, the difference between subject activations is further increased. Thus as we level up, the activations of identical subjects become completely different and the decoding accuracy quickly goes to 0.

jeremymanning commented 7 years ago

To clarify, should I use ndoli or discovery for this analysis? Yes (either one). ndoli is good for running big interactive jobs and discovery is good for submitting many jobs in parallel.

In addition, I think there's a memory quota on my account and I was not able to fully download the new_pieman dataset onto my account. Is it possible to organise these folders in another location on the cluster for easy access?

Your home directory has a quota, and I don't see anything in your lab directory (/idata/cdl/thc/). In any case, these datasets are already on Discovery:

Pieman is at /idata/cdl/data/fMRI/pieman/npy_files. Each subject has a .npy file. To load in the timeseries for one subject (filename: fname.npy):
```
import numpy as np
data = np.array(np.load('fname.npy').T)
```
Sherlock is at /idata/cdl/data/fMRI/sherlock/npy_files. Each subject has a .npy file.
Raiders is at /idata/cdl/data/fMRI/Raiders. It's just the raw download from the GitHub repository for that dataset, so it needs some organizing (you should look at the documentation for that dataset to figure out which files you need, and create a new folder with the organized data-- one data per subject, either as a .nii file, or .npy file in the same format as the Pieman and Sherlock .npy files.

Are the three datasets you mentioned above sufficient? Or do we need to run more datasets?

It's up to you! I'd say looking at three datasets would be reasonable for your thesis.

Regarding the alternative analysis, could you clarify what you mean by "mixing"? Sure; we did a mixing analysis in this paper and I'm suggesting something similar here. For each type of feature (in this case, level), you compute the per-timepoint correlations between the two groups (this yields a timepoint-by-timepoint correlation matrix for each type of feature). Then you create a weighted average of those correlation matrices: 0) Create your vector of positive weights for each level, w (np.sum(w) should be 1). 1) Apply the z-Transform to the correlation matrices 2) Compute the weighted sum of the correlation matrices-- if you have all of the correlation matrices in an array, such that corrmat[n] is the correlation matrix for level n:
zsum = np.zeros_like(corrmat[0])
for n in range(len(w)):
zsum += w[n] * corrmat[n]
3) Apply the inverse z-transform to zsum. Use this to match up timepoints across the two groups. I'm suggesting that you use cross validation (with simple gradient descent) to find an optimal setting of w. Your suggestion to use Deep Residual Networks is interesting, but unless you think you could implement it as quickly as the gradient descent version, I think it'll be a distraction from the main thesis goals.

In addition, another question that has been bugging me is: is higher decoding accuracy always better? For synthetic datasets with no noise, higher decoding accuracy is indeed better, but what about datasets with lots of noise? I thought that timecorr was initially invented to differentiate timepoints with little noise from timepoint with a lot of noise so that we are able to better identify data that are relevant to the study at hand. We were able to establish that timecorr has the ability to identify datasets with low noise from datasets with high noise from our levelup tests. However, the problem with decoding analysis is that it looks at the dataset as a whole but does not focus on individual timepoints. So when moving on to real datasets to achieve actual results, should we be focusing on analysing the decoding accuracy of each dataset as a whole and make an argument on the quality of the datasets? Or should we try to find out which of the timepoints in the dataset are more task-relevant?

Timecorr was designed, not as a decoding tool, but as a means of extracting moment-by-moment correlations. The fact that we're using it for decoding reflects a need to evaluate the quality of timecorr's recovered correlations in real data, without having a ground truth like we have for synthetic data. I'm not sure I understand your point about the amount of "noise" in the data. If the data are noisy we won't be able to decode well, but we still need a way of evaluating our algorithm. And prior work has shown that you can decode well for those 3 datasets I've suggested examining.

Also, I think it might be a good idea to separate load_fmri_data and decoding analysis into two separate steps. Loading in the data from nii format takes up a lot of time and memory, doing that simutaneously from 100 processes will be a nightmare...

None of the datasets is too big to fit in memory-- I've taken a similar approach to what I've suggested, even with Pieman (the largest dataset). Ultimately you just need to store the timepoints-by-voxels matrices (or in your case the reduced timepoints-by-voxels matrices) in memory. The loading from disk only happens once per process, so I don't think the "nightmare" scenario you're worried about will actually be that bad...it has worked well for me!

Would it be better to load all the nii files, reduce into a list of numpy arrays and then store into a file to be read by decoding analysis? I think this would speed up the process by a lot...Or was the original intention to make this process available as a tool to the user too? Yes, given that we're aiming to share this toolbox someday, you need to write your code in a way that can be easily used by others. But I don't see why what you're suggesting means it wouldn't be useable...?

In addition, it would also save a lot of time if we do levelup through the levels once and save the activations at all the levels. Decoding analysis can then read in the activations and do random grouping and decoding. This would also speed up the process tremendously

Aren't you doing this already!? You should save the data at each level, and then apply levelup to the previous level (rather than recomputing from scratch). Computing each level from scratch is unbelievably inefficient!

I just found another problem with the reduce method in hypertools: repeated reduction of the same matrix does not yield consistent results.

The default dimensionality reduction algorithm in hypertools is PCA, which (given the algorithm we're using) isn't guaranteed to always return the same answer on the same data. Specifically, when the dataset you're inputting is large, sklearn defaults to a solver that has a random component (more details here).

That being said, I'm not sure why this is a new problem. For example, with your noiseless synthetic data, you found perfect decoding accuracy from levels 0 -- 10. (E.g. given what you're saying now, I'm confused about how that was possible!) Maybe the noiseless synthetic data would be a good place to start exploring what's going on.

TomHaoChang commented 7 years ago

The levelup function was working well previously because the number of voxels was really small, so the PCA reduction process was very stable across levels. The inconsistency only occurs when the voxel number becomes big, which is why I only discovered this problem now. Currently, I am unsure how to resolve this problem...

Regarding data read-in and levelup storage. Storing activations data for each level requires that we read in the data files and generate the activations for each level. My recommendation is to read in all the data and generate all level activations once, and then run decoding 100 repetitions on it. There is no need to read in the nii data for every repetition if we have already generated the activations for each level, which is why I suggested separating load_frmi_data and decoding analysis into two separate steps. What do you think?

jeremymanning commented 7 years ago

The levelup function was working well previously because the number of voxels was really small, so the PCA reduction process was very stable across levels. The inconsistency only occurs when the voxel number becomes big, which is why I only discovered this problem now. Currently, I am unsure how to resolve this problem...

Interesting...and tricky! I think there are two issues. First, the levelup function takes in the reduced features from the previous level and computes moment by moment correlations. I believe these correlations should be stable across different runs of PCA, but this would be worth verifying. In other words, what changes across iterations of PCA is that the component space may be flipped along some random combination of axes in any given run-- but the moment-by-moment correlations shouldn't change (I don't think) even if the components themselves are flipped.

The second issue is that in our current design we are using the reduced-dimensionality versions of the moment-by-moment correlations to decode, rather than the correlations themselves. So if the reduced dimensionality correlations are flipped along a random set of axes during the reduction, there are no guarantees that we'll get the same answer for the two groups we're trying to decode across. To fix this, I propose changing the levelup function slightly. Rather than taking in a timepoints-by-features matrix and returning a "leveled up" timepoints-by-features matrix, levelup could instead take in the moment-by-moment correlation matrix and return a "leveled up" moment-by-moment correlation matrix. In other words, something like:

def levelup(corrs0, var=None, mode="within"):
    v = 0.5 * (sqrt(8 * corrs0[0].shape[1] + 1) - 1) #number of features
    x = hyp.tools.reduce(corrs0, ndims=v)
    return timecorr(x, var=var, mode=mode)

I think this should be more stable, although you'll need to re-run the synthetic data analyses to verify this.

Regarding data read-in and levelup storage. Storing activations data for each level requires that we read in the data files and generate the activations for each level. My recommendation is to read in all the data and generate all level activations once, and then run decoding 100 repetitions on it. There is no need to read in the nii data for every repetition if we have already generated the activations for each level, which is why I suggested separating load_frmi_data and decoding analysis into two separate steps. What do you think?

This is a really good point, and this design sounds good to me.

TomHaoChang commented 7 years ago

Okay, the change to timecorr took me a while to process, but I think I understand the reasoning behind it now. However, in order to implement this change, I will need to change the timecorr operation in decode() to correlation averaging within group, because the function will be taking moment-by-moment correlation matrix instead of activations matrix. I will also need to redo some of the tests....

Are you free to meet tomorrow to talk about this process in more detail? Preferably sometime in the afternoon...

TomHaoChang commented 7 years ago

Actually, levelup has to return timepoints-by-features matrix in order for decode() to perform the isfc process...unless I am missing something

In addition, decode() actually divides the subject into two groups, processes each group using "across" timecorr and then performs decoding analysis. So in essence it's using moment-by-moment correlation matrix to perform decoding and not timepoints-by-features matrix, so I am not sure if the solution you proposed can resolve the current instability...

TomHaoChang commented 7 years ago

Sorry it's a bit late and I wasn't thinking clearly, I think your solution might actually work. However, there's no guarantee that v = 0.5 * (sqrt(8 * corrs0[0].shape[1] + 1) - 1) would return an integer and that (v**2-v)/2 would be equal to corrs0[0].shape[1] after we round v to an integer. This might introduce inconsistency in matrix dimensions.

Should I preprocess the voxel number that the user specifies to reduce to through the following: ''' v = int(0.5 (sqrt(8 voxel_num + 1) - 1)) new_voxel_num = (v**2-v)/2 ''' This will make sure that all level ups will result in moment-by-moment correlations with the same dimensions

jeremymanning commented 7 years ago

Okay, the change to timecorr took me a while to process, but I think I understand the reasoning behind it now. However, in order to implement this change, I will need to change the timecorr operation in decode() to correlation averaging within group, because the function will be taking moment-by-moment correlation matrix instead of activations matrix. I will also need to redo some of the tests....

I'm not proposing a change to timecorr or decode-- I'm proposing a change to levelup. Not sure what you're referring to here...you shouldn't have to change decode at all.

Are you free to meet tomorrow to talk about this process in more detail? Preferably sometime in the afternoon...

In general: I nearly always need more than a few hours notice to meet, unless a block of time happens to open up in my schedule (e.g. cancelled meeting) as it did last week. I've been encouraging you to schedule a weekly meeting with me. To clarify what I've been asking, it should be a recurring weekly meeting that's scheduled on the CDL calendar, in consultation with me, well in advance of our first meeting. My schedule today (Wednesday...I assume you didn't mean Thursday since you sent your message shortly after midnight?) is already full, so unfortunately we won't be able to meet this week. I'll also be away (out of town) next week, so we can't meet in person next week either. If you check in via Slack we can set up a time to Skype.

Actually, levelup has to return timepoints-by-features matrix in order for decode() to perform the isfc process...unless I am missing something

Yes, that's correct. I'm not sure if you're missing something...but it sounds like you might be...?

In addition, decode() actually divides the subject into two groups, processes each group using "across" timecorr and then performs decoding analysis. So in essence it's using moment-by-moment correlation matrix to perform decoding and not timepoints-by-features matrix, so I am not sure if the solution you proposed can resolve the current instability...

Yes, it's using the correlation matrix to do the decoding. I strongly suggest that you try out what I'm describing to see if it's empirically true-- but my intuition is that the correlation matrix will be stable even if the features are not. The key (as I'm thinking of it) is that even if the answers for any given run of PCA are flipped along some axis (or set of axes), the correlations between those PCA-reduced observations should still be preserved. For example, if you multiply two time series by -1, the correlations between will be unchanged.

Sorry it's a bit late and I wasn't thinking clearly, I think your solution might actually work. However, there's no guarantee that v = 0.5 (sqrt(8 corrs0[0].shape[1] + 1) - 1) would return an integer and that (v**2-v)/2 would be equal to corrs0[0].shape[1] after we round v to an integer. This might introduce inconsistency in matrix dimensions.

First, it looks like a + 1 got lost in my proposed formula-- it should be v = (0.5 * (sqrt(8 * x + 1) - 1) + 1). You're right that this formula doesn't necessarily return an integer-- but it does return a positive integer for any positive integer passed through (v^2 - v)/2 (since it's just the inverse of that function). So if people are using timecorr and levelup in the intended way, they should get consistent matrix dimensions. To prevent "unintended use," I suggest that we throw an error if v is not an integer (i.e. in Python land, if np.mod(v, 1) > tol, where tol is some tiny number like 10^-15). We should also clearly note in the documentation that levelup takes in a number-of-timepoints by (v^2-v)/2 dynamic correlations matrix (or a list of such matrices) and returns a matrix (or matrices) of the same size(s).

TomHaoChang commented 7 years ago

I think the focus of my problem is the dimensions of the output of the levelup function. Right now there are two scenarios:

We set v = 100 and levelup outputs subjects-by-timepoints-by-(v^2-v)/2 dimensional matrix. If we pass this output into "across" timecorr, the runtime will be incredibly high as there will be (100^2-100)/2=5050 features.
We set the output of the levelup function to have dimensions (v^2-v)/2 = 100, which would guarantee low runtime on "across" timecorr. However, this requires preprocessing of v as the equation above may not yield an integer v.

I am leaning toward the second scenario, but there might be loss of information as we are reducing features to a low dimension before calculating isfc. Testing the first scenario might take days...and would require using the cluster computers...I am not sure we want to include such computing-intensive functions in our toolbox

I think the optimal scenario is to get PCA working but it looks like that is not possible. Is there any middle ground we can take?

jeremymanning commented 7 years ago

Hi @TomHaoChang: first, you can use incremental PCA to solve out that dimensionality reduction quickly for big datasets. I believe @andrewheusser is working on incorporating that into hypertools [link]. To use that algorithm within the hypertools framework you'll need to check out the reduction-models branch.

But more generally, I'm still not sure you're understanding what I proposed...or perhaps I'm not understanding your concern correctly. To explain further, the number of features that get passed to timecorr is 100 in your example. All that changes with my new proposed levelup function is how many features get passed to levelup:

in the "old" (current) setup, levelup gets a T by 100 matrix and returns a T by 100 matrix. internally, levelup calls timecorr to get a T by 5050 matrix, and then reduces that (using PCA) back to a T by 100 matrix.
in the "new" setup I'm proposing, you need to pass the raw T by 100 matrix through timecorr to get a T by 5050 matrix. You pass this T by 5050 matrix to levelup. Inside levelup, you first reduce the T by 5050 matrix to a T by 100 matrix using PCA. Then you pass the T by 100 matrix through timecorr, yielding a T by 5050 matrix, which is returned by levelup.

Neither case has a T by 5050 matrix being passed to timecorr, and the dimensionality of the matrices being reduced using PCA isn't different from our current ("old") implementation.

TomHaoChang commented 7 years ago

Hi Professor Manning, thank you very much for clarifying! I think I understand your levelup proposal. My main concern is its compatibility with decode(). Eventually, after we get the outputs from levelup, we need to pass these outputs into decode() to get the decoding accuracy at that level. In the decode() function the "across" timecorr (isfc) is applied on the T by 5050 matrix, which will take a huge amount of processing power, memory and runtime....

I will look into incremental-pca

jeremymanning commented 7 years ago

I don't think it'll be as bad as you're worried about-- it takes O(F*(T^2)) time to run the decoding, so the entire decode process is linear with respect to the number of features. Increasing the number of timepoints is what really increases the run time. I've decoded using this method 30K features and it doesn't take too long (maybe an hour or so).

Another note: these sorts of discussions are what I was hoping you would be able to tell me about using your simulations and benchmarks with synthetic data! So if you're still concerned about run times, you should do more benchmarking to show whether your code scales to the data sets sizes we'll want to run. Then, if those benchmarking runs don't convince you that your code is going to scale to the analyses we'll be attempting next, then more optimizing is needed.

TomHaoChang commented 7 years ago

Hi Professor Manning, referencing our results from the benchmark thread, the runtime of isfc (which is employed by decoding analysis) scales exponentially with the number of features:

But I will try it out and let you know

jeremymanning commented 7 years ago

That graph was from timecorr, not decode...right? I was referring to decode, which was (I thought) what you were concerned about. Perhaps I misunderstood?

TomHaoChang commented 7 years ago

Yep! This graph is from "across" timecorr (isfc). However, decode uses isfc as part of the function...At least that's what I thought it's supposed to do from our previous discussions...

jeremymanning commented 7 years ago

decode should not be running ISFC-- that's run within timecorr (using mode='across') and/or levelup prior to passing anything to decode. decode should just take two matrices (one to-be-labeled matrix and one template matrix) as inputs and return an accuracy value.

TomHaoChang commented 7 years ago

But doesn't decode divide the subject into two groups, apply isfc on each group and then decode on the resulting isfc's?

jeremymanning commented 7 years ago

no:

that's run within timecorr (using mode='across') and/or levelup prior to passing anything to decode. decode should just take two matrices (one to-be-labeled matrix and one template matrix) as inputs and return an accuracy value.

TomHaoChang commented 7 years ago

So we want to do group division before decode:

Divide subjets into two groups
Pass the activations for each group into "across" levelup separately to obtain isfc for each group
Pass the isfc into decode to obtain the decoding accuracy
Pass the previous activations into "within" levelup to obtain activations for next level?

With this implementation we will still be using output from hypertools.tools.reduce for our isfc calculation, which is the same as our currently implementation.

If we separate out the isfc process from levelup, we will be passing the output of levelup (T by 5050) into isfc before passing into decode(), which will cause the exponential increase in runtime...

Am I missing something?

jeremymanning commented 7 years ago

There are three main issues we've been discussing in this thread. Here are my thoughts: 1) Will the outputs of levelup be stable across groups?

This was a good observation and one worth addressing
I think the new proposed design of levelup addresses this concern and should result in stable features
You still need to confirm that this is the case, using synthetic data with the new version of levelup 2) What should be included in the decode function?
The way I was thinking of decode in my most recent comments is that it'd take in two timepoints by features matrices of the same size and then use one to decode the other. That requires having a wrapper function that takes the features from each subject, divides the subjects into two groups, runs timecorr with mode='across' (i.e. continuous ISFC) for each group, and then passes those two ISFC matrices to decode.
An alternative design of decode is to call that entire wrapper function decode. In looking back over our notes, it seems like that's how we had discussed it previously, and now that I've refreshed my memory, I think it was a reasonable design...so let's stick with the old version.
Both implementations have the same basic sequences of steps-- so it's just a matter of what we call decode. In any case, the conclusion of this issue is: let's stick with the current version of decode. 3) Does the decoding pipeline scale to the data set sizes we will want to run?
This is an empirical question: you need to actually try out the analysis on either the data we want to use, or synthetic data of the same size.
One approach: preprocess the fMRI data by using hyp.tools.reduce to reduce the data to something like 100 features (and then treat those as the "level 0" features). Your explorations of synthetic data show that it'll take under 100 seconds to run the largest dataset (Pieman, which has around 30-40 subjects per condition...and you should run all of the conditions separately) through timecorr each time, which is the most computationally intensive piece of the analysis pipeline. So to run the dataset through timecorr 10 times (once per level beyond 0) should take around 1000 seconds (under 20 minutes).
- I think it would be worth exploring how far you can push the number of features-- e.g. 200 features would be substantially better than 100, and if we could get 500 to work, that would bring us into a much more reasonable range.
If you run the analysis in serial, it will indeed take an unreasonably long time to run. But as you've pointed out previously, you can compute the features for each level (~20 minutes if you use 100 features) and then in parallel (e.g. farming the analysis out to many nodes on discovery or ndoli) you can run those features through decode (with a different random group assignment each time)

TomHaoChang commented 7 years ago

Hi Professor Manning! Think you for the detailed analysis of the issues! I think I have a clearer understanding of the overall procedures now.

I have tested and implemented the Incremental PCA tool from sklearn and it resolved all of our problems! By replacing hypertools.tool.reduce with IncrementalPCA, I was able to achieve consistent and stable levelup and decoding analysis results that matches our previous testing and benchmarking without modifying the original steps for both levelup and decoding process. For 10-level decoding analysis of 4 subjects with identical data, I was able to get 100% decoding accuracy across all levels.

I will rerun the tests we had for levelup to see if replacing hypertools reduce with IncrementalPCA changed anything on data with noise.

As for scaling, the performance of our process will be very similar to the 10 subject case, because we are dividing the subjects into two groups which will each contain 10-15 subjects. I am currently running tests with 200 and 500 voxels to see how the code runs

jeremymanning commented 7 years ago

I have tested and implemented the Incremental PCA tool from sklearn and it resolved all of our problems!

This is a very strong statement! Can you describe further how you determined that the issues you described above are no longer a factor now that you have switched to IPCA rather than PCA? How extensively have you re-tested the new code? For example, some aspect of the prior implementation caused you to be concerned that PCA would return incompatible answers for real data. Can you explain further how this is no longer a concern with IPCA?

By replacing hypertools.tool.reduce with IncrementalPCA, I was able to achieve consistent and stable levelup and decoding analysis results that matches our previous testing and benchmarking without modifying the original steps for both levelup and decoding process.

The hypertools reduce function has a very specific way of implementing its dimensionality reduction for multi-subject data. Have you carefully replicated the hypertools procedure in your code? The key elements of the hypertools implementation are that it: 1) Stacks data from the different subjects into a single matrix 2) Performs dimensionality reduction on the stacked matrix 3) Breaks down the reduced stacked matrix back into matrices for each subject

Critically, I want to verify that you're not reducing the data independently from different subjects-- otherwise the features (dimensions) will not align across subjects.

For 10-level decoding analysis of 4 subjects with identical data, I was able to get 100% decoding accuracy across all levels.

This was also the case for PCA, so this isn't (in itself) indicative of anything...

I will rerun the tests we had for levelup to see if replacing hypertools reduce with IncrementalPCA changed anything on data with noise.

This sounds good-- but please verify that you've implemented the non-hypertools IPCA dimensionality reduction using the same across-subjects approach that hypertools.tools.reduce uses. And/or verify that you're using the hypertools implementation of IPCA.

As for scaling, the performance of our process will be very similar to the 10 subject case, because we are dividing the subjects into two groups which will each contain 10-15 subjects. I am currently running tests with 200 and 500 voxels to see how the code runs

Is this a change from the old code, or is this a new features now that you've switched to IPCA?

General comments:

I am concerned that you are now 8 days behind schedule. You seem to be getting distracted by these concerns that the decoding won't work on real data or that the code won't scale to real data. But you recently finished running simulations using synthetic data that were intended to help us explore exactly these issues. So I strongly encourage you to actually try out your old (tested) code on real data as soon as possible (by the end of this week at the latest). If some aspect of our approach doesn't seem to be working we can address that, but these un-tested hypothetical concerns you've been bringing up over the past few days hold little weight for me. If you want to make an argument that something should be changed, you should back it up with a figure or analysis.
Meanwhile, you now made a convincing case that the decoding should be done on the connectivity matrices rather than on the reduced connectivity matrices, which somewhat changes how decode and levelup should be set up. Given how much you've now thought of this issue, I encourage you to explore it further (within limits) for your thesis. I also don't see why IPCA should fix the concerns that you brought up (it's just a batch version of PCA, not a magical fix for everything). So to help us move forward in the most informed way, I think you should to spend at most one more day making a reasoned case (with evidence) for either:
- decoding using reduced features. you can take one of the following approaches, or propose another:
- continue using our original design (reduce with PCA, then decode using the PCA-reduced connectivity features from each level). In other words, show why the concerns you've been raising over the past few days about stability and scalability won't be a problem for real data.
- switch to using IPCA. if so, you need to verify the details of your implementation as described above and re-run all decoding tests with synthetic data. You also still need to show why the concerns you've been raising over the past few days about stability and scalability won't be a problem for real data-- e.g. that, whereas PCA might be susceptible to these issues, IPCA is not.
- decoding using connectivity matrices. you need to show that the connectivity matrices are more stable across groups than reduced connectivity matrices (e.g. by showing improved decoding performance on the same data).

TomHaoChang commented 7 years ago

The hypertools reduce function has a very specific way of implementing its dimensionality reduction for multi-subject data. Have you carefully replicated the hypertools procedure in your code? The key elements of the hypertools implementation are that it:

This sounds good-- but please verify that you've implemented the non-hypertools IPCA dimensionality reduction using the same across-subjects approach that hypertools.tools.reduce uses. And/or verify that you're using the hypertools implementation of IPCA.

I actually was not aware of the stacking implemented by hypertools, so I did not implement multi-subject IPCA correctly...I actually ran 100 repetitions of level up tests overnight, but the decoding accuracy decreased a lot faster than the hypertools.tools.reduce code from before. I think this was mainly caused by my lack of stacking for IPCA, so reduced activations for subject data with noise did not align.

Has IncrementalPCA been implemented on hypertools? If so, I'd love to use it for my code. If not, I can implement this in my code

This is a very strong statement! Can you describe further how you determined that the issues you described above are no longer a factor now that you have switched to IPCA rather than PCA? How extensively have you re-tested the new code? For example, some aspect of the prior implementation caused you to be concerned that PCA would return incompatible answers for real data. Can you explain further how this is no longer a concern with IPCA?

This was also the case for PCA, so this isn't (in itself) indicative of anything...

The main issues that started this conversation is that hypertools.tools.reduce could not consistently reduce large matrices to the same result across different repetitions, which cause alignment issues for isfc and thus caused the decoding accuracy to quickly deteriorate with each levelup. The incrementalPCA however, is able to consistently reduce large datasets to the same result across across repetitions, and thus effectively advert this issue. With IncrementalPCA, our code should back on track.

One example to illustrate this effect is: I randomly chose a subject from the pie_man dataset and copied the data 3 times to produce 4 subjects with identical data and no noise. Using hypertool.tools.reduce implmentation from before, the decoding accuracy quickly deteriorated to 0 after 2-3 levels because of alignment issues. However, with IPCA the decoding accuracy remained 1 across 10 levels of levelup, thus proving that IPCA is able to effectively resolve the alignment issue. I think we should move forward with IPCA

Is this a change from the old code, or is this a new features now that you've switched to IPCA?

I think this is a feature that came with the old version of decode(). When we are doing decoding analysis, we divide the subjects into two groups and find the isfc in each group. thus isfc actually is only performed on half the number of total subjects = 30/2=15

I am concerned that you are now 8 days behind schedule. You seem to be getting distracted by these concerns that the decoding won't work on real data or that the code won't scale to real data. But you recently finished running simulations using synthetic data that were intended to help us explore exactly these issues. So I strongly encourage you to actually try out your old (tested) code on real data as soon as possible (by the end of this week at the latest). If some aspect of our approach doesn't seem to be working we can address that, but these un-tested hypothetical concerns you've been bringing up over the past few days hold little weight for me. If you want to make an argument that something should be changed, you should back it up with a figure or analysis.

Yes, the delay was completely unexpected. However, if we can successfully implement IPCA and avert the inconsistency in levelup, we will be back on track again. I have the rest of the code down, just need to submit the jobs on discovery.

Meanwhile, you now made a convincing case that the decoding should be done on the connectivity matrices rather than on the reduced connectivity matrices, which somewhat changes how decode and levelup should be set up. Given how much you've now thought of this issue, I encourage you to explore it further (within limits) for your thesis. I also don't see why IPCA should fix the concerns that you brought up (it's just a batch version of PCA, not a magical fix for everything). So to help us move forward in the most informed way, I think you should to spend at most one more day making a reasoned case (with evidence) for either...

Of the two solutions, I think implementing IPCA is a quicker, simpler and more surefire way to moveforward. So I am leaning toward that.

jeremymanning commented 7 years ago

OK-- so it sounds like you should move forward with IPCA. There's a hypertools implementation in this branch but it has not yet been integrated into master (it will be relatively soon, but isn't ready yet). Once you check out that code I think you can reduce the data with IPCA using: hyp.tools.reduce(data, ndims=v, model='IncrementalPCA').

You should make this switch to your code and then re-run the decoding analysis (for levels 0 -- 10, with 100 repetitions each) using synthetic data with different amounts of noise.

Once you verify that it behaves reasonably with synthetic data, you could switch to the analyses of real data.

TomHaoChang commented 7 years ago

I actually implemented stack in my own levelup function. Testing it on 1 repetition of the synthetic dataset gave really good results. Right now, I am doing a 100 repetition run to see how it performs generally. Will post results as soon as I am done

TomHaoChang commented 7 years ago

pieman decoding analysis:

sherlock decoding analysis:

forrest decoding analysis:

TomHaoChang commented 7 years ago

Hi Professor Manning, what do you think of the graphs for intra-level decoding analysis I posted above?

jeremymanning commented 7 years ago

Aren't those incorrect, per your post this morning?

TomHaoChang commented 7 years ago

These figures are from decoding analysis done on each level independently without any mixture, so the problem I had did not affect them. I think this analysis is mentioned in your first post, before the mixture analysis instructions:

Generate figures showing decoding accuracy as a function of "level" for several fMRI datasets. Suggestions:

Pie Man story listening dataset. Separately analyze three conditions (intact, paragraph, word, and rest) and make decoding-by-level figures for each condition. In our prior work we've found that decoding accuracy decreases as you move through those four conditions-- intact > paragraph > word > rest. (Details and intuitions are in the linked-to-paper.) Sherlock movie watching dataset. Show decoding accuracy by level. Raiders of the Lost Ark movie watching dataset. Show decoding accuracy by level.

jeremymanning commented 7 years ago

I guess I'm confused then...what was it that changed between this analysis and the other one? Were you optimizing something other than what is shown in this analysis? And are you now confident that the level labels for the above figures are correct and that you're using the right features to decode?

TomHaoChang commented 7 years ago

The problem I had previously was how I applied the weights when constructing the mixture model. I had applied the weights to the "across" timecorr results rather than the correlation of the "across" timecorr results and thus encountered problems while trying to optimize the level-mixture model.

The above figures are obtained from running decoding analysis on each level independently, without any level-mixture procedure nor optimization process, and so was not influenced by the error I had before.

The analysis done above followed the exact procedures you had mentioned in the instructions in your first post: calculating activations on all levels, randomly divide the subjects into two groups and find the isfc of each group at each level, use the two groups' isfc at each level to find the correlation at each level and finally conduct the decoding analysis at each level using the correlation. I think the problem I had before was not with the specific procedures on how to carry out decoding analysis, but rather with how I should apply the weights when mixing levels, so my code for this analysis should be correct. In addition, the results I obtained makes sense and even showed slight improvements from the results in your "A probabilistic approach to discovering dynamic full-brain functional connectivity patterns" paper, which I think is a good indication I am doing things correctly. Am I missing something?

jeremymanning commented 7 years ago

This sounds good, and I now understand what the issue you were describing is. Thanks for explaining!

TomHaoChang commented 7 years ago

Hi Professor Manning, no problem at all! I am almost done with the mixture analysis too. Will post the figures by tonight

TomHaoChang commented 7 years ago

Hi Professor Manning!

Here are the results for variance = 10 Pieman, variance = 10:

Forrest, variance = 10:

Sherlock, variance = 10:

TomHaoChang commented 7 years ago

Hi Professor Manning,

As can be seen from the graphs, accuracy peaks at level 1 instead of level 0 for some cases when variance = 10, but the accuracy was always the best at level 0 in the variance = min(1000, time_len) analysis. In addition, it seems we are able to achieve higher decoding accuracy in general for variance = 10 compared with variance = min(1000, time_len). From this phenomenon, we can conclude that different variances (resolution) will affect the decoding accuracy of different datasets differently).

In addition, after level 1, we can see the decoding accuracy decrease consistently as the level increases for all datasets. So at least from this intra-level analysis, we know that the variance (resolution) does not really affect the pattern of decoding accuracy with respect to level.

What do you think?

Thanks, Tom

jeremymanning commented 7 years ago

Interesting...could you try this with variance=1 also? Also, can you do the mixture analysis with variance=10 and also variance=1?

TomHaoChang commented 7 years ago

Hi Professor Manning!

Of course! I am currently running mixture analysis with variance = 10. Will carry about analysis on variance = 1 after I am done.

Thanks, Tom

TomHaoChang commented 7 years ago

Hi Professor Manning,

All the single level analyses using full datasets are posted in this thread. The multi-level mixture analysis and single level analyses results are in two different threads for better organization.

Thanks, Tom

jeremymanning commented 7 years ago

Thanks!

TomHaoChang commented 7 years ago

Hi Professor Manning,

I am working on adding more content to the Model Description section. However, I am a little lost as to how we should justify our application of PCA. I know the goal of using PCA is to help with storage scaling, but wouldn't using PCA on a correlation matrix destroy the inherent structure. Even if we are using the inverse squareform of the correlation matrix, wouldn't the ordering of the elements still have structural significance? How do we retain this correlation structure if we are repeatedly applying PCA in our level up process? Could you please help me understand the justification behind our use of PCA?

Thanks, Tom

jeremymanning commented 7 years ago

We use PCA as a low-dimensional embedding of the data (at each level). In other words, we want the low-dimensional representation of the data that is as close as possible to the original (high-dimensional) data. (Our choice of PCA specifically, versus some other dimensionality reduction technique, was somewhat arbitrary-- but PCA is quick to compute and has other nice features, like preserving as much variability in the data as possible given the number of dimensions we're saving.)

Further intuitions:

We start with the "raw" data from the scanner (V-dimensional). PCA reduces this to K dimensions, where K << V.
Applying timecorr yields an O(K^2)-dimensional vector at each timepoint, which we reduce down to K dimensions using PCA.

To verify that PCA is preserving temporal correlational structure in the data, you can compute the timepoint-by-timepoint correlation matrix-- in other words, compute how correlated the data at each timepoint is with the data at each other timepoint. You can do this before and after applying PCA. If PCA is preserving the temporal correlational structure in the data, those two timepoint-by-timepoint correlation matrices should look similar. If PCA is losing critical information, those two timepoint-by-timepoint correlation matrices should look different.

TomHaoChang commented 7 years ago

Hi Professor Manning,

I think I get your intuition in choosing PCA, but I am not sure I am following your instructions for testing if the temporal correlation structure is preserved. Given a matrix of TxV dimensions, we can use PCA to reduce it to Txv dimensions. However, the resulting correlation matrix for the original data would be TxVxV, and Txvxv for the PCA reduced matrix. I am unsure how we can make comparison between the two...do we apply PCA again to make them into the same dimensions?

Thanks, Tom

jeremymanning commented 7 years ago

The original data is T by V. Let X1 by the T by T correlation matrix of your original data.

The PCA reduced data is T by K. Let X2 by the T by T correlation matrix of the PCA-reduced data.

I'm suggesting that you compare X1 and X2.

Further, for each level you get a T by O(K^2) matrix (prior to reducing with PCA) and a T by K matrix (after reducing with PCA). You can (analogously) compute a T by T matrix before vs. after reducing the data (at the given level) with PCA, and compare those two matrices (for the given level).

TomHaoChang commented 7 years ago

Hi Professor Manning,

I thought we were calculating voxel-wise correlation and also PCA reducing along the voxel direction...so wouldnt X1 by V by V, and X2 be K by K?

Thanks, Tom

jeremymanning commented 7 years ago

I'm suggesting that you do an additional analysis to specifically test what you were concerned about. You suggested that applying PCA removes important structure from the data. If so, you can do an analysis to test whether that is indeed the case. I've proposed one test above: test whether the timepoint-by-timepoint correlation matrices are preserved (at each level) before vs. after reducing the data using PCA.

ContextLab / timecorr

Analyses of fMRI datasets #5