ContextLab / timecorr

Estimate dynamic high-order correlations in multivariate timeseries data
MIT License

Multi-level Mixture Analysis #18

Closed TomHaoChang closed 5 years ago

TomHaoChang commented 7 years ago

I'm also flagging an alternative version of this analysis here for possible exploration, as a different way of examining decoding accuracy by level. It could be that, even if data at higher levels didn't by itself lead to better decoding accuracy, mixing in information at higher levels could still improve the decoding. In other words, a mix of level 0 and level 1 might be better than level 0 alone, even if level 0 outperforms level 1.

To test this more formally, I suggest the following:

1. Split the dataset into 2 parts, A and B. Further subdivide each of those into two-- so we have A1, A2, B1, and B2.
2. Compute the level 0, 1, 2, ..., 10 representations of the data for each of those 4 subsets of the data (using the "across" version). So for each subset, for each level, we should have one timepoints-by-features matrix.
3. Using A1 and A2, find the vector of mixing proportions that maximizes decoding accuracy (in other words, find the mix of level 0, level 1, level 2, etc. that produces the best decoding accuracy).
4. Using this mixing proportion from A1 and A2, compute decoding accuracy for B1/B2.
5. Repeat the above 100 times (with different random assignments of A1/A2/B1/B2 each time) to get a distribution of decoding accuracies.

Then plot:

- the average optimal mixing proportions across those 100 runs (as a bar graph or violin plot)
- the average decoding accuracy for B1/B2 across those 100 runs (as a bar graph or violin plot); also compare this to the decoding accuracy obtained using only level 0 data.
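Roughly, the splitting/repetition structure described above would look something like this (a sketch with a placeholder subject count; the per-level feature computation and the optimization are only stubbed out in comments):

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects = 36   # placeholder subject count
n_levels = 11     # levels 0 through 10
n_reps = 100

def random_split(n_subjects, rng):
    """Randomly assign subjects to the four subsets A1, A2, B1, B2."""
    order = rng.permutation(n_subjects)
    return np.array_split(order, 4)

held_out_accuracy = np.zeros(n_reps)
optimal_weights = np.zeros((n_reps, n_levels))

for rep in range(n_reps):
    A1, A2, B1, B2 = random_split(n_subjects, rng)
    # 1. compute the level 0-10 "across" representations for A1, A2, B1, and B2
    # 2. find the mixing proportions that maximize decoding accuracy for A1/A2
    # 3. apply those proportions to B1/B2 and record the held-out decoding accuracy
    # (steps 1-3 are discussed, and sketched in code, later in this thread)
```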

After thinking through this process a little bit more, I have the following questions that I would like to clarify:

  1. What is the exact argument we are trying to make with this analysis? What combination of data at each level will give us the best decoding accuracy? Why do we need to find this combination, as it does not tell us anything about the quality of the dataset or about the level of noise at each timepoint... I am having some difficulty grasping the goal of this analysis.

  2. What will be the target of our optimization? I have been thinking about the equation we are trying to optimize, and since it's a linear sum of the different levels, wouldn't gradient descent converge toward putting the most weight on the level with the highest decoding accuracy to begin with? Or maybe the model will converge toward the weight distribution that would make all the subjects look the most similar? In either case, I am not sure what kind of conclusion we can extrapolate from the results...

  3. Will the optimal weight distribution for one dataset be the same for all datasets? I highly doubt this as in our optimization we really have no way of enforcing generalization...

In the end, I am just a little bit confused as to what purpose this branch serves... could you give me a better picture of your vision? Thanks!

jeremymanning commented 7 years ago

What is the exact argument we are trying to make with this analysis? What combination of data at each level will give us the best decoding accuracy? Why do we need to find this combination, as it does not tell us anything about the quality of the dataset or about the level of noise at each timepoint... I am having some difficulty grasping the goal of this analysis.

The point is this: although representations of the data at higher levels may not (by themselves) lead to better decoding than the raw data, it could be the case that those higher order patterns contain information beyond what is in the raw data. This analysis allows us to explore whether mixing together predictions from features at different levels is better than just using any one level in isolation.

What will be the target of our optimization? I have been thinking about the equation we are trying to optimize, and since it's a linear sum of the different levels, wouldn't gradient descent converge toward putting the most weight on the level with the highest decoding accuracy to begin with? Or maybe the model will converge toward the weight distribution that would make all the subjects look similar? In either case, I am not sure what kind of conclusion we can extrapolate from the results...

We're optimizing decoding accuracy on the training set. It's possible that the optimum could be to put all of the weight on one level. But to the extent that different levels carry different information (or even to the extent to which noise is independent across levels) it could be beneficial to create mixes of multiple levels.

Will the optimal weight distribution for one dataset be the same for all datasets? I highly doubt this as in our optimization we really have no way of enforcing generalization...

That's something we'll want to test!

In the end, I am just a little bit confused as to what purpose this branch serves... could you give me a better picture of your vision? Thanks!

The purpose is to have a more sensitive test of whether higher order information is useful for decoding.

TomHaoChang commented 7 years ago

The purpose is to have a more sensitive test of whether higher order information is useful for decoding.

This makes sense! However, taking it a bit further... how does a higher decoding accuracy help us? I know that decoding accuracy helps us identify which timepoints are more task-related, or more generally the quality of the dataset. But I fail to understand why we want to optimize the accuracy... in other words, how will obtaining a mixture of data at multiple levels help us understand the brain better?

jeremymanning commented 7 years ago

how will obtaining a mixture of data at multiple levels help us understand the brain better?

Each level reflects a different aspect of neural activity:

- level 0: "raw" neural patterns recorded from the scanner (i.e. patterns of activity throughout the brain-- a reflection of what each brain structure is doing at each moment)
- level 1: correlations between the activities of different brain structures
- level 2: correlations between correlations between the activities of different brain structures
- etc.

So knowing which level of activity leads to the best decoding accuracy is really telling us: how is the information about moments in a movie or story represented in the brain? In other words, is each moment reflected in a single pattern? Correlations between patterns? Higher order dynamics?

TomHaoChang commented 7 years ago

Each level reflects a different aspect of neural activity:

- level 0: "raw" neural patterns recorded from the scanner (i.e. patterns of activity throughout the brain-- a reflection of what each brain structure is doing at each moment)
- level 1: correlations between the activities of different brain structures
- level 2: correlations between correlations between the activities of different brain structures
- etc.

So knowing which level of activity leads to the best decoding accuracy is really telling us: how is the information about moments in a movie or story represented in the brain? In other words, is each moment reflected in a single pattern? Correlations between patterns? Higher order dynamics?

Mhmm, I think I understand this part. The decoding accuracy at each level does tell us which level of activity (and thus which order of dynamics) best reflects the brain activity. However, unlike the individual levels, which we can pin specific meanings to (raw data, correlations of the raw data, correlations between correlations of the raw data, etc.), I am not sure what kind of meaning we can assign to a mixture of the levels... In other words, what is the practical meaning of a mixture of levels?

TomHaoChang commented 7 years ago

Hi Professor Manning, I have been thinking about how I should run optimization for this analysis. There are two major problems:

  1. The sum of the weights has to be 1
  2. Decoding with ISFC runs very slowly for datasets with many timepoints (sherlock and forrest)

Therefore I propose we use a special version of gradient descent that's tailored toward our situation. At each iteration, we:

  1. Calculate the gradient with respect to each element in the weights array to form the gradient array g_0, then rescale g_0 so that it sums to 0.1
  2. Update the original weights array w_0 via w_1 = 0.9*w_0 + g_0

This algorithm has two major advantages:

  1. The sum of the weights is always guaranteed to be 1
  2. As each gradient step makes up 10% of the new weights array, we can substantially change the weights within 10 iterations of the algorithm (10 iterations * 11 derivatives = 110 runs of ISFC), which will probably take around 10 hours

With 100 repetitions of this algorithm using different assignments of A1/A2 and B1/B2, I think we will be able to get a pretty good idea of the distribution of the optimal mixing proportions. What do you think?
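As a sketch, one iteration of the proposed update would look something like this (the gradient itself would come from finite differences of the decoding accuracy; a random placeholder is used here):

```python
import numpy as np

def normalized_gradient_step(w, grad, step_mass=0.1):
    """One iteration of the proposed update: rescale the gradient to sum to
    step_mass, then shrink the old weights so the total still sums to 1
    (note: this keeps the sum at 1 but does not force weights to stay >= 0)."""
    g = grad * (step_mass / np.sum(grad))   # g_0 rescaled to sum to 0.1
    return (1.0 - step_mass) * w + g        # w_1 = 0.9 * w_0 + g_0

# toy example with 11 levels and uniform starting weights
rng = np.random.default_rng(0)
w = np.ones(11) / 11
grad = rng.random(11)   # placeholder for the finite-difference gradient of decoding accuracy
w = normalized_gradient_step(w, grad)
print(w, w.sum())       # the sum stays 1 by construction
```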

jeremymanning commented 7 years ago

Re: the decoding running slowly, I think the key is to pre-compute the timepoint-by-timepoint correlation matrices for each level that decoding is based on. The only thing that changes for different values of the mixing proportion vector (w) is how much of each correlation matrix is mixed in to get the final weighted average correlation matrix. So the full sequence would go something like this:

1. Compute each subject's "within" features for each level (this should already be done).
2. To be repeated 100 times, each run in parallel on a separate Discovery node:
   - Compute "across" timecorr for each level for A1, A2, B1, and B2.
   - Compute correlation matrices for each level for A and B (i.e. for group A, correlate A1's features at each timepoint with A2's features at each timepoint; similarly for group B). This yields one correlation matrix for group A and another for group B, for each level (including level 0). So if we go up to level 10, this would be 11 correlation matrices for A and another 11 for B. All of the matrices should be z-transformed (r2z). Let's call them z_A0, z_A1, z_A2, ..., z_A10 and z_B0, z_B1, z_B2, ..., z_B10.
   - After computing these z-transformed correlation matrices, use least squares optimization (on the group A correlation matrices) to compute the weighting matrix w to be used for decoding-- i.e. decoding_matrix_A = z2r(w[0]*z_A0 + w[1]*z_A1 + w[2]*z_A2 + ... + w[10]*z_A10). In other words, you're trying to find the w that maximizes the decoding accuracy using decoding_matrix_A.

With the above pipeline, the compute-intensive part of the algorithm (computing the "across" matrices for each group/level) happens outside of the optimization-- so it should be relatively fast.

The other tricky piece of this, as you noted, is that the w vector is constrained to sum to 1; that's something the optimizer needs to handle (e.g. via constrained optimization).

100 iterations of the least squares optimizer might not be enough, but I think what I'm describing should be fast enough that you can run it up to a reasonable specified tolerance (e.g. 1e-5).
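As a rough sketch of what the inner loop might look like once the z-transformed correlation matrices are pre-computed (random placeholder matrices here, and the decoding score is just the fraction of timepoints whose diagonal entry is the largest in its row):

```python
import numpy as np

def r2z(r):
    return np.arctanh(r)   # Fisher z-transform

def z2r(z):
    return np.tanh(z)      # inverse Fisher z-transform

def decoding_accuracy(decoding_matrix):
    """Fraction of timepoints whose diagonal entry is the largest in its row
    (i.e. each group-A timepoint is matched to the correct group-B timepoint)."""
    T = decoding_matrix.shape[0]
    return np.mean(np.argmax(decoding_matrix, axis=1) == np.arange(T))

def mixed_decoding_accuracy(w, z_matrices):
    """Mix the pre-computed z-transformed correlation matrices with weights w,
    convert back to correlations, and score the decode."""
    w = np.asarray(w) / np.sum(w)                       # normalize to sum to 1
    mixed = z2r(np.tensordot(w, z_matrices, axes=1))    # z2r(w[0]*z_0 + w[1]*z_1 + ...)
    return decoding_accuracy(mixed)

# placeholder: 11 levels of T-by-T z-transformed correlation matrices (random here)
T, n_levels = 50, 11
rng = np.random.default_rng(0)
z_matrices = np.stack([r2z(rng.uniform(-0.9, 0.9, (T, T))) for _ in range(n_levels)])

print(mixed_decoding_accuracy(np.ones(n_levels) / n_levels, z_matrices))
```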

TomHaoChang commented 7 years ago

Compute "across" timecorr for each level for A1, A2, B1, and B2. Compute correlation matrices for each level for A and B (i.e. for group A, correlate A1's features at each timepoint with A2's features at each timepoint; similarly for group B). This yields one correlation matrix for group A and another for group B, for each level (including level 0). So if we go up to level 10, this would be 11 correlation matrices for A and another 11 for B. All of the matrices should be z-transformed (r2z). Let's call them z_A0, z_A1, z_A2, ..., z_A10 and z_B0, z_B1, z_B2, ..., z_B10.

Regarding this process, I just want to clarify that we are summing the "across" timecorr results not the "within" timecorr results? I think if we precalculate the "across" timecorr at each level, this method would indeed yield low runtime. However, one concern I want to run by you is again the focus of this analysis. I thought we were trying to identify the usefulness of the activations at each level by calculating the decoding accuracy through different mixtures of activations at each level. If we conduct the mixture step on "across" timecorr correlations, are we still finding the same thing? Another thing I want to confirm: is mixing at the "across" timecorr level equivalent to mixing at the "within" timecorr level? I can't seem to wrap my head around this. Any clarifications would be super helpful!

After computing these z-transformed correlation matrices, use least squares optimization (on the group A correlation matrices) to compute the weighting matrix w to be used for decoding-- i.e. decoding_matrix_A = z2r(w[0]*z_A0 + w[1]*z_A1 + w[2]*z_A2 + ... + w[10]*z_A10). In other words, you're trying to find the w that maximizes the decoding accuracy using decoding_matrix_A.

I am unsure how least squares optimization would work in this scenario...

jeremymanning commented 7 years ago

Regarding this process, I just want to clarify that we are summing the "across" timecorr results not the "within" timecorr results? I think if we precalculate the "across" timecorr at each level, this method would indeed yield low runtime.

Yes, you're pre-computing the "across" timecorr for each level.

However, one concern I want to run by you is again the focus of this analysis.

I've explained this above and provided a reference-- please read my above comments and the reference I linked to. The summary is that it's a more sensitive analysis than decoding separately for each level. The intuition for what mixtures tell us is: if a combination of level 0 + level 1 provides better accuracy than level 0 or level 1 alone, that tells us that level 0 and level 1 contain partially non-overlapping information.

I am unsure how least squares optimization would work in this scenario...

Which aspect of this approach are you unsure about? Are you unsure of the metric you're optimizing, or some other piece of the optimization procedure?

TomHaoChang commented 7 years ago

I guess the main thing that I wanted to confirm is: given activation matrices A1, A2 with correlation R_A for one level, activation matrices B1, B2 with correlation R_B at another level, and a weight array w = [w_1, w_2], is the correlation between w_1*A1 + w_2*B1 and w_1*A2 + w_2*B2 equivalent to R = z2r(w_1*r2z(R_A) + w_2*r2z(R_B))? More intuitively, does the weighted sum of correlations reflect the correlation of the weighted sum of the activations? I guess I am not really familiar with the Fisher z-transformation and its underlying properties and couldn't really find any good reference material online, so it would be great if you could give me some pointers here.

My main confusion with least squares optimization for this problem is that I am unsure how to set up the equations. In my understanding of the least squares approach, we are supposed to set up an objective of the form Y = ||XW - B||^2 and find the W that minimizes Y via W = (X^T X)^-1 (X^T B). In our framework, W is the weight matrix and X is the activations at each level. However, I am not sure what I should set B as. Unless I am totally misunderstanding what you mean by least squares optimization...

In order to maximize decoding accuracy, we want to ensure that the diagonal of our decoding matrix contains the maximum value in each row. How do we pose this as an optimization problem? In addition, I just realized the row-max criterion makes the accuracy a non-smooth function of the weights, so the idea I proposed might not work since the gradient is zero almost everywhere...

jeremymanning commented 7 years ago

Is the correlation between w_1*A1 + w_2*B1 and w_1*A2 + w_2*B2 equivalent to R = z2r(w_1*r2z(R_A) + w_2*r2z(R_B))? More intuitively, does the weighted sum of correlations reflect the correlation of the weighted sum of the activations? I guess I am not really familiar with the Fisher z-transformation and its underlying properties and couldn't really find any good reference material online, so it would be great if you could give me some pointers here.

The weighted sum of correlations does not reflect the correlation of the weighted sum of the activations. The Wikipedia article on the Fisher z-transformation is a good place to start. We use the z-transformation in computing ISFC ("across" timecorr) as well. It's a way of averaging correlations that is more stable and less biased than averaging the raw correlations.

Here's another way of thinking about the combined correlation matrix. Each level's correlation matrix tells us about how well the data representation at that level can be used to decode. We're not trying to average the representations at different levels; rather we're trying to understand whether a mix of those correlation matrices (i.e. a mix of decoding information from different levels) is better than the individual levels' matrices alone.

My main confusion with least squares optimization for this problem is I am unsure how to set up the equations.

Take a look at the link to the scipy optimizer I had sent above. Here's a tutorial: [link]. To use the scipy optimizer, you just need a function to be minimized (or maximized), a set of constraints (if you do constrained optimization as I'm suggesting, to ensure each element of w is between 0 and 1, inclusive), and a starting point (which I'd recommend as something like 0.5*np.ones([1, 11])). The to-be-minimized function can be whatever you want it to be (in this case the function is: given a weight vector, return -1 times the decoding accuracy).

You're correct that the objective won't be smooth or convex-- so that means we lose any guarantees about finding global optima, and we can only hope to find local optima.
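For example, the constrained setup could look roughly like this (again with random placeholder matrices; a sketch rather than the exact code I have in mind):

```python
import numpy as np
from scipy.optimize import minimize

T, n_levels = 50, 11
rng = np.random.default_rng(0)

# placeholder: pre-computed z-transformed T-by-T correlation matrices for group A
z_A = np.stack([np.arctanh(rng.uniform(-0.9, 0.9, (T, T))) for _ in range(n_levels)])

def neg_decoding_accuracy(w, z_matrices):
    """-1 times the decoding accuracy of the weighted mix, so that minimizing
    this function maximizes decoding accuracy."""
    mixed = np.tanh(np.tensordot(w, z_matrices, axes=1))   # z2r of the weighted sum
    return -np.mean(np.argmax(mixed, axis=1) == np.arange(mixed.shape[0]))

w0 = 0.5 * np.ones(n_levels)                      # starting point, as suggested above
bounds = [(0.0, 1.0)] * n_levels                  # each element of w between 0 and 1
constraints = [{'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0}]   # weights sum to 1

result = minimize(neg_decoding_accuracy, w0, args=(z_A,), method='SLSQP',
                  bounds=bounds, constraints=constraints, options={'ftol': 1e-5})
print(result.x, -result.fun)
```

One caveat: since the accuracy objective is piecewise constant, a gradient-based method like SLSQP can stall near its starting point; restarting from several initial weight vectors (or using a derivative-free optimizer) may help.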

TomHaoChang commented 7 years ago

A tol of 1e-5 is taking more than 15 hours to run... I am afraid that we can't finish before the 18th. Should we relax tol to 1e-3?

jeremymanning commented 7 years ago

That's fine... But did you run any analyses on synthetic data to show that the more liberal threshold would finish?

TomHaoChang commented 7 years ago

I ran with both 1e-5 and 1e-3 tols, and both were able to finish on the synthetic dataset and give good results. I think this is because the synthetic dataset is relatively small. But I have not tried 1e-3 on the real datasets.

jeremymanning commented 7 years ago

How much faster was the 1e-3 version?

TomHaoChang commented 7 years ago

It's really hard to quantify, as the speed difference varies greatly with dataset size and inherent noise. However, 1e-3 generally requires fewer iterations than 1e-5, so I am guessing there will be a significant difference in runtime.

jeremymanning commented 7 years ago

Sounds good, I'll leave this up to you...

TomHaoChang commented 7 years ago

Okay, I am going to run the 1e-3 version to get some data to work on for next steps, and then run the 1e-5 version while I finish everything else

TomHaoChang commented 7 years ago

It's been 80+ hours and the analysis is still running even with tol at 1e-3. Should I instead set the maximum number of iterations to 1000?

jeremymanning commented 7 years ago

Copying from email:

Did you set up the analysis so that the correlation matrices for each level are pre-computed? How long does it take to evaluate the error function that is being minimized? I'm trying to get a sense of whether it's each iteration that takes a long time, or if the parameter space is especially bumpy/unstable.

One way to determine whether setting a fixed number of iterations is a reasonable approach would be to plot the error function vs. the number of iterations. If the error function flattens out after 1000 iterations, then that's probably enough. If it's still rapidly decreasing after 1000 iterations, then you know you need more.
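If you're using the scipy optimizer, one way to make that plot is to record the objective value in a callback at each iteration-- something like this, with a toy objective standing in for the real (expensive) decoding-accuracy function:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize

errors = []

def objective(w):
    # stand-in for the (expensive) negative decoding accuracy
    return np.sum((w - 1.0 / 11) ** 2)

def record(w):
    # scipy calls this once per iteration with the current weight vector
    errors.append(objective(w))

w0 = 0.5 * np.ones(11)
minimize(objective, w0, method='SLSQP',
         bounds=[(0, 1)] * 11,
         constraints=[{'type': 'eq', 'fun': lambda w: np.sum(w) - 1}],
         callback=record)

plt.plot(errors)
plt.xlabel('iteration')
plt.ylabel('objective value (error)')
plt.show()
```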

TomHaoChang commented 7 years ago

It turns out there are NaNs in the transformed correlation matrices, so the optimization gets stuck after a few iterations. Still looking into this.

jeremymanning commented 7 years ago

r2z(1) will be infinity, and z2r(infinity) is NaN. So if there are 1s (or -1s) in your correlation matrices, that will lead to NaNs; you should look for NaNs and correct (or remove) them.

That being said, hyp.tools.reduce is set up to remove NaNs-- so if you switch to using hypertools to do the dimensionality reduction (which I would recommend!) you shouldn't have this issue... just make sure you have the latest version of hypertools (pip install --upgrade hypertools) so that it uses IPCA by default, rather than PCA.
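To make the NaN issue concrete, here's a minimal illustration using one common way of writing r2z/z2r (the implementations in the codebase may differ slightly), along with a simple clipping fix:

```python
import numpy as np

def r2z(r):
    return 0.5 * np.log((1 + r) / (1 - r))             # Fisher z; r2z(1) -> inf

def z2r(z):
    return (np.exp(2 * z) - 1) / (np.exp(2 * z) + 1)   # z2r(inf) -> inf/inf -> nan
                                                       # (np.tanh(inf) would return 1 instead)

r = np.float64(1.0)    # a perfect correlation
with np.errstate(divide='ignore', invalid='ignore'):
    print(r2z(r))          # inf
    print(z2r(r2z(r)))     # nan

# one simple fix: keep correlations strictly inside (-1, 1) before transforming
def safe_r2z(r, eps=1e-6):
    return 0.5 * np.log((1 + np.clip(r, -1 + eps, 1 - eps)) /
                        (1 - np.clip(r, -1 + eps, 1 - eps)))

print(z2r(safe_r2z(r)))    # finite, very close to 1
```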

TomHaoChang commented 7 years ago

I checked the activations at all levels for all the datasets, and there do not seem to be any NaNs at the activation level or in the correlations at any level.

However, it seems I am missing files for some repetitions of the random group division/ISFC process before the weights optimization. I am in the process of regenerating this step. In addition, I found a bug in my code where I applied r2z twice, which was the source of the NaNs.

Regarding the speed issue, it seems every iteration of the optimization takes a very long time to run, which is simply caused by the sheer size of the datasets (I tested the code on a small portion of the actual dataset and it runs really fast), so I am not sure there's an easy way to circumvent this.

TomHaoChang commented 7 years ago

However, after fixing the NaN problems, the program no longer gets stuck, so it now finishes running relatively quickly (5-10 minutes). I will redo the level-mixture analysis after I regenerate the all_level_ISFC of the random divisions.

In addition, I have been observing the entire optimization process for the all-level ISFC of one repetition of the random group division. It seems that the most weight is always put on the raw level (~1) and that any mixture of subsequent levels only serves to decrease the decoding accuracy. Here is an example optimization result:

Training:
weights: [0.9928229, 0.38104045, 0.00434233, 0.27960215, 0.04644104, 0.18054706, 0.13902259, 0.08092787, 0.21337389, 0.00267682, 0.16977964]
decoding accuracy: 0.033333333333333333

Testing:
weights: [0.39862661, 0.15299756, 0.00174685, 0.11226177, 0.01865203, 0.07249771, 0.05580802, 0.03249145, 0.08566448, 0.00108512, 0.06816839]
decoding accuracy: 0.016666666666666666

The discrepancy between the training and testing weights is due to the normalization w/np.sum(w).

jeremymanning commented 7 years ago

First, you should be normalizing the weights for both the training and test data, not just the test data. Second, I don't understand why these accuracies are so much lower than with the non-optimized data. I'm thinking something may be off.

As a test, can you try a 50-50 mix of the Level 0 and Level 1 correlation matrices and report the decoding accuracy? (Which dataset are the above numbers for?)

TomHaoChang commented 7 years ago

Okay, I will do that. I exceeded my disk quota again, so I can't modify my code or run any new programs. I emailed John already and am waiting for his response. Will respond ASAP once I am done.

TomHaoChang commented 7 years ago

The results are for pieman-intact

jeremymanning commented 7 years ago

Here's the decoding accuracy I've found using a sliding window version of that analysis:

decoding_accuracy_mix.pdf

So I'd expect to do about that well (i.e. around 14% for the intact condition).

TomHaoChang commented 7 years ago

Hi Professor Manning, what are the weights with which you were able to achieve the above accuracies? Maybe I can use your weight optimization result to figure out what I might be doing incorrectly. In addition, it looks like the mixture-level decoding accuracy is generally less than the raw-data decoding accuracy and the first-level decoding accuracy, which is consistent with the optimized weight results I obtained, which put most of the weight on the 0th and 1st levels. How should we make the point that the higher-order information is contributing to the results? Our original projection was that the mixture-level analysis would give us better results than the raw level, right?

In addition, you mentioned that I should do a violin plot of the results after I obtain the optimal weights and the decoding accuracy distribution. What other analyses should I do? I want to get everything prepared while I wait for John's response.

Lastly, I am currently unable to modify my code or store the outputs of my programs on Discovery because I have exceeded my disk quota. I sent John an email Friday night about this problem, but I don't think I will be able to hear back from him until Monday. However, I have already completed the code for the multi-level mixture analysis, so once he resolves the server issues I will be able to complete the analysis within the day.

jeremymanning commented 7 years ago

As a test, can you try a 50-50 mix of the Level 0 and Level 1 correlation matrices and report the decoding accuracy? (Which dataset are the above numbers for?)

^ those are the weights-- 0.5 on level 0 and 0.5 on level 1.

TomHaoChang commented 7 years ago

I did some testing and finally found the problem that's causing low mixture accuracy....

First of all, the decoding accuracy for each level of pieman-intact, starting from the raw data, is: [0.19667, 0.10667, 0.01333, 0.01, 0.0033, 0.0033, 0.00667, 0.00333, 0.00666, 0.00333, 0.]

When I combined the activations from level 0 and level 1 with weights 0.5 and 0.5, I achieved an accuracy of 0.12334.

For the optimization setup, I calculated "across" timecorr for each level and achieved the following results:

  1. When optimizing over all levels, the accuracy is around 4-5% with most weight focused on the lower levels
  2. When optimizing over only the first two levels, the accuracy is a little over 5% with most weight focused on the 0th level
  3. When weights of 0.5 and 0.5 are used for the first two levels, I obtain accuracy of around 2%

After careful investigation, I think the main factor causing the difference between our results is the way I structured "across" timecorr and how I applied the weights. When "across" timecorr is applied, the ISFC of the activations is calculated, so the output data is to some extent one level higher than the input. The most prominent effect is on the raw data: the decoding accuracy for level 0 is around 20%, but with "across" the accuracy goes down to a little over 10% (which is very close to the decoding accuracy at level 1). So I propose adding a level of mean raw activations before the raw-data "across", which I think was how we implemented the intra-level decoding analysis. To illustrate:

Our current setup --> level 0 "across", level 1 "across", etc.
New setup --> mean of level 0, level 0 "across", level 1 "across", etc.

As the mean of the raw activations will not be contained within the -1 to 1 range, we will need to modify the way we apply the weights to the "across" data. We can either find a way to standardize the mean raw activations (e.g. division by the largest absolute value in the matrix) or abandon the Fisher z-transformation. What do you think?

(P.S. I have been working my way through the references I have found for the introduction section. Will post a list in the other thread by tonight.)

jeremymanning commented 7 years ago

It sounds like your level 0 is really level 1, etc. Level 0 should be based on the PCA-reduced raw data, with no "across" timecorr applied. Level 1 should be "across timecorr" applied to level 0. Level n should be "across timecorr" applied to level n-1. So you should still run the 50-50 mix of level 0 and level 1 features, after you correct this. If you get higher decoding accuracy than you get with level 0 or level 1 alone (which is what I've found previously), then that would suggest a problem with the optimization procedure-- i.e. it's not finding an optimal solution.
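For intuition, here is a heavily simplified, self-contained sketch of that hierarchy for a single subject's data-- Gaussian-weighted moving correlations stand in for timecorr, and PCA stands in for the hypertools-based reduction; the actual "within"/"across" (ISFC) computations differ in their details:

```python
import numpy as np
from sklearn.decomposition import PCA

def dynamic_correlations(data, var=10.0):
    """For each timepoint, compute a Gaussian-weighted feature-by-feature
    correlation matrix and return its vectorized upper triangle (T x K*(K-1)/2)."""
    T, K = data.shape
    iu = np.triu_indices(K, k=1)
    out = np.zeros((T, len(iu[0])))
    for t in range(T):
        w = np.exp(-0.5 * (np.arange(T) - t) ** 2 / var)   # Gaussian kernel centered at t
        w /= w.sum()
        centered = data - w @ data                         # remove the weighted mean
        cov = (w[:, None] * centered).T @ centered         # weighted covariance
        sd = np.sqrt(np.diag(cov))
        out[t] = (cov / np.outer(sd, sd))[iu]              # weighted correlations
    return out

def build_levels(raw, n_levels=3, n_features=10, var=10.0):
    """Level 0 = PCA-reduced raw data; level n = dynamic correlations of level n-1,
    reduced back to n_features before computing the next level."""
    levels = [PCA(n_components=n_features).fit_transform(raw)]
    for _ in range(1, n_levels):
        corrs = dynamic_correlations(levels[-1], var=var)
        levels.append(PCA(n_components=n_features).fit_transform(corrs))
    return levels

rng = np.random.default_rng(0)
raw = rng.standard_normal((100, 50))    # placeholder: 100 timepoints x 50 voxels
levels = build_levels(raw)
print([lvl.shape for lvl in levels])    # each level: 100 timepoints x 10 features
```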

TomHaoChang commented 7 years ago

Okay, got it. However, what's the best way to combine this raw data with the "across" timecorr of other levels? Raw activation data occupy a very wide range of values, which exceeds the -1 to 1 range of Fisher Z-transformation

jeremymanning commented 7 years ago

I don't understand why that's an issue... Aren't you computing correlations?

TomHaoChang commented 7 years ago

When we calculate correlations, the raw data (level 0) becomes level 1, which is why I couldn't get as high an accuracy as you found. I think we want to include both information from the raw data as well as correlations of the raw data, unless I am misunderstanding...

For the decoding analysis, we had separate analyses for the raw data (level 0) and the raw-data correlations (level 1), so I am thinking that we should do the same here...

jeremymanning commented 7 years ago

Right

TomHaoChang commented 7 years ago

If we use the raw data as level 0, then we can't use the Fisher z-transformation to combine all the levels. There are two scenarios we are talking about, I think:

  1. Level 0 being the "across" of the raw data, which is what I am doing now and yields low accuracy.
  2. Level 0 being the raw data, making it difficult to combine with the other levels, which are correlations. However, this version will give us higher accuracy, and will match your findings.

If we want to go with the second version, we have to find a good way to normalize the raw activations....

TomHaoChang commented 7 years ago

Hi Professor Manning, sorry about my confusion before. I made the mistake of applying weights to the timecorr "across" results instead of applying weights to the correlation of timecorr "across" results, as you had described in your instructions in another thread. I was able to resolve this problem today and have achieved some interesting results in my initial testing.

First of all, decoding accuracy seems to decrease significantly when we decrease the number of subjects. For example, raw-activation decoding accuracy for 36 subjects is around 21%, but only around 12% for 18 subjects. Similarly, level 1 decoding accuracy for 36 subjects is around 11%, but only 7% for 18 subjects.

Secondly, when I optimized the level-mixture model, I was able to obtain a mixture accuracy that's significantly higher than both the raw-data and level 1 accuracies. Since we are dividing the subjects into group A and group B, the decoding accuracies for the raw data and level 1 should be around 12% and 7%, as mentioned in the previous paragraph, but I was able to achieve 20% training accuracy and 15% testing accuracy with the mixture model. In addition, the weights were evenly distributed across all the levels.

Furthermore, I tested my method by using a mixture of only the first two levels (raw data and level 1) and was able to achieve 17% training and 15% testing accuracy. To verify that my optimization process is working correctly, I conducted testing with the initial weight fixed to [1,0] and [0,1], and obtained the same optimization result in both scenarios.

What do you think? Should I carry through with this analysis on the entire dataset? Or is there something wrong?

jeremymanning commented 7 years ago

This sounds promising-- can you post some figures?

TomHaoChang commented 7 years ago

The results above are just from a few manual tests. I can proceed to run the analysis on all the datasets and post the figures shortly.

TomHaoChang commented 7 years ago

Hi Professor Manning,

Here is the multilevel mixture analysis violin-plot for pieman-intact.

image

The accuracy from the mixture model is significantly higher than the individual levels.

Here is the weights distribution figure for pieman-intact.

image

The weights are very evenly distributed among all the levels, with a slightly higher emphasis on level 1.

TomHaoChang commented 7 years ago

I think the violin plot is a good choice for weights distribution, but it becomes more like a box plot for the decoding accuracy plot. In addition, I can't really think of a good way to compare across different levels of cognitive salience. Would an error bar plot like the one I made for intra-level decoding analysis be a better option?

jeremymanning commented 7 years ago

Interesting! You could do a grouped plot to compare across the different pieman conditions.

TomHaoChang commented 7 years ago

Hi Professor Manning,

I will do that. Should I also do the same for weights distribution across different pieman conditions?

jeremymanning commented 7 years ago

Yeah, that sounds good

TomHaoChang commented 7 years ago

Hi Professor Manning,

Here are the figures for pieman:

Multilevel Mixture Analysis: image

Weights distribution: image

Figures for Forrest and Sherlock:

image

image

TomHaoChang commented 7 years ago

Hi Professor Manning,

I am not sure why the violin plot shows the weights going below 0. I checked all the optimized weight matrices and was able to verify that everything is positive.

Do you think this analysis is sufficient? I think the weights distribution figure and decoding accuracy figure show that the 1st level (raw-data ISFC) contributes the most to decoding, but a mixture of all the levels gives significantly higher accuracy than any individual level alone and proves your hypothesis:

Each level's correlation matrix tells us about how well the data representation at that level can be used to decode. We're not trying to average the representations at different levels; rather we're trying to understand whether a mix of those correlation matrices (i.e. a mix of decoding information from different levels) is better than the individual levels' matrices alone.

This phenomenon is consistent across all the datasets, which makes the result generalizable. What do you think?

Thanks, Thomas

jeremymanning commented 7 years ago

I think you should run the weights optimization analysis another time, but with a much smaller variance-- e.g. instead of using a variance of 1000, try a variance of 10 or 100 (or, ideally, both).

What this will test is whether the patterns showing up as important (i.e. the levels that are given large weights) are truly correlation based, or whether level 1 just happens to have the right timescale. If the "best" level changes with the variance parameter, that could indicate that we should frame the results differently. If the results don't change with the variance parameter, that would indicate that the results are framed well (insofar as we have thought of all relevant confounds and alternatives).

TomHaoChang commented 7 years ago

Hi Professor Manning,

Right now I am not using 1000, but rather a variance parameter equal to the minimum between the dataset time length and 1000, so it's a little different between pieman, sherlock and forrest. I can fix the variance to be 10 instead for a strong contrast and see what happens.

However, I am unsure what you mean by timescale. In addition, changing the variance parameter would also affect all non-raw-data levels, so I am not sure what kind of conclusion we can infer from having different results. If we are trying to decrease the correlation recovery quality of the best level to infer whether our pattern is correlation-based, could we replace the 1st (best) level with a random matrix and see what happens? This way we can keep the other levels constant and have some control over our conclusion... what do you think?

Thanks, Tom

jeremymanning commented 7 years ago

However, I am unsure what you mean by timescale. In addition, changing the variance parameter would also affect all non-raw-data levels, so I am not sure what kind of conclusion we can infer from having different results.

The level 0 (raw) data will be unaffected by the variance parameter. However, as you move up each subsequent level, the previous level's data are effectively being smoothed out by our applying a Gaussian kernel with the given variance. For level 1, we're smoothing the level 0 data. For level 2, we're smoothing the level 1 data-- so effectively level 2 has been smoothed twice (once to move from level 0 to level 1, and a second time to move from level 1 to level 2). And so on. This means that as we move up each level, the "effective" temporal resolution gets coarser and coarser.

In our prior explorations, we had found that setting the variance to around 1000 (or the number of timepoints, whichever is smaller) was a good heuristic that worked well for several different synthetic datasets, for recovering the level 1 patterns.

But suppose that the real representations we were looking for were level 2. Then we'd want the level 2 data to have an effective smoothing (relative to the raw data) roughly equivalent to what we would get if we had applied smoothing with a variance of min(1000, T) to the raw data. Instead, in our current analysis, each time we go up a level we smooth out the previous level's data by a factor of min(1000, T).

TomHaoChang commented 7 years ago

Hi Professor Manning,

Thank you for your explanation! I think I get what you mean now. Should I go ahead with a variance of 10, since we can effectively move the smoothing up by 2 levels this way? A variance of 100 won't cause a distinct temporal-resolution level shift (100*100 = 10,000 > 1000), as the shift would end up between two levels, so I am not sure it would do a good job of highlighting the problem we are having.

BTW, what do you think of the results from the single-level decoding analysis in the other thread (Analyses of fMRI datasets)? Should I go ahead and close the issue?

Thanks! Tom