ContextLab / quail

A Python toolbox for analyzing and plotting free recall data
http://cdl-quail.readthedocs.io/en/latest/
MIT License

Temporal clustering #58

Closed andrewheusser closed 7 years ago

andrewheusser commented 7 years ago

@jeremymanning : when you have a moment, could you take a look at the code, specifically here: https://github.com/ContextLab/quail/blob/dd27373dc27670c7ca806d118a6b82337805e241/quail/analysis.py#L355-L412

The tests I wrote (recall order identical to encoding order, and recall order flipped relative to encoding; both should yield a score of 1) are passing, but I'd appreciate it if you could check how I implemented it.

jeremymanning commented 7 years ago

sure-- i can get to this today, but probably not for a few more hours

andrewheusser commented 7 years ago

@jeremymanning just a reminder, when you get a moment! :)

jeremymanning commented 7 years ago

the code overall looks like it'll give the correct answer. however, i have a couple of suggestions.

first, the efficiency could be improved slightly. in the current version, you sort the distances, compute the index of the recall, and then take the mean across all recalls. a more efficient version would compute the proportion of recalls with a larger distance than the observed transition. this can be done in one step; e.g. lines 394--399 could be replaced with weights.append(np.sum(dist_vec > np.abs(c - n)) / len(dist_vec)).
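for concreteness, here's a rough sketch of that one-step version (the variable names just mirror the ones in analysis.py; this isn't the actual implementation):

```python
import numpy as np

# illustrative one-step weight computation: for a transition from serial
# position c to serial position n, the weight is the proportion of candidate
# transitions whose distance exceeds the observed lag |c - n|
def transition_weight(c, n, dist_vec):
    dist_vec = np.asarray(dist_vec, dtype=float)
    return np.sum(dist_vec > np.abs(c - n)) / len(dist_vec)
```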

second, it's a bit messy to have tempclust_helper be a separate function from fingerprint_helper. a more efficient overall organization would be to separate out a compute_distances function from the fingerprint computations, and have fingerprint_helper take in a distance matrix. (the dist_funcs should be passed to compute_distances, not fingerprint_helper.) the distance matrix for temporal clustering is scipy.linalg.toeplitz(np.arange(list_length)).
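for reference, here's what that toeplitz matrix looks like for a hypothetical list_length of 5:

```python
import numpy as np
from scipy.linalg import toeplitz

# entry (i, j) is the absolute lag |i - j| between serial positions i and j
list_length = 5  # illustrative
dist_mat = toeplitz(np.arange(list_length))
# array([[0, 1, 2, 3, 4],
#        [1, 0, 1, 2, 3],
#        [2, 1, 0, 1, 2],
#        [3, 2, 1, 0, 1],
#        [4, 3, 2, 1, 0]])
```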

andrewheusser commented 7 years ago

awesome, thanks for the feedback! I totally agree about consolidating the code, and will refactor it. For context: under time pressure, it was easier for me to implement a separate helper than to restructure the fingerprint code, so that's what I did. I'll redo it and submit a PR.

andrewheusser commented 7 years ago

@jeremymanning other than this, I think a "lab/close friends release" (0.1.0) is ready to go. Do you want me to refactor the fingerprint/temporal clustering code before the release?

jeremymanning commented 7 years ago

I think it'd be worth refactoring the code prior to sharing it, in case anyone decides to modify or build on our code. We don't want them putting in the work to understand (or modify) a design that's already out of date.

The other major piece that I think we'll want in place before sharing is a more robust data parsing and loading framework. For example, we should be able to easily load in a few sample datasets, e.g. most of the datasets from here.

andrewheusser commented 7 years ago

👍 this sounds good. I'll add these issues to the 0.1.0 release and get to them soon.

andrewheusser commented 7 years ago

@jeremymanning I think I'm very close with this, but after refactoring I can't get these weights to come out right. Sorry to ask, but if you have a chance, could you look over this code and let me know if you see anything wrong with it? https://github.com/ContextLab/quail/blob/temporal-clustering/quail/analysis.py#L136-L250

Otherwise, I will just pick this back up in the morning after I meet with Sam.

andrewheusser commented 7 years ago

One more piece to this: for some features (binary ones in particular), multiple items can share the same distance from a given item. For example, for size (bigger or smaller than a shoebox), if a subject transitions from a small item (n) to another small item (n+1), the distance would be 0, but the distance from n to many other items (all the other small items on the list) would also be 0. This makes calculating the rank a little complicated.

To deal with this, we take the average percentile rank of all items that share the lowest distance. One result of this choice is that the clustering score for these categorical variables will be biased towards .5, since the average percentile rank will typically be >0 and <1 (I think). Is this right?
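Here's a rough sketch of the tie-handling I mean (illustrative, not the actual quail code):

```python
import numpy as np

# illustrative tie-handling: count items strictly farther than the observed
# transition, and give items tied at the observed distance the average of
# the ranks they would occupy (i.e., count each tie as half)
def avg_percentile_rank(observed_dist, dist_vec):
    dist_vec = np.asarray(dist_vec, dtype=float)
    farther = np.sum(dist_vec > observed_dist)
    ties = np.sum(dist_vec == observed_dist)
    return (farther + 0.5 * ties) / len(dist_vec)

# e.g. a small-to-small transition when 3 of the 7 remaining items are
# small (distance 0) and 4 are big (distance 1):
# avg_percentile_rank(0, [0, 0, 0, 1, 1, 1, 1]) -> (4 + 1.5) / 7 = ~0.79,
# pulled toward .5 instead of the 1.0 a tie-free feature would give
```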

Along with correcting for list length effects, it seems like a permutation procedure would also help with this issue. Let me know if I'm off base on this.
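Something along these lines is what I'm imagining (score_fn and recall are hypothetical placeholders, not quail functions):

```python
import numpy as np

# hypothetical permutation correction: compare the observed clustering score
# against a null distribution built by shuffling the recall order; the same
# null would also absorb list-length effects
def permutation_corrected(score_fn, recall, n_perms=1000, seed=0):
    rng = np.random.RandomState(seed)
    observed = score_fn(recall)
    null = np.array([score_fn(list(rng.permutation(recall)))
                     for _ in range(n_perms)])
    # proportion of shuffled recall orders scoring below the observed value
    return np.mean(null < observed)
```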