NeuralAnalysis / PyalData

Repository for the Python implementation of the TrialData analysis library.
GNU General Public License v3.0

Using hierarchical indexing/MultiIndices to put timepoints of timevarying signals on individual rows? #114

Closed: raeedcho closed this issue 3 years ago

raeedcho commented 3 years ago

Pandas seems to have a lot of useful indexing and time-series tools that might make the nitty-gritty aspects of processing neural data easier. In particular, because all time-varying data streams are already aligned and synced to each other, I think it might make sense to have each time point be a row of a DataFrame, with a MultiIndex of (monkey, session_date, trialID, time_into_trial).

Setting it up this way would allow for really easy trimming and multi-trial concatenation (simple filtering or slicing), binning or averaging over trials (aggregation over rows), and probably several other low-level signal manipulations that are built into pandas. We could then also store the idx... fields as timestamps or as TimeDeltas, since pandas allows some really clever indexing by time.
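A rough, runnable sketch of the layout I'm imagining (all names and numbers are made up; times in ms for simplicity):

```python
import numpy as np
import pandas as pd

# One row per time point, indexed by (monkey, session_date, trialID, time_into_trial).
index = pd.MultiIndex.from_product(
    [["MonkeyM"], ["2020-01-01"], [0, 1, 2], [0, 10, 20]],  # times in ms
    names=["monkey", "session_date", "trialID", "time_into_trial"],
)
df = pd.DataFrame({"PMd_rates_0": np.random.rand(len(index))}, index=index)

# Trimming: keep only the first 10 ms of every trial by slicing the index.
trimmed = df.loc[pd.IndexSlice[:, :, :, :10], :]

# Trial-averaging: aggregate over trials at each time point.
avg = df.groupby(level="time_into_trial").mean()
```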

I think the main cost of this implementation is that we would either have to replicate the trial-level information (e.g. trial result, target direction, etc.) across all rows of the now very tall DataFrame, or keep a separate table of trial information and use joins with the time-varying data to select things (this second way feels more like the "right" way to do things, but then we'd have to keep track of two different objects).

We might also have to develop a way to realign trial timestamps according to different reference events (e.g. trial start vs. go cue vs. movement onset). Alternatively, we could keep a separate column of TimeDeltas from each of the events we care about, and just re-index the time-varying DataFrame every time we want to "realign" the data.
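For instance, realignment could just be a per-trial subtraction followed by a re-index. A minimal sketch, assuming a (trialID, time) index and one timestamp column per event (all names made up):

```python
import pandas as pd

def realign(df, event_col):
    # df: time-varying data indexed by (trialID, time); event_col: a column
    # holding the event's timestamp, replicated on every row of its trial.
    out = df.reset_index()
    out["time_from_event"] = out["time"] - out[event_col]
    return out.set_index(["trialID", "time_from_event"])

# e.g. realigned = realign(df, "go_cue_time")
```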

I've been messing around a little bit with trying this format out, but before I go too far down the rabbit hole, what do you all think?

bagibence commented 3 years ago

Hmm, in short I think this would complicate things and I wouldn't do it. Unfortunately, I'm pretty sure it would end up making selecting things harder in many ways. While having np.ndarray fields inside the dataframes is not very "tidy", I think it's an important part of the design that each row/unit is a trial. Replicating things even more would be messy and take up more space.

I'm a fan of xarray and often use it after getting data out of PyalData (e.g. to keep track of the dimensions that get_sig_by_trial gives me). What I've always been tempted to do -- and I guess it's similar to your suggestion in spirit -- is to rely a bit more on DataArrays. We decided to store and work with signals as numpy arrays because the transformations we include in the package are fairly simple. I would stick to storing data as plain arrays, but we could think about using xr.DataArray in the analysis code. It would make some of the code clearer (e.g. by having an explicit time dimension), but I'm sure it would sometimes also introduce complications (like sklearn models expecting np.ndarrays).
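Something like this minimal sketch (names and shapes made up):

```python
import numpy as np
import xarray as xr

# Storage stays a plain (time x neuron) numpy array, as in PyalData...
rates = np.random.rand(300, 64)

# ...but analysis code wraps it with explicit, named dimensions.
da = xr.DataArray(
    rates,
    dims=("time", "neuron"),
    coords={"time": np.arange(300) * 0.01},  # time in seconds
    name="PMd_rates",
)

psth = da.mean(dim="neuron")  # reductions become self-documenting
X = da.values                 # sklearn still gets the raw array
```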

raeedcho commented 3 years ago

I agree that the replication part is a bit undesirable, and perhaps splitting the trial info out into its own table isn't the best either. I'm curious which aspects of selection would be harder -- could you elaborate?

bagibence commented 3 years ago

Sure. Sorry if I'm being overly skeptical, maybe I just haven't been able to wrap my head around it properly yet and need to see how it works in practice first.

My biggest (conceptual) problem is that we lose the trial as the main unit and can't think of a dataframe as a collection of trials anymore because it becomes a collection of time points with some trial info attached to each of them.

As soon as every row is a time point, basically every trial-wise operation has to be done T times, where T is the number of time points in the trial -- on the order of hundreds. That is true for both querying and adding data: select_trials(df, "reaction_time < 200") would have to make T times as many checks, and the same goes for df.reaction_time = df.idx_movement_on - df.idx_target_on. Of course, it would also need T times as much storage. To get a histogram of reaction times, you'd first have to extract a single time point from each trial and then compute on those, instead of just calling df.reaction_time.hist().
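A toy illustration of that last point (names and numbers made up):

```python
import pandas as pd

# Trial-per-row: one reaction time per trial, histogram is a one-liner.
trial_df = pd.DataFrame(
    {"reaction_time": [180, 250, 210]},
    index=pd.Index([0, 1, 2], name="trialID"),
)
trial_df.reaction_time.hist()

# Time-point-per-row: the same value is replicated T times per trial,
# so it has to be collapsed back to one row per trial first.
long_df = trial_df.loc[trial_df.index.repeat(300)]  # T = 300
long_df.groupby(level="trialID")["reaction_time"].first().hist()
```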

How do you separate signals from each other? Is each neuron in each signal a separate column identified by some naming convention like "PMd_rates_0"? I'm sure querying for something like that is much more complicated than trial.PMd_rates. Or are the signals and their dimensions extra levels of the multi-index?

The biggest problem with the two-table solution is that we can't pass around a single dataframe object anymore.

raeedcho commented 3 years ago

No problem, thanks for talking it out! Let me preface this by saying that I don't have a full implementation worked out either.

I agree that the biggest problem would be replicating trial-level data across all time points of each trial, for the storage and computation-time reasons you mentioned. I think the way people deal with this in practice with large datasets is to split the information into multiple tables of a database schema (which I believe is called normalization), using table joins after various manipulations to get only the entities they want. In your example, you'd have a trial_info table and a trial_signals table (or something like that), and you'd select trials with df = trial_info.loc[trial_info['reaction_time'] < 200, :] and then get the signal data with full_td = df.join(trial_signals).
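As a runnable sketch of that two-table scheme (all table and column names are just illustrative):

```python
import pandas as pd

# One row per trial: trial-level metadata.
trial_info = pd.DataFrame(
    {"reaction_time": [180, 250, 210]},
    index=pd.Index([0, 1, 2], name="trialID"),
)

# One row per time point: the tall time-varying table.
trial_signals = pd.DataFrame(
    {"PMd_rates_0": range(9)},
    index=pd.MultiIndex.from_product(
        [[0, 1, 2], [0, 10, 20]], names=["trialID", "time"]
    ),
)

# Filter on the small table, then join in the signals by trialID.
df = trial_info.loc[trial_info["reaction_time"] < 200, :]
full_td = df.join(trial_signals)
```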

As you pointed out, though, this would require some notion of a data schema with multiple tables, which in our case would probably mean defining a class with DataFrame attributes for the schema. That's not ideal, because I think we'd rather have PyalData be a library of functions applied directly to naked DataFrames (though I suppose the library already assumes a specific structure for the DataFrame, or else things wouldn't work). Possibly, though, this style of structuring data falls more into the realm of analysis on top of an underlying relational database framework, like DataJoint, so it might just not be a good idea to make PyalData do something like this.

Regarding separating signals from each other -- I haven't really figured that out yet. One solution would be to have a MultiIndex on the columns as well, so neurons are separate but also grouped (but then what do we do with non-neural signals?). Another would be to store them together as a vector, as we usually do for population analysis (but then what if we want to do single-neuron analyses?). A third, more radical solution might be to "melt" the signals table entirely and make a table indexed by ('trial_id', 'trialtime', 'signal_id') whose single value column holds the value of that signal at that time point in that trial, relying on pandas' native ability to easily restructure tables at will (through pivots or stacking) for analysis.
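That third option might look something like this sketch (made-up names; stack/unstack do the restructuring):

```python
import numpy as np
import pandas as pd

# Wide form: one column per (signal, dimension), one row per time point.
wide = pd.DataFrame(
    np.random.rand(4, 2),
    index=pd.MultiIndex.from_product(
        [[0, 1], [0, 10]], names=["trial_id", "trialtime"]
    ),
    columns=pd.Index(["PMd_rates_0", "PMd_rates_1"], name="signal_id"),
)

# "Melt" into one value per (trial_id, trialtime, signal_id)...
long = wide.stack()

# ...and pivot back to whatever shape an analysis needs.
back_to_wide = long.unstack("signal_id")
```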

Anyway, I'm going to close this, because you've helped me convince myself that bringing this change to PyalData isn't a good idea -- but I was curious what others thought about structuring data storage this way to take advantage of pandas' native data manipulation abilities.

bagibence commented 3 years ago

Yeah, I didn't really want to get into the multi-table discussion. I think as soon as we step away from the single-dataframe model, we lose the simplicity and "magic" of TrialData. That said, I've used relational databases for neuroscience data and liked the approach -- Pony ORM is great.