NeuralAnalysis / PyalData

Repository for the Python implementation of the TrialData analysis library.
GNU General Public License v3.0
7 stars 9 forks source link

`clean_integer_fields` cannot convert `idx_...` field to integer sometimes #111

Open raeedcho opened 3 years ago

raeedcho commented 3 years ago

Situation: In some of my TrialData files, there are some trials when there is one index in an idx_... field and other trials where there are multiple entries. For example, in my center-out behavior, there is an index for when the monkey starts holding in the center target. But if the monkey leaves the center target, the trial is on pause until he comes back, at which point the idx_ctHoldTime value for that trial will be something like [10,50].

Relatedly, there are some trials in which an event doesn't happen at all, and so I set the value for that idx_... field to NaN for that trial.

Problem: Because clean_integer_fields (used by mat2dataframe) assumes a similar format of indices over all trials, it doesn't catch these edge case and can't convert the field to an integer. In the case of the doubled up values, they probably should be converted, and in the case of the NaNs, I suppose they should be set to some other invalid integer.

Solution(?): One solution would have been to re-code my behavior so that when the monkey leaves the center target, it would abort the trial and move on to the next one (which is unfortunately not an option for me). Relatedly, I could reorganize my TrialData structure in MATLAB to add in fake trial aborts and/or only consider the last time (neither is too appealing to me because they wouldn't capture the spirit of what the monkey was doing at the time).

Another solution might be to go trial by trial in clean_integer_fields rather than try to convert the whole column at once, but this strikes me as a pretty clunky solution. Still, maybe it's the simplest to implement and easy enough to do? I think the first if block of the function code actually does exactly this, so perhaps the second block isn't needed?

For the NaN issue, I'm a bit confused. The code seems to treat np.nan as the "right" thing to put in for invalid indices, and yet it's not stored as an integer. Does that mean that idx_... fields aren't meant to be integers? The way it's currently dealt with seems to be that if the integer conversion doesn't result in the same value, it's left alone (sometimes without a warning because an array of NaNs converted to an array of ints actually doesn't throw an error)--so maybe a solution any of this is unnecessary, but it sits weirdly with me that some of my idx... fields might be ints and some might not be.

Thoughts?

bagibence commented 3 years ago

Sorry for the late response. Thanks for raising the issue! We've had some problems related to this in #69

Generally idx_... fields are meant to be integers and missing values nans. (I don't want to use e.g. -1 because of the reverse indexing in Python). The underlying problem is that pandas stores values column-wise and every column has its own dtype which means that the type can't change from trial to trial (except if the column's dtype is object), and once one trial has an np.nan -- which can't be stored as an integer -- the whole column will change to float to accommodate that.

It's really good that you're raising this because now I found this and this, which look like what we need to solve the missing value problem. I'll look into this and send a PR if it seems to work.

As for having multiple values for the same index: as long as there is an array in the column, that column will have the object dtype and won't be converted to int. I'm not sure I understand your situation exactly, but I think the general solution in this case is if in some trials the field is an array, then convert it to an array in every trial. This might make using functions like restrict_to_interval a bit trickier (but still usable with the epoch_fun option) but I think it's better to have a consistent type than to have it change randomly from trial to trial.

bagibence commented 3 years ago

Reopening because I reverted merging the PR in #115 because it introduced problems with simple queries like select_trials(df, "block_target_kappa != 20."). We'll have to think about this some more and do more testing.