NeuralAnalysis / PyalData

Repository for the Python implementation of the TrialData analysis library.
GNU General Public License v3.0
7 stars 9 forks source link

Consider using DataFrame index to flag problematic trials in warnings #128

Open raeedcho opened 2 years ago

raeedcho commented 2 years ago

In interval.py, trial_data.trial_id is used in warnings to flag trials where things have gone wrong. I suggest changing this to trial_data.index instead, for two reasons:

  1. With index returned, a user can quickly pull up the problematic trial using trial_data.loc[index_val,:] rather than using more cumbersome selection code like trial_data.loc[trial_data['trial_id']==trial_id_val,:] (I admit I may be unaware of a really simple way to do this)
  2. Unlike trial_id, trial_data.index has to exist in the DataFrame, and entries in it have to be unique, which adds a bit of robustness to the framework.

For some context, in my workflow, I end up setting trial_id as the DataFrame index and make sure to set reset_index=False whenever using PyalData functions with the option (another thing I think should be default, but perhaps that's another issue). This gets rid of the column from the DataFrame, but gives me handles for each row of the DataFrame as I would normally use the trial_id. However, I get errors when a PyalData function refers to trial_data.trial_id, and I think in at least most cases, it would be better to use trial_data.index anyway. I doubt that this change would break anyone's code, but I've certainly been wrong about what breaks code before 😄

If others agree, I can open up a pull request.

bagibence commented 2 years ago

This is a great question, sorry for getting around to it so late.

I'm not exactly sure what the reasoning was to use trial_id as an identifier.

Regarding your two points:

  1. I use df.query quite a lot and trial_data.query(f'trial_id == {trial_id_val}') or trial_data.query(f'trial_id in {trial_id_val}') might be an easier way of doing that.
  2. Pandas indices don't have to be unique :(

Other reasons I can think of:

  1. Consistency with the MATLAB version? No sure about this because I haven't used it.
  2. Consistency with TrialData.jl because Julia's DataFrames don't have indices like pandas.
  3. Maybe plays nicer with df.query
  4. Maybe I just thought that trial_id is not that important, and that linear indexing is more natural for most operations. For example when doing things like df[i] for quick testing, and it's unexpected if that throws errors because i is not in the index. Always having a linear index, it's not a problem if we forget about using .iloc.

That said, I see you point, and I don't think we rely on it in too many places, so we can think about switching.

raeedcho commented 2 years ago

Ah, yes, I was mistaken about Pandas indices being unique, which makes this a bit frustrating (possibly I had DataJoint in mind when I was thinking about this issue before).

Interesting point about other frameworks not having labeled indices--make it makes sense not to give pandas indices special meaning and rather just have a column for trial_id where we need it.

Still, I could imagine a scenario where I might not want to have a column called trial_id, and as you say, the rest of the codebase doesn't seem to rely on having that column. So maybe it would make sense to either:

1) Just use the DataFrame index for warnings or 2) Use the trial_id column if it exists, but if it doesn't, fall back to the DataFrame index

For what it's worth, in my fork of PyalData, I've just moved to (1) in interval.py