Open raeedcho opened 2 years ago
This is a great question, sorry for getting around to it so late.
I'm not exactly sure what the reasoning was to use trial_id
as an identifier.
Regarding your two points:
df.query
quite a lot and trial_data.query(f'trial_id == {trial_id_val}')
or trial_data.query(f'trial_id in {trial_id_val}')
might be an easier way of doing that.Other reasons I can think of:
df.query
trial_id
is not that important, and that linear indexing is more natural for most operations. For example when doing things like df[i]
for quick testing, and it's unexpected if that throws errors because i
is not in the index. Always having a linear index, it's not a problem if we forget about using .iloc
.That said, I see you point, and I don't think we rely on it in too many places, so we can think about switching.
Ah, yes, I was mistaken about Pandas indices being unique, which makes this a bit frustrating (possibly I had DataJoint in mind when I was thinking about this issue before).
Interesting point about other frameworks not having labeled indices--make it makes sense not to give pandas indices special meaning and rather just have a column for trial_id
where we need it.
Still, I could imagine a scenario where I might not want to have a column called trial_id
, and as you say, the rest of the codebase doesn't seem to rely on having that column. So maybe it would make sense to either:
1) Just use the DataFrame index for warnings
or
2) Use the trial_id
column if it exists, but if it doesn't, fall back to the DataFrame index
For what it's worth, in my fork of PyalData, I've just moved to (1) in interval.py
In
interval.py
,trial_data.trial_id
is used in warnings to flag trials where things have gone wrong. I suggest changing this totrial_data.index
instead, for two reasons:trial_data.loc[index_val,:]
rather than using more cumbersome selection code liketrial_data.loc[trial_data['trial_id']==trial_id_val,:]
(I admit I may be unaware of a really simple way to do this)trial_id
,trial_data.index
has to exist in the DataFrame, and entries in it have to be unique, which adds a bit of robustness to the framework.For some context, in my workflow, I end up setting
trial_id
as the DataFrame index and make sure to setreset_index=False
whenever using PyalData functions with the option (another thing I think should be default, but perhaps that's another issue). This gets rid of the column from the DataFrame, but gives me handles for each row of the DataFrame as I would normally use thetrial_id
. However, I get errors when a PyalData function refers totrial_data.trial_id
, and I think in at least most cases, it would be better to usetrial_data.index
anyway. I doubt that this change would break anyone's code, but I've certainly been wrong about what breaks code before 😄If others agree, I can open up a pull request.