PsychoinformaticsLab / pliers

Automated feature extraction in Python
https://pliers.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
296 stars 68 forks source link

object_id computed incorrectly when duration set to single value #384

Open adelavega opened 4 years ago

adelavega commented 4 years ago

ExtractorResult.duration can be set to a single value (e.g. 0.1) when this value represents durations for all data points. This should be broadcast to all values, and is when creating df.

However, when computing object_id this is not done, and it causes the index that is sued to groupby the data to be a single value, followed by all NAs.

https://github.com/tyarkoni/pliers/blob/7e92ad50b3996a0df1f3dfcc91c1f006597b5728/pliers/extractors/base.py#L154

This results in object_ids like so:

0, 0, 1, 2, 3, 4, 5, 6

When it should be all 0s.

Broadcasting prior to making the pd.Series would fix this issue.

adelavega commented 4 years ago

@rbroc do you remember if we fixed this? I feel like we did but I'm not seeing a PR for it

rbroc commented 4 years ago

@rbroc do you remember if we fixed this? I feel like we did but I'm not seeing a PR for it

We didn't. I fixed it within the AudiosetLabelExtractor (which is where we observed the anomaly) by forcing durations to be a vector of length len(onsets).

I guess the question is whether we want to add explicit broadcasting at https://github.com/tyarkoni/pliers/blob/7e92ad50b3996a0df1f3dfcc91c1f006597b5728/pliers/extractors/base.py#L154

I feel like that might be a bit more convoluted than just sticking to the rule that ExtractorResults should have onset and duration as lists of the same length (with one value per row of the output). So far, the overwhelming majority of extractors do that already.

Rather than changing the way object_id is created, we might just want to add a line checking len(onsets) == len(durations) and triggering an Error if this is not the case.