carissalow / rapids

Reproducible Analysis Pipeline for Data Streams
http://www.rapids.science/
GNU Affero General Public License v3.0

Aggregating data across participants #216

Closed sjgiorgi closed 1 year ago

sjgiorgi commented 1 year ago

I'd like to aggregate each sensor's data across the entire study time span. In Segment Examples, I see examples of how to do this for daily, weekly, etc. time chunks. Is there a way to make 1 time chunk per participant?

What I've tried is using one event per participant, where each event spans the entire duration of the study (the maximum timestamp across all sensors is used as the end of the participant's study). Is this correct? Is there an easier way to do this within RAPIDS? Right now I'm manually checking the timestamps across all sensors in order to find the maximum.

jenniferfedor commented 1 year ago

Hi @sjgiorgi, thank you for using RAPIDS! Currently, creating event segments as you describe would be the only way to extract features across the entire duration of each participant-specific study period. The ability to automatically pull those event segment start times and lengths based on all sensed timestamps is something we could potentially add as an enhancement in the future. Thanks for bringing this to our attention!
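
Until such an enhancement exists, the start time and length of a per-participant event segment could be derived outside RAPIDS. The following is a minimal sketch, not part of RAPIDS: `EventSegment`, `format_length`, and `overall_segment` are hypothetical names, timestamps are assumed to be Unix epoch milliseconds, and the `length` string format is an assumption that should be checked against the RAPIDS time segments documentation before writing a segments file.

```python
from typing import Dict, List, NamedTuple

MS_PER_SECOND = 1000


class EventSegment(NamedTuple):
    """Hypothetical container for one event segment row (not a RAPIDS type)."""
    label: str
    event_timestamp: int  # segment start, Unix epoch milliseconds (assumed)
    length: str           # duration string; the format is an assumption
    device_id: str


def format_length(duration_ms: int) -> str:
    """Format a duration as 'XD XH XM XS', rounding up to whole seconds."""
    total_seconds = -(-duration_ms // MS_PER_SECOND)  # ceiling division
    days, rem = divmod(total_seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days}D {hours}H {minutes}M {seconds}S"


def overall_segment(
    device_id: str,
    sensor_timestamps: Dict[str, List[int]],
    label: str = "study",
) -> EventSegment:
    """Build one segment spanning the earliest to latest timestamp of any sensor."""
    all_ts = [t for ts in sensor_timestamps.values() for t in ts]
    if not all_ts:
        raise ValueError("no timestamps available for this participant")
    start, end = min(all_ts), max(all_ts)
    return EventSegment(label, start, format_length(end - start), device_id)


# Made-up example values: two sensors with different end times.
segment = overall_segment(
    "device-a",
    {"locations": [0, 86_400_000], "wifi": [3_600_000, 90_061_000]},
)
print(segment)
```

The segment starts at the earliest timestamp of any sensor and covers everything up to the latest one, which replaces the manual check across sensors described above.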

sjgiorgi commented 1 year ago

Thank you!

Is using the maximum timestamp across all sensors correct? Or do we need a separate timestamp for each sensor?

For example, if location data ends on May 15 and wifi data ends on May 20, will setting the event end timestamp to May 20 have any effect on the aggregate location features, given that the last 5 days of the segment have no location data?

jenniferfedor commented 1 year ago

Hi @sjgiorgi, that’s a great question! I was not sure myself, so I extracted features for a representative participant from one of our studies on two kinds of event segments: one per sensor, delineated by that sensor's minimum and maximum timestamps, and one "overall" segment delineated by the minimum and maximum timestamps across all sensors.

We had about 1 month of activity recognition, battery, calls, locations, and screen data available for this participant. All of the sensor-specific features extracted on each respective sensor-specific event segment were exactly equal to the corresponding features extracted on the "overall" event segment. Values for phone data yield features were not equal across these segments, but the differences were fairly minimal (in the range of about 0.03).

Based on this, I think creating one event segment per participant using the maximum timestamp across all sensors should be okay. Alternatively, to be extra safe, you could consider creating one event segment per participant and sensor and discarding the irrelevant features (e.g., activity recognition features extracted within the battery event segment and vice versa, but potentially retaining all data yield features) after processing. Please let us know if you have any additional questions!
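
The "extra safe" option above could be sketched roughly as follows. This is only an illustration under stated assumptions: `per_sensor_bounds` and `keep_relevant_features` are hypothetical helpers, and the feature column names (prefixed by sensor, with a `data_yield` prefix for data yield features) are made up for the example and will differ from the actual RAPIDS output columns.

```python
from typing import Dict, List, Tuple


def per_sensor_bounds(
    sensor_timestamps: Dict[str, List[int]],
) -> Dict[str, Tuple[int, int]]:
    """Return (min, max) timestamps per sensor, skipping sensors with no data."""
    return {
        sensor: (min(ts), max(ts))
        for sensor, ts in sensor_timestamps.items()
        if ts
    }


def keep_relevant_features(
    row: Dict[str, float],
    segment_sensor: str,
    always_keep_prefixes: Tuple[str, ...] = ("data_yield",),
) -> Dict[str, float]:
    """Keep features of the segment's own sensor plus any always-kept prefixes."""
    return {
        name: value
        for name, value in row.items()
        if name.startswith(segment_sensor) or name.startswith(always_keep_prefixes)
    }


# Made-up timestamps (epoch milliseconds) and feature values for illustration.
bounds = per_sensor_bounds({"locations": [5, 1, 9], "wifi": [2, 20], "calls": []})
battery_segment_row = {
    "battery_count": 3.0,
    "activity_recognition_count": 7.0,
    "data_yield_ratio": 0.9,
}
filtered = keep_relevant_features(battery_segment_row, "battery")
print(bounds)
print(filtered)
```

Each sensor gets its own segment bounds, and features from other sensors are dropped from that segment's row after processing, while data yield features are retained.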

sjgiorgi commented 1 year ago

This is super helpful, thank you!