Closed jenniferfedor closed 1 year ago
@Meng6 what would you need to debug this error if you have some time? Some test data to replicate the problem?
Sure, @JulioV. Before debugging, I want to double check the following things with @jenniferfedor:
IGNORE_EPISODES_SHORTER_THAN
and IGNORE_EPISODES_LONGER_THAN
.INCLUDE_EPISODE_FEATURES
=True? If so, could you check if the phone_app_episodes.csv
file is correct or not for the above 2 cases?Thanks, @Meng6! And thank you for the suggestion to check the phone_app_episodes.csv
files.
We used the default values for the IGNORE_EPISODES_SHORTER_THAN
and IGNORE_EPISODES_LONGER_THAN
parameters (0 and 300 minutes, respectively). Spot checking a few participants, it looks like the instances where sumduration
was zero but the corresponding timeoffirstuse
was non-missing were associated with app episodes that exceeded that threshold value. The timeoffirstuse
value is consistent with the start_timestamp
for that episode.
We did set INCLUDE_EPISODE_FEATURES
= True. Again spot checking a few participants, it looks like timeoffirstuse
is missing for a given time segment when the corresponding sumduration
is non-zero when: (1) the app episode spans a time segment (e.g., begins during the previous hourly segment and ends during the current hourly segment) and (2) no other app episodes begin during the current time segment. If one or more episodes also begin during the current time segment, timeoffirstuse
is non-missing and is consistent with the start time of the first episode to begin in the current time segment.
Please let me know if you need any other info!
Thanks, @jenniferfedor!
I think the reason why "timeoffirstuse
was non-missing while the episodes larger than 300 minutes were dropped" is that we only dropped episodes dataframe without updating the event dataframe. Hi @JulioV, do you think if we need to update the event dataframe accordingly?
In addition, instead of checking phone_app_episodes.csv
file, @jenniferfedor could you check the data
after this line?
I think the fix is easy, when the user requests episode features we compute event and episode features with episodes data. When the user does not request episode features we compute event features with event data
Hi @JulioV, does it mean IGNORE_EPISODES_SHORTER_THAN and IGNORE_EPISODES_LONGER_THAN only work when INCLUDE_EPISODE_FEATURES=True?
Thank you, @Meng6! Checking the output of this line with the events and episodes data for an affected participant, time segment, and app:
sumduration
is 0 but timeoffirstuse
is not missing, the filtered episodes dataframe is empty (presumably because the associated episode, which exceeded the IGNORE_EPISODES_LONGER_THAN
threshold, is dropped), but there are data in the filtered events dataframe. That seems to explain why we have a value for the event feature but not for the episode feature (which is really NA and then filled with 0) timeoffirstuse
is missing but there is a non-zero value for sumduration
, there were no data for an affected app in the filtered events dataframe, but there were data for that app in the filtered episodes dataframe. That seems to explain why we have a value for the episode feature but not for the event feature $~$
@JulioV, thank you for suggesting that solution! To implement this, we can edit the rapids_features()
function to compute event and episode features with the episodes data if requested and event features with the events data otherwise. That is an easy fix! We will also have to make a few edits to the chunk_episodes()
function because some columns required to compute the event features are presently not included in the episodes data after we filter the data by time segment. Specifically those edits are:
local_hour
and local_minute
columns, which are required to compute both timeoffirstuse
and timeoflastuse
, rather than drop them, and timestamp
column. Could this just be a copy of start_timestamp
? I did test making all of those proposed changes and it seems to resolve both issues. Spot checking the same participant, time segments, and apps as above:
sumduration
was 0 but timeoffirstuse
was not missing, no features are computed for the time segment. I think this makes sense because the associated app episode was dropped for exceeding the IGNORE_EPISODES_LONGER_THAN
threshold, so there are no data from which to compute event or episode features timeoffirstuse
was missing but there was a non-zero value for sumduration
, the computed value of sumduration
is unchanged, but we now have a value for timeoffirstuse
, which corresponds to the start of the time segment. I think this also makes sense because this app episode spanned time segments, so timeoffirstuse
now corresponds to the time the app was first used within this time segment, despite the app episode beginning in a previous time segment Does that all sound okay to you both? Thanks so much again for your help!
Thanks a lot, this all makes sense and it is a great plan. Couple of comments; I believe we can rename the start_timestamp column instead of making a copy to save memory. We should also add tests for this fix and the other thread to make sure we have test coverage in the future for these edge cases. @ChinW97 has experience implementing such tests.
Please submit a Pull Request and @Meng6 and I will review before we merge and publish the changes
Thanks so much for working on this, @jenniferfedor !
As the chunk_episodes()
function is also called by other sensors' episode feature extraction scripts, I'm worried that updating this function would lead to issues.
timestamp
column in chunk_episodes()
function, can we update the phone_applications_foreground/rapids/main.py script? For example, we can check if timestamp
column exists before sorting. If not, use start_timestamp
instead.Does it make sense? @jenniferfedor @JulioV
Great, thank you both! And thank you for that suggestion, @Meng6 -- that makes sense to me!
Fixed
Hi @JulioV and @Meng6, we've noticed some potential discrepancies in computed
[PHONE_APPLICATIONS_FOREGROUND][PROVIDERS][RAPIDS]
features.Specifically:
sumduration
feature for a given application and time segment is equal to 0 but the correspondingtimeoffirstuse
feature has a non-missing valuetimeoffirstuse
feature for a given application and time segment is missing when the correspondingsumduration
feature has a value > 0 and < 60It seems that all time segments (in this particular case, hourly) and various single apps and app categories are affected. Is this behavior expected?
I'm running the latest version of RAPIDS on MacOS Monterey version 12.4.