carissalow / rapids

Reproducible Analysis Pipeline for Data Streams
http://www.rapids.science/
GNU Affero General Public License v3.0
37 stars 20 forks source link

Potential discrepancies in computed [PHONE_APPLICATIONS_FOREGROUND][PROVIDERS][RAPIDS] features #196

Closed jenniferfedor closed 1 year ago

jenniferfedor commented 1 year ago

Hi @JulioV and @Meng6, we've noticed some potential discrepancies in computed [PHONE_APPLICATIONS_FOREGROUND][PROVIDERS][RAPIDS] features.

Specifically:

  1. There are some instances in which the sumduration feature for a given application and time segment is equal to 0 but the corresponding timeoffirstuse feature has a non-missing value
  2. There are some instances in which the timeoffirstuse feature for a given application and time segment is missing when the corresponding sumduration feature has a value > 0 and < 60

It seems that all time segments (in this particular case, hourly) and various single apps and app categories are affected. Is this behavior expected?

I'm running the latest version of RAPIDS on MacOS Monterey version 12.4.

JulioV commented 1 year ago

@Meng6 what would you need to debug this error if you have some time? Some test data to replicate the problem?

Meng6 commented 1 year ago

Sure, @JulioV. Before debugging, I want to double check the following things with @jenniferfedor:

  1. There are two parameters in config.yaml file which might lead to the 1st case you mentioned above: IGNORE_EPISODES_SHORTER_THAN and IGNORE_EPISODES_LONGER_THAN.
  2. Do you set INCLUDE_EPISODE_FEATURES=True? If so, could you check if the phone_app_episodes.csv file is correct or not for the above 2 cases?
jenniferfedor commented 1 year ago

Thanks, @Meng6! And thank you for the suggestion to check the phone_app_episodes.csv files.

We used the default values for the IGNORE_EPISODES_SHORTER_THAN and IGNORE_EPISODES_LONGER_THAN parameters (0 and 300 minutes, respectively). Spot checking a few participants, it looks like the instances where sumduration was zero but the corresponding timeoffirstuse was non-missing were associated with app episodes that exceeded that threshold value. The timeoffirstuse value is consistent with the start_timestamp for that episode.

We did set INCLUDE_EPISODE_FEATURES = True. Again spot checking a few participants, it looks like timeoffirstuse is missing for a given time segment when the corresponding sumduration is non-zero when: (1) the app episode spans a time segment (e.g., begins during the previous hourly segment and ends during the current hourly segment) and (2) no other app episodes begin during the current time segment. If one or more episodes also begin during the current time segment, timeoffirstuse is non-missing and is consistent with the start time of the first episode to begin in the current time segment.

Please let me know if you need any other info!

Meng6 commented 1 year ago

Thanks, @jenniferfedor!

I think the reason why "timeoffirstuse was non-missing while the episodes larger than 300 minutes were dropped" is that we only dropped episodes dataframe without updating the event dataframe. Hi @JulioV, do you think if we need to update the event dataframe accordingly?

In addition, instead of checking phone_app_episodes.csv file, @jenniferfedor could you check the data after this line?

JulioV commented 1 year ago

I think the fix is easy, when the user requests episode features we compute event and episode features with episodes data. When the user does not request episode features we compute event features with event data

Meng6 commented 1 year ago

Hi @JulioV, does it mean IGNORE_EPISODES_SHORTER_THAN and IGNORE_EPISODES_LONGER_THAN only work when INCLUDE_EPISODE_FEATURES=True?

jenniferfedor commented 1 year ago

Thank you, @Meng6! Checking the output of this line with the events and episodes data for an affected participant, time segment, and app:

  1. In the first case where sumduration is 0 but timeoffirstuse is not missing, the filtered episodes dataframe is empty (presumably because the associated episode, which exceeded the IGNORE_EPISODES_LONGER_THAN threshold, is dropped), but there are data in the filtered events dataframe. That seems to explain why we have a value for the event feature but not for the episode feature (which is really NA and then filled with 0)
  2. In the second case where timeoffirstuse is missing but there is a non-zero value for sumduration, there were no data for an affected app in the filtered events dataframe, but there were data for that app in the filtered episodes dataframe. That seems to explain why we have a value for the episode feature but not for the event feature

$~$

@JulioV, thank you for suggesting that solution! To implement this, we can edit the rapids_features() function to compute event and episode features with the episodes data if requested and event features with the events data otherwise. That is an easy fix! We will also have to make a few edits to the chunk_episodes() function because some columns required to compute the event features are presently not included in the episodes data after we filter the data by time segment. Specifically those edits are:

I did test making all of those proposed changes and it seems to resolve both issues. Spot checking the same participant, time segments, and apps as above:

  1. In the first case where sumduration was 0 but timeoffirstuse was not missing, no features are computed for the time segment. I think this makes sense because the associated app episode was dropped for exceeding the IGNORE_EPISODES_LONGER_THAN threshold, so there are no data from which to compute event or episode features
  2. In the second case where timeoffirstuse was missing but there was a non-zero value for sumduration, the computed value of sumduration is unchanged, but we now have a value for timeoffirstuse, which corresponds to the start of the time segment. I think this also makes sense because this app episode spanned time segments, so timeoffirstuse now corresponds to the time the app was first used within this time segment, despite the app episode beginning in a previous time segment

Does that all sound okay to you both? Thanks so much again for your help!

JulioV commented 1 year ago

Thanks a lot, this all makes sense and it is a great plan. Couple of comments; I believe we can rename the start_timestamp column instead of making a copy to save memory. We should also add tests for this fix and the other thread to make sure we have test coverage in the future for these edge cases. @ChinW97 has experience implementing such tests.

Please submit a Pull Request and @Meng6 and I will review before we merge and publish the changes

Meng6 commented 1 year ago

Thanks so much for working on this, @jenniferfedor !

As the chunk_episodes() function is also called by other sensors' episode feature extraction scripts, I'm worried that updating this function would lead to issues.

Does it make sense? @jenniferfedor @JulioV

jenniferfedor commented 1 year ago

Great, thank you both! And thank you for that suggestion, @Meng6 -- that makes sense to me!

JulioV commented 1 year ago

Fixed