carissalow / rapids

Reproducible Analysis Pipeline for Data Streams
http://www.rapids.science/
GNU Affero General Public License v3.0

Error parsing Fitbit heartrate summary JSON data in pull_wearable_data rule #227

Closed: jenniferfedor closed this issue 10 months ago

jenniferfedor commented 10 months ago

When processing Fitbit heartrate summary data for a particular device from a single participant using the Fitbit JSON MySQL data stream, we encountered the following error when executing the pull_wearable_data rule:

rule pull_wearable_data:
    input: data/external/participant_files/p1170.yaml, src/data/streams/rapids_columns.yaml, src/data/streams/fitbitjson_mysql/format.yaml, src/data/streams/fitbitjson_mysql/container.R, src/data/streams/mutations/fitbit/parse_heartrate_summary_json.py, src/data/streams/mutations/fitbit/add_zero_timestamp.py
    output: data/raw/p1170/fitbit_heartrate_summary_raw.csv
    jobid: 1
    wildcards: pid=p1170, device_type=fitbit, sensor=heartrate_summary

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Warning message:
package ‘readr’ was built under R version 4.0.5 

Processing FITBIT_HEARTRATE_SUMMARY for cf0992de-be2e-4070-ac6c-2f71f857aab0
Executing the following query to download data: SELECT device_id,fitbit_data FROM fitbit_data_from_api_v2 WHERE device_id = 'cf0992de-be2e-4070-ac6c-2f71f857aab0'
Applying mutation script src/data/streams/mutations/fitbit/parse_heartrate_summary_json.py

Error in `mutate_cols()`:
! Problem with `mutate()` input `..1`.
✖ missing value where TRUE/FALSE needed
ℹ Input `..1` is `(function (.cols = everything(), .fns = NULL, ..., .names = NULL) ...`.
Caused by error in `if (!is.character(value) && !is.nan(value)) ...`:
! missing value where TRUE/FALSE needed
Backtrace:
     ▆
  1. ├─global mutate_data(mutation_scripts, renamed_data, data_configuration)
  2. │ └─data %>% ...
  3. ├─dplyr::mutate(., across(where(is.list), fix_pandas_nan_in_string_columns))
  4. ├─dplyr:::mutate.data.frame(., across(where(is.list), fix_pandas_nan_in_string_columns))
  5. │ └─dplyr:::mutate_cols(.data, ...)
  6. │   ├─base::withCallingHandlers(...)
  7. │   └─mask$eval_all_mutate(quo)
  8. ├─global `<fn>`(heartrate_daily_restinghr)
  9. │ └─base::vapply(...)
 10. │   └─FUN(X[[i]], ...)
 11. └─base::.handleSimpleError(...)
 12.   └─dplyr (local) h(simpleError(msg, call))
 13.     └─rlang::abort(...)
Execution halted


We are using RAPIDS v1.9.4 on Ubuntu 20.04. The error appears to be caused by the use of None to represent missing values in the src/data/streams/mutations/fitbit/parse_heartrate_summary_json.py mutation script, which is executed within the src/data/streams/pull_wearable_data.R script via {reticulate}. In the Python script, missing values for expected columns are set to None. None values in a pandas Series (e.g., a DataFrame column) are normally coerced to NaN when other numeric values are present, and Python's NaN is also interpreted as NaN within R.

However, this device for this participant had only one row of Fitbit heartrate summary data, with a missing value for heartrate_daily_restinghr that was set to None. Because the column contained no other numeric values, the None was not coerced to NaN and was interpreted by R as NULL. Calling is.nan() on NULL returns a logical vector of length 0 rather than a single TRUE or FALSE, so the condition in if (!is.character(value) && !is.nan(value)) evaluates to NA, which produces the "missing value where TRUE/FALSE needed" error. To account for this, we can replace any instances of None in the mutation script with np.NaN.
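The pandas side of this can be reproduced outside RAPIDS with a minimal sketch. The values below are made up for illustration and are not taken from the affected participant; only the column name heartrate_daily_restinghr comes from the traceback. The sketch uses np.nan, the lowercase spelling of np.NaN, on the assumption that it is the safer choice across NumPy versions (NumPy 2.0 removed the np.NaN alias).

```python
import numpy as np
import pandas as pd

COLUMN = "heartrate_daily_restinghr"  # column name from the traceback

# Two rows, one missing: a real number is present, so pandas infers
# float64 and stores the missing entry as NaN.
mixed = pd.DataFrame({COLUMN: [62.0, None]})

# One row, value missing: there is nothing numeric to infer from, so the
# column keeps object dtype and still holds the Python None.
only_none = pd.DataFrame({COLUMN: [None]})

# Same single row, but with np.nan as the missing-value marker: the
# column is float64 even though no real number is present.
only_nan = pd.DataFrame({COLUMN: [np.nan]})

for label, frame in [("mixed", mixed), ("only_none", only_none), ("only_nan", only_nan)]:
    value = frame[COLUMN].iloc[-1]
    print(f"{label}: dtype={frame[COLUMN].dtype}, last value={value!r}")
```

In the single-row None case the column is an object column, which is consistent with it reaching the where(is.list) branch in the backtrace above; with np.nan the column stays numeric.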

jenniferfedor commented 10 months ago

Fixed in #226.