carissalow / rapids

Reproducible Analysis Pipeline for Data Streams
http://www.rapids.science/
GNU Affero General Public License v3.0

Error when trying to validate RAPIDS location features #206

Closed · JulioV closed 1 year ago

JulioV commented 1 year ago

Discussed in https://github.com/carissalow/rapids/discussions/204

Originally posted by **VHolstein** January 16, 2023

Dear RAPIDS developers,

I have started building my own RAPIDS container for an app that we use. For validation, I would like to compare the GPS location features of the Barnett & Doryab pipelines. For testing, I have rearranged the location data into the same format as the aware_csv data. It seems to run fine, but I get this error message:

```
Warning message:
The following named parsers don't match the column names: local_date_time, local_start_date_time, local_end_date_time
```

when running the job

```
rule phone_locations_processed_with_datetime:
    input: data/interim/test1/phone_locations_processed.csv, data/interim/time_segments/test1_time_segments.csv, data/external/participant_files/test1.yaml
    output: data/interim/test1/phone_locations_processed_with_datetime.csv, data/interim/test1/phone_locations_processed_with_datetime.done
```

and the output files are empty. The GPS location data looks the same as the example CSV for aware_csv data. The beginning of my config file is:

```yaml
########################################################################################################################
#                                                 GLOBAL CONFIGURATION                                                 #
########################################################################################################################

# See https://www.rapids.science/latest/setup/configuration/#participant-files
PIDS: [test1, test2, test3]

# See https://www.rapids.science/latest/setup/configuration/#automatic-creation-of-participant-files
CREATE_PARTICIPANT_FILES:
  CSV_FILE_PATH: "data/external/ReMAP_test_participants.csv" # see docs for required format
  PHONE_SECTION:
    ADD: True
    IGNORED_DEVICE_IDS: []
  FITBIT_SECTION:
    ADD: False
    IGNORED_DEVICE_IDS: []
  EMPATICA_SECTION:
    ADD: False

# See https://www.rapids.science/latest/setup/configuration/#time-segments
TIME_SEGMENTS: &time_segments
  TYPE: PERIODIC # FREQUENCY, PERIODIC, EVENT
  FILE: "data/external/timesegments_remap.csv"
  INCLUDE_PAST_PERIODIC_SEGMENTS: TRUE # Only relevant if TYPE=PERIODIC, see docs

# See https://www.rapids.science/latest/setup/configuration/#timezone-of-your-study
TIMEZONE:
  TYPE: SINGLE
  SINGLE:
    TZCODE: Europe/Berlin
  MULTIPLE:
    TZCODES_FILE: /spm-data/vault-data4/ReMAP/project_vincent/data/time_segments/multiple_timezones_example.csv
    IF_MISSING_TZCODE: STOP
    DEFAULT_TZCODE: Europe/Berlin
    FITBIT:
      ALLOW_MULTIPLE_TZ_PER_DEVICE: False
      INFER_FROM_SMARTPHONE_TZ: False
```

Would you have an idea where this error might be coming from? I was unable to locate the problem. If needed, I can provide my entire config file or any of the other necessary helper files. Thanks a lot in advance!
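As a first sanity check on a CSV rearranged into the aware_csv format, it helps to confirm that the `timestamp` column is numeric and parses to plausible dates. A minimal sketch in Python, assuming pandas and the `remap_debug.csv` file discussed later in this thread; the epoch-milliseconds expectation is confirmed further down:

```python
import pandas as pd

# Minimal sanity check for a CSV prepared in the aware_csv format.
# The file name is illustrative; RAPIDS' expectation of epoch milliseconds
# is confirmed later in this thread.
df = pd.read_csv("remap_debug.csv")

# The timestamp column must be numeric, not a string date time.
assert pd.api.types.is_numeric_dtype(df["timestamp"]), "timestamp must be numeric"

# Interpreted as Unix epoch milliseconds, the values should fall inside the
# study period; dates around 1970 indicate second-resolution timestamps.
parsed = pd.to_datetime(df["timestamp"], unit="ms", utc=True)
print(parsed.min(), "to", parsed.max())
```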
JulioV commented 1 year ago

@VHolstein I started this ticket based on our discussion, can you attach your config and data files directly to this thread so that my colleagues and I can take a look please? Having the files openly accessible will also help people with potentially similar issues in the future.

Thank you!

VHolstein commented 1 year ago

I ran an updated version with 10 days of data for a single participant and a single location feature. This led to a more specific error:


```
Error in make_style(x[["color"]]) : 
  Unknown style specification: br_magenta
Calls: readable_datetime ... <Anonymous> -> .Call -> format -> as_datetime -> Ops.POSIXt

In addition: Warning message:
The following named parsers don't match the column names: local_date_time, local_start_date_time, local_end_date_time 
Execution halted
[Thu Jan 19 23:42:14 2023]
Error in rule phone_locations_processed_with_datetime:
    jobid: 2
    output: data/interim/test1/phone_locations_processed_with_datetime.csv, data/interim/test1/phone_locations_processed_with_datetime.done

RuleException:
CalledProcessError in line 126 of /rapids/rules/preprocessing.smk:
Command 'set -euo pipefail;  Rscript --vanilla /rapids/.snakemake/scripts/tmp0_apkpch.readable_datetime.R' returned non-zero exit status 1.
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2339, in run_wrapper
  File "/rapids/rules/preprocessing.smk", line 126, in __rule_phone_locations_processed_with_datetime
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 560, in _callback
  File "/opt/conda/envs/rapids/lib/python3.7/concurrent/futures/thread.py", line 57, in run
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 546, in cached_or_run
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2351, in run_wrapper
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
```

I suspect it has to do with the datetime specification somehow, but I don't understand it exactly. I have uploaded my config file, one participant's data with scrambled locations (including the participant file), and one location feature, and I used the wifi, calls, messages (etc.) CSVs from the example so that the analysis would run through. You can access the data here: https://github.com/VHolstein/fix_rapids_analysis if you want to look at it.

Thank you for taking a look at this!

JulioV commented 1 year ago

There's a problem with the timestamp column of remap_debug.csv: it should be a Unix timestamp rather than a string date time. But maybe this is just an issue with the data prepared for this bug report.

Can you update that and try again, and if the problem persists, upload a new test file, please? Thanks!
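If the column currently holds string date times, one way to regenerate proper timestamps is sketched below, assuming pandas, the remap_debug.csv file, and times recorded in Europe/Berlin (the study timezone from the config); column names may differ in your data:

```python
import pandas as pd

df = pd.read_csv("remap_debug.csv")

# Parse the string date times and localize them to the study timezone.
dt = pd.to_datetime(df["timestamp"]).dt.tz_localize("Europe/Berlin")

# Convert to Unix epoch milliseconds (the resolution RAPIDS expects,
# as confirmed later in this thread).
epoch = pd.Timestamp("1970-01-01", tz="UTC")
df["timestamp"] = (dt - epoch) // pd.Timedelta(milliseconds=1)

df.to_csv("remap_debug.csv", index=False)
```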

VHolstein commented 1 year ago

I have updated the file and tried again. This time no error message occurs, but the "original" problem remains: the phone_locations_processed_with_datetime.csv file is completely empty. Below is the output from running the pipeline. I cannot see any indication of why the output is empty, so I'm a bit unsure what the issue might be.

I also updated the remap_debug.csv file in the repository (https://github.com/VHolstein/fix_rapids_analysis) if you'd want to reproduce the issue.

```
Processing PERIODIC time segments for test1's data/external/participant_files/test1.yaml
[Mon Jan 30 16:55:08 2023]
Finished job 5.
3 of 12 steps (25%) done

[Mon Jan 30 16:55:08 2023]
rule phone_locations_processed_with_datetime:
    input: data/interim/test1/phone_locations_processed.csv, data/interim/time_segments/test1_time_segments.csv, data/external/participant_files/test1.yaml
    output: data/interim/test1/phone_locations_processed_with_datetime.csv, data/interim/test1/phone_locations_processed_with_datetime.done
    jobid: 2
    wildcards: pid=test1

Warning message:
Project requested R version '4.0.0' but '4.2.1' is currently being used 
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.5
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Warning message:
replacing previous import ‘lifecycle::last_warnings’ by ‘rlang::last_warnings’ when loading ‘pillar’ 

Attaching package: ‘lubridate’

The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union

Warning message:
The following named parsers don't match the column names: local_date_time, local_start_date_time, local_end_date_time 
Touching output file data/interim/test1/phone_locations_processed_with_datetime.done.
```

JulioV commented 1 year ago

Thank you! The issue is that the timestamps are missing three digits of resolution (milliseconds), so they are converted to date times around 1970 and therefore all rows are filtered out by the dates set in the participant file (Jan 21 to Jun 22).

The solution would be to add those three digits back, but I'm not sure whether you collected your data every second (in which case you could just append 000) or whether the milliseconds were lost during data wrangling.

```
   timestamp local_timezone local_date_time    
       <dbl> <chr>          <chr>              
1 1611221289 Europe/Berlin  1970-01-19 16:33:41
2 1611221765 Europe/Berlin  1970-01-19 16:33:41
3 1611225138 Europe/Berlin  1970-01-19 16:33:45
4 1611236906 Europe/Berlin  1970-01-19 16:33:56
5 1611237213 Europe/Berlin  1970-01-19 16:33:57
6 1611237517 Europe/Berlin  1970-01-19 16:33:57
```
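To make the resolution issue concrete: the first value in the table above lands in January 1970 when read as milliseconds but in January 2021 when read as seconds. A small illustration in Python (standard library only):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

ts = 1611221289  # first timestamp from the table above
berlin = ZoneInfo("Europe/Berlin")

# Read as epoch milliseconds (what RAPIDS expects): January 1970.
print(datetime.fromtimestamp(ts / 1000, tz=berlin))  # 1970-01-19 16:33:41.289000+01:00

# Read as epoch seconds: the intended date in January 2021.
print(datetime.fromtimestamp(ts, tz=berlin))         # 2021-01-21 10:28:09+01:00

# Hence the fix: multiply by 1000 (i.e. append "000") to restore milliseconds.
print(ts * 1000)  # 1611221289000
```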
VHolstein commented 1 year ago

Thanks for the note about the timestamps! We collected data at millisecond resolution, but I converted it to seconds. For quick debugging, I just appended 000 to the end of each timestamp and uploaded an updated version of the CSV to the GitHub repo (https://github.com/VHolstein/fix_rapids_analysis). I ran a test and it completed the conversion step (step 4), so I now have the phone_locations_processed_with_datetime.csv files. However, it already fails in the next step:

```
[Fri Feb  3 21:53:10 2023]
rule phone_locations_add_doryab_extra_columns:
    input: data/interim/test1/phone_locations_processed_with_datetime.csv
    output: data/interim/test1/phone_locations_processed_with_datetime_with_doryab_columns_episodes.csv
    jobid: 1
    wildcards: pid=test1

/rapids/src/features/phone_locations/doryab/add_doryab_extra_columns.py:35: UserWarning: We could not infer a home location because there are no location records logged during midnight to 6am.

[Fri Feb  3 21:53:14 2023]
Finished job 1.
5 of 12 steps (42%) done
```

Is this an issue with the density of the data, or is there some other error in my CSV that I'm missing?

JulioV commented 1 year ago

It's related to the warning message you can see: the default algorithm uses data collected between midnight and 6 am to compute a home location, which in turn is used to compute features like time at home. I'd recommend reading the docs carefully, because the location features have many parameters that will greatly affect the features you get; we give general recommendations for some of them.
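To illustrate the idea, here is a sketch of the described behavior, not the actual RAPIDS implementation in add_doryab_extra_columns.py; it assumes a pandas data frame with local_date_time, double_latitude, and double_longitude columns:

```python
import pandas as pd

def infer_home_location(df: pd.DataFrame) -> tuple:
    """Sketch: median coordinate of records logged between midnight and 6 am,
    mirroring the warning message quoted above."""
    hours = pd.to_datetime(df["local_date_time"]).dt.hour
    night = df[(hours >= 0) & (hours < 6)]
    if night.empty:
        raise ValueError("no location records logged during midnight to 6am")
    return night["double_latitude"].median(), night["double_longitude"].median()
```

So if a dataset has no overnight samples at all (for example, if collection pauses at night), home-dependent features such as time at home cannot be computed no matter how dense the daytime data is.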

Try running the pipeline for a single person with their entire dataset and let us know if you get any crashes or unexpected output. I'll close this ticket for now, but feel free to open a new one or start a new discussion.

Thanks!