Closed by JulioV 1 year ago
@VHolstein I started this ticket based on our discussion, can you attach your config and data files directly to this thread so that my colleagues and I can take a look please? Having the files openly accessible will also help people with potentially similar issues in the future.
Thank you!
I ran an updated version with 10 days of data for a single participant and a single location feature. This led to a more specific error:
Error in make_style(x[["color"]]) :
Unknown style specification: br_magenta
Calls: readable_datetime ... <Anonymous> -> .Call -> format -> as_datetime -> Ops.POSIXt
In addition: Warning message:
The following named parsers don't match the column names: local_date_time, local_start_date_time, local_end_date_time
Execution halted
[Thu Jan 19 23:42:14 2023]
Error in rule phone_locations_processed_with_datetime:
jobid: 2
output: data/interim/test1/phone_locations_processed_with_datetime.csv, data/interim/test1/phone_locations_processed_with_datetime.done
RuleException:
CalledProcessError in line 126 of /rapids/rules/preprocessing.smk:
Command 'set -euo pipefail; Rscript --vanilla /rapids/.snakemake/scripts/tmp0_apkpch.readable_datetime.R' returned non-zero exit status 1.
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2339, in run_wrapper
File "/rapids/rules/preprocessing.smk", line 126, in __rule_phone_locations_processed_with_datetime
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 560, in _callback
File "/opt/conda/envs/rapids/lib/python3.7/concurrent/futures/thread.py", line 57, in run
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 546, in cached_or_run
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2351, in run_wrapper
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
I suspect that it has to do with the datetime specification somehow, but I don't understand it exactly. I have uploaded my config file, one participant with scrambled locations (including the participant file), and one location feature, and used the wifi, calls, messages, etc. CSVs from the example so that the analysis would run through. You can access the data here: https://github.com/VHolstein/fix_rapids_analysis if you want to look at it.
Thank you for taking a look at this!
There's a problem with the timestamp column of remap_debug.csv: it should be a unix timestamp as opposed to a string date time. But maybe this is just an issue with the data prepared for this bug report.
Can you update that and try again, and if the problem persists, upload a new test file, please? Thanks!
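For anyone hitting the same problem: a quick sketch (using pandas, not RAPIDS code; the file and column names are taken from this thread) of converting a string date-time column into the unix-millisecond timestamps the pipeline expects:

```python
import pandas as pd

# Hypothetical sketch: the file has a string date-time column where
# RAPIDS expects a unix timestamp in milliseconds.
df = pd.DataFrame({"timestamp": ["2021-01-21 09:28:09"]})

# Parse the string date times (here assumed UTC) and emit unix
# milliseconds instead.
parsed = pd.to_datetime(df["timestamp"], utc=True)
df["timestamp"] = parsed.astype("int64") // 10**6

print(df["timestamp"].iloc[0])  # 1611221289000
```

If the original strings are in local time rather than UTC, they would need to be localized to the right timezone before conversion.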
I have updated the file and tried again. This time no error message occurs, but the "original" problem remains: the phone_locations_processed_with_datetime.csv file is completely empty. Below is the output when running the pipeline. I cannot see any indication of why the output is empty, so I'm a bit unsure what the issue might be.
I also updated the remap_debug.csv file in the repository (https://github.com/VHolstein/fix_rapids_analysis) if you'd want to reproduce the issue.
Processing PERIODIC time segments for test1's data/external/participant_files/test1.yaml
[Mon Jan 30 16:55:08 2023]
Finished job 5.
3 of 12 steps (25%) done
[Mon Jan 30 16:55:08 2023]
rule phone_locations_processed_with_datetime:
input: data/interim/test1/phone_locations_processed.csv, data/interim/time_segments/test1_time_segments.csv, data/external/participant_files/test1.yaml
output: data/interim/test1/phone_locations_processed_with_datetime.csv, data/interim/test1/phone_locations_processed_with_datetime.done
jobid: 2
wildcards: pid=test1
Warning message:
Project requested R version '4.0.0' but '4.2.1' is currently being used
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✔ ggplot2 3.3.2 ✔ purrr 0.3.4
✔ tibble 3.1.7 ✔ dplyr 1.0.5
✔ tidyr 1.1.2 ✔ stringr 1.4.0
✔ readr 1.4.0 ✔ forcats 0.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
Warning message:
replacing previous import ‘lifecycle::last_warnings’ by ‘rlang::last_warnings’ when loading ‘pillar’
Attaching package: ‘lubridate’
The following objects are masked from ‘package:base’:
date, intersect, setdiff, union
Warning message:
The following named parsers don't match the column names: local_date_time, local_start_date_time, local_end_date_time
Touching output file data/interim/test1/phone_locations_processed_with_datetime.done.
Thank you! The issue is that the timestamps are missing three digits of resolution (milliseconds), so they are converted to date times around 1970 and therefore all rows are filtered out by the date range set in the participant file (Jan 21 to Jun 22).
The solution would be to add those three digits back, but I'm not sure whether you collected your data every second (in which case you could just append 000) or whether they were lost during data wrangling.
timestamp local_timezone local_date_time
<dbl> <chr> <chr>
1 1611221289 Europe/Berlin 1970-01-19 16:33:41
2 1611221765 Europe/Berlin 1970-01-19 16:33:41
3 1611225138 Europe/Berlin 1970-01-19 16:33:45
4 1611236906 Europe/Berlin 1970-01-19 16:33:56
5 1611237213 Europe/Berlin 1970-01-19 16:33:57
6 1611237517 Europe/Berlin 1970-01-19 16:33:57
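To illustrate what is happening in the table above (a sketch, not RAPIDS code): a value in seconds that the pipeline interprets as milliseconds is effectively divided by 1000, collapsing 51 years into about 19 days, which is why every row lands on 1970-01-19:

```python
from datetime import datetime, timezone

ts = 1611221289  # first timestamp from the table, in SECONDS

# Interpreted correctly as seconds since the unix epoch:
print(datetime.fromtimestamp(ts, tz=timezone.utc))
# 2021-01-21 09:28:09+00:00

# Interpreted as milliseconds (what the pipeline expects), the value
# is divided by 1000 first, so the date collapses to January 1970:
print(datetime.fromtimestamp(ts / 1000, tz=timezone.utc))
# 1970-01-19 15:33:41.289000+00:00  (16:33:41 in Europe/Berlin, UTC+1)
```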
Thanks for the note about the timestamps! We collected data at millisecond resolution but I converted it to seconds. For quick debugging, I just appended 000 to the end of each timestamp and uploaded an updated version of the CSV to the GitHub repo (https://github.com/VHolstein/fix_rapids_analysis). I ran a test run and it completed the conversion step 4, so I now have the phone_locations_processed_with_datetime.csv files. However, it already fails in the next step:
[Fri Feb 3 21:53:10 2023]
rule phone_locations_add_doryab_extra_columns:
input: data/interim/test1/phone_locations_processed_with_datetime.csv
output: data/interim/test1/phone_locations_processed_with_datetime_with_doryab_columns_episodes.csv
jobid: 1
wildcards: pid=test1
/rapids/src/features/phone_locations/doryab/add_doryab_extra_columns.py:35: UserWarning: We could not infer a home location because there are no location records logged during midnight to 6am.
[Fri Feb 3 21:53:14 2023]
Finished job 1.
5 of 12 steps (42%) done
Is this an issue with the density of the data, or is there some other error in my CSV that I'm missing?
It's related to the warning message in the log: the default algorithm uses data collected between midnight and 6 am to compute a home location, which in turn is used to compute features like time at home. I'd recommend reading the docs carefully, because location features have many parameters that will greatly affect the features you get; we give general recommendations for some of them.
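For context, here is a simplified sketch of the idea behind that heuristic. This is not RAPIDS's actual implementation (the Doryab provider clusters locations; here we just take the median of night-time fixes), and the column names and values are assumptions for illustration:

```python
import pandas as pd

# Toy location records; values are made up for illustration.
df = pd.DataFrame({
    "local_date_time": pd.to_datetime([
        "2021-01-21 02:15:00",  # night-time fix -> counts toward home
        "2021-01-21 04:40:00",  # night-time fix -> counts toward home
        "2021-01-21 14:05:00",  # daytime fix -> ignored
    ]),
    "latitude":  [52.5200, 52.5201, 52.5400],
    "longitude": [13.4050, 13.4049, 13.3900],
})

# Keep only records logged between midnight and 6 am.
night = df[df["local_date_time"].dt.hour < 6]

if night.empty:
    # This is the situation behind the UserWarning in the log above:
    # with no night-time records, no home location can be inferred.
    print("cannot infer home: no records between midnight and 6 am")
else:
    home = (night["latitude"].median(), night["longitude"].median())
    print("inferred home:", home)
```

With data like the toy frame above the heuristic works; with a dataset that has no fixes between midnight and 6 am, it falls through to the warning branch, which is what the log shows.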
Try running the pipeline for a single person with their entire dataset and let us know if you get any crash or unexpected output. I'll close this ticket for now, but feel free to open a new one or start a new discussion.
Thanks!
Discussed in https://github.com/carissalow/rapids/discussions/204