carissalow / rapids

Reproducible Analysis Pipeline for Data Streams
http://www.rapids.science/
GNU Affero General Public License v3.0

Corrupted CSV file output during rule resample_episodes_with_datetime when computing battery behavioral feature #197

Closed zacharyfried closed 1 year ago

zacharyfried commented 1 year ago

We are running into an issue with computing the battery behavioral feature using the RAPIDS provider for a small number of our participants. The computation runs normally if we exclude these participants. In short, the rule resample_episodes_with_datetime seems to be producing a corrupted phone_battery_episodes_resampled_with_datetime.csv file for these participants. Several columns have names that appear to be swapped with other columns, and some columns contain values that don't match any of the expected categories (e.g., the 3rd column is labeled battery_level but contains values ranging from 4567 to 6310, while the 4th column is labeled battery_status but contains values ranging from 46 to 97). When this CSV file is used as input for the rule phone_battery_python_features, we get the following ValueError:

[Thu Nov  3 15:05:40 2022]
rule phone_battery_python_features:
    input: data/interim/PID01/phone_battery_episodes_resampled_with_datetime.csv, data/interim/time_segments/PID01_time_segments_labels.csv
    output: data/interim/PID01/phone_battery_features/phone_battery_python_rapids.csv
    jobid: 2622
    wildcards: pid=PID01, provider_key=rapids
    resources: tmpdir=/tmp
RAPIDS: Processing phone_battery rapids daily_RR0SS
Traceback (most recent call last):
  File "/data/rapids_covid_data/clean_rapids/rapids/.snakemake/scripts/tmpengqpnw5.entry.py", line 21, in <module>
    sensor_features = fetch_provider_features(provider, provider_key, sensor_key, sensor_data_files, time_segments_file)
  File "/data/rapids_covid_data/clean_rapids/rapids/rules/../src/features/utils/utils.py", line 109, in fetch_provider_features
    features = feature_function(sensor_data_files, time_segment, provider, filter_data_by_segment=filter_data_by_segment, chunk_episodes=chunk_episodes)
  File "src/features/phone_battery/rapids/main.py", line 16, in rapids_features
    battery_data = filter_data_by_segment(battery_data, time_segment)
  File "/data/rapids_covid_data/clean_rapids/rapids/rules/../src/features/utils/utils.py", line 36, in filter_data_by_segment
    data = chunk_episodes(data)
  File "/data/rapids_covid_data/clean_rapids/rapids/rules/../src/features/utils/utils.py", line 84, in chunk_episodes
    merged_sensor_episodes["local_start_date_time"] = pd.concat([data["local_start_date_time"].dt.tz_convert(tz) for tz, data in merged_sensor_episodes.groupby("local_timezone")]).apply(lambda x: x.tz_localize(None).replace(microsecond=0))
  File "/data/rapids_covid_data/clean_rapids/rapids/rules/../src/features/utils/utils.py", line 84, in <listcomp>
    merged_sensor_episodes["local_start_date_time"] = pd.concat([data["local_start_date_time"].dt.tz_convert(tz) for tz, data in merged_sensor_episodes.groupby("local_timezone")]).apply(lambda x: x.tz_localize(None).replace(microsecond=0))
  File "/data/mamba/mambaforge/envs/rapids_r4_0/lib/python3.7/site-packages/pandas/core/accessor.py", line 93, in f
    return self._delegate_method(name, *args, **kwargs)
  File "/data/mamba/mambaforge/envs/rapids_r4_0/lib/python3.7/site-packages/pandas/core/indexes/accessors.py", line 121, in _delegate_method
    result = method(*args, **kwargs)
  File "/data/mamba/mambaforge/envs/rapids_r4_0/lib/python3.7/site-packages/pandas/core/indexes/datetimes.py", line 274, in tz_convert
    arr = self._data.tz_convert(tz)
  File "/data/mamba/mambaforge/envs/rapids_r4_0/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 851, in tz_convert
    tz = timezones.maybe_get_tz(tz)
  File "pandas/_libs/tslibs/timezones.pyx", line 111, in pandas._libs.tslibs.timezones.maybe_get_tz
  File "pandas/_libs/tslibs/timezones.pyx", line 136, in pandas._libs.tslibs.timezones.maybe_get_tz
  File "/data/mamba/mambaforge/envs/rapids_r4_0/lib/python3.7/site-packages/pytz/__init__.py", line 500, in FixedOffset
    info = _tzinfos.setdefault(offset, _FixedOffset(offset))
  File "/data/mamba/mambaforge/envs/rapids_r4_0/lib/python3.7/site-packages/pytz/__init__.py", line 404, in __init__
    raise ValueError("absolute offset is too large", minutes)
ValueError: ('absolute offset is too large', 26893441067.3)
[Thu Nov  3 15:05:41 2022]
Error in rule phone_battery_python_features:
    jobid: 2622
    output: data/interim/PID01/phone_battery_features/phone_battery_python_rapids.csv
RuleException:
CalledProcessError in line 226 of /data/rapids_covid_data/clean_rapids/rapids/rules/features.smk:
Command 'set -euo pipefail;  /data/mamba/mambaforge/envs/rapids_r4_0/bin/python3.7 /data/rapids_covid_data/clean_rapids/rapids/.snakemake/scripts/tmpengqpnw5.entry.py' returned non-zero exit status 1.
  File "/data/rapids_covid_data/clean_rapids/rapids/rules/features.smk", line 226, in __rule_phone_battery_python_features
  File "/data/mamba/mambaforge/envs/rapids_r4_0/lib/python3.7/concurrent/futures/thread.py", line 57, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2022-11-03T125900.109567.snakemake.log

I have attached CSV files to this post; PID01 is a participant that leads to the corrupted CSV file and ValueError, while PID02 is a participant whose inclusion in the run doesn't cause an error.

Install details: R 4.0.5 and Ubuntu 18.04, on commit d255f2d (July 7th, 2022).

phone_battery_episodes_resampled_PID01.csv phone_battery_episodes_resampled_PID02.csv phone_battery_episodes_resampled_with_datetime_PID01.csv phone_battery_episodes_resampled_with_datetime_PID02.csv
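
For context on the ValueError itself: pytz refuses fixed UTC offsets at or beyond ±1440 minutes (24 hours), so any value far outside that range raises exactly this exception. A minimal, standalone reproduction using only the offending value from the traceback (the interpretation that this value is a misplaced timestamp from the shuffled columns is an assumption at this point, not confirmed):

```python
import pytz

# pytz caps fixed UTC offsets at +/- 1440 minutes (24 hours); larger values
# raise the same ValueError seen in the traceback above. Whether the value
# originates from a shuffled column is an assumption.
try:
    pytz.FixedOffset(26893441067.3)
except ValueError as err:
    print(err)  # ('absolute offset is too large', 26893441067.3)
```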

JulioV commented 1 year ago

Thanks for reporting this, @zacharyfried! Can you attach your config file and the raw battery CSV file for PID01 so we can reproduce the pipeline from the beginning, please?

@Meng6 @jenniferfedor this is related to how we read the episodes file in chunks and save an empty output file if the input file is empty. I'm not sure why the columns are being shuffled when the input file is not empty, though. @jenniferfedor can you try to reproduce the problem on your end once we have the input and config files, please?
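
To make the suspected failure mode concrete, here is a pandas sketch of the pattern, not the actual R code (RAPIDS does the chunked read with readr::read_csv_chunked): if one processed chunk comes back with its columns in a different order and is appended to the output CSV without a header, the values silently land under the wrong names.

```python
import pandas as pd

# Two "processed chunks" with identical columns but different column order.
chunk1 = pd.DataFrame({"episode_id": [0], "battery_level": [96], "battery_status": [2]})
chunk2 = chunk1[["battery_level", "battery_status", "episode_id"]]

# Appending positionally (header=False) reproduces the symptom: row 2's
# battery_level cell now holds what was an episode_id, and so on.
chunk1.to_csv("episodes.csv", index=False)
chunk2.to_csv("episodes.csv", mode="a", header=False, index=False)
print(pd.read_csv("episodes.csv"))
```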

zacharyfried commented 1 year ago

config.txt phone_battery_raw.csv

I used plain text for the config file so that I could upload it here; saving it as YAML makes it usable again.

jenniferfedor commented 1 year ago

Thank you, @zacharyfried! I will look into this.

zacharyfried commented 1 year ago

@jenniferfedor

I was wondering if there are any updates to this issue. Thank you!

jenniferfedor commented 1 year ago

Hi @zacharyfried, thanks again for reporting this issue and apologies for the delay in getting back to you! Unfortunately I haven't been able to reproduce this problem. I'm using Ubuntu 20.04, R 4.0.3, and the latest version of RAPIDS, although I did have to manually update the R packages cli (to version 2.3.0) and pillar (to version 1.7.0) to process these data. (We are hoping to release a fix for some other bugs soon that incorporates updates to our renv.lock file, so hopefully that won't be necessary going forward.)

Looking further at the two phone_battery_episodes_resampled_with_datetime_*.csv files you shared with us, it seems like all of the columns in PID01's file are actually in the same order as the columns in PID02's file, and some of the column names in PID01's file have simply been shifted (e.g., the fourth column in both files contains battery level data but is mislabeled as battery_status in PID01's file). Specifically, in PID01's file the column names device_id through timestamp seem to have been shifted left by one, and local_timezone appears after timestamp rather than at the beginning; local_datetime through assigned_segments are correctly labeled. Oddly, in the phone_battery_episodes_resampled_with_datetime.csv file I get when I process the raw battery data for PID01, the columns are correctly labeled but in a slightly different order (specifically, episode_id and device_id are swapped); the same snippet of rows that was in the original file is attached.

@JulioV and @Meng6, do you have any thoughts or suggestions? I'm wondering if this could potentially be related to how we append processed chunks of data to the output CSV/if columns are somehow in different orders in different chunks? Thanks!

phone_battery_episodes_resampled_with_datetime_PID01_jf.csv
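
For anyone who wants to verify the shift on the attached files, a quick positional comparison of the two headers (file names as attached earlier in this thread):

```python
import pandas as pd

# Read only the headers (nrows=0) and compare column names position by
# position; mismatches show where PID01's names were shifted relative to PID02's.
pid01 = pd.read_csv("phone_battery_episodes_resampled_with_datetime_PID01.csv", nrows=0)
pid02 = pd.read_csv("phone_battery_episodes_resampled_with_datetime_PID02.csv", nrows=0)
for i, (c1, c2) in enumerate(zip(pid01.columns, pid02.columns)):
    if c1 != c2:
        print(f"column {i}: PID01 says '{c1}', PID02 says '{c2}'")
```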

JulioV commented 1 year ago

It could be a problem with an old version of readr (library that implements read_csv_chunked). @jenniferfedor what version of readr are you using after the update?

@zacharyfried what version of readr are you using? If it is not 2.1.3, can you update and try again?

jenniferfedor commented 1 year ago

Thanks, @JulioV! I'm using readr 1.4.0.

zacharyfried commented 1 year ago

Thanks for looking into it, @jenniferfedor. It does seem like those columns are switched around, but I am unsure how to explain the origin of the values in column 3 (labeled battery_level) of the phone_battery_episodes_resampled_with_datetime_PID01.csv file that I uploaded. The values range from 6620 to 6623 and don't seem to be close to anything that would be expected in any of the columns.

@JulioV I was also using readr version 1.4.0. After updating to the newest version of readr, the run hangs at a seemingly random participant on the rule resample_episodes_with_datetime. On each run, this rule completes successfully for multiple participants, with the corresponding output CSV and .done files for that rule appearing normal in those participants' interim data directories. Inspecting the expected output files for the participant the run hangs on shows no .done file and an incomplete phone_battery_episodes_resampled_with_datetime.csv. This CSV file has the expected columns and contains several thousand rows of data, but is cut off at some point: sometimes it has the first 50% of the rows expected given the length of phone_battery_episodes_resampled.csv, while other times it has less than 10% before the run hangs. The ValueError that I described before still occurs if the rule resample_episodes_with_datetime is reached for the group of participants that was causing this error before updating readr.

This hanging also occurs with other behavioral features/providers besides battery. For example, the run hangs in an analogous way during the rule phone_readable_datetime when computing with the Doryab Bluetooth provider. The expected output phone_bluetooth_with_datetime.csv has only a fraction of the expected rows given the length of the input phone_bluetooth_raw.csv, though all columns appear intact. Downgrading to readr 1.4.0 fixes the hang.
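
A quick way to quantify how far the rule got before hanging is to compare row counts between the rule's input and output, assuming (as described above) the output should roughly match the input length; paths are for PID04 as an example:

```python
import pandas as pd

# Compare row counts between the rule's input and its (truncated) output.
resampled = pd.read_csv("data/interim/PID04/phone_battery_episodes_resampled.csv")
with_dt = pd.read_csv("data/interim/PID04/phone_battery_episodes_resampled_with_datetime.csv")
print(f"{len(with_dt)} of {len(resampled)} rows written "
      f"({len(with_dt) / len(resampled):.0%})")
```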

Below is what the output during a hang looks like when attempting to compute the battery behavioral feature with the RAPIDS provider:

[Thu Dec 15 17:23:57 2022]
rule resample_episodes_with_datetime:
    input: data/interim/PID04/phone_battery_episodes_resampled.csv, data/interim/time_segments/PID04_time_segments.csv, data/external/participant_files/PID04.yaml, data/external/multiple_timezones.csv
    output: data/interim/PID04/phone_battery_episodes_resampled_with_datetime.csv, data/interim/PID04/phone_battery_episodes_resampled_with_datetime.done
    jobid: 1553
    wildcards: pid=PID04, sensor=phone_battery
    resources: tmpdir=/tmp

── Attaching packages ──────────────────────────────────────── tidyverse 1.3.0 ──
✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.5
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   2.1.3     ✔ forcats 0.5.0
── Conflicts ─────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Warning message:
replacing previous import ‘lifecycle::last_warnings’ by ‘rlang::last_warnings’ when loading ‘pillar’

Attaching package: ‘lubridate’

The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union

We have also been getting the following readr warning:

Warning message:
The following named parsers don't match the column names: local_date_time, local_start_date_time, local_end_date_time

It seems like there could be an invisible character before the column names, according to this thread (https://stackoverflow.com/questions/42933931/parser-does-not-match-column-name-in-csv-file-when-importing-using-readr-packag), though I am not sure where this would be. We have noticed this readr warning before in runs with other behavioral features/providers, but it doesn't seem to affect the output or the run's ability to complete without errors or hanging. I thought it worth mentioning given the suspicion that readr is related to our error.
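
One way to check for such an invisible character is to look at the raw bytes of the header line (the path below is just an example):

```python
# Inspect the first bytes of the CSV header; a UTF-8 byte-order mark would
# appear as b'\xef\xbb\xbf' before the first column name.
path = "data/interim/PID01/phone_battery_episodes_resampled.csv"
with open(path, "rb") as f:
    header = f.readline()
print(header[:40])
print("BOM present:", header.startswith(b"\xef\xbb\xbf"))
```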

Please let me know if you would like me to upload any specific files. Thank you.

JulioV commented 1 year ago

Thanks for the extra info, I'll debug on my side this week

JulioV commented 1 year ago

@zacharyfried I think this might also be related to time zones because, like @jenniferfedor, I could not reproduce the problem with your config file (set to the EST time zone) and the test battery data for P01. I tried with the default periodic segments (three days, daily, morning, etc.), and all of them extracted battery features between 2021-01-07 and 2021-04-02.

Can you:

  1. Test that running P01 with a single time zone (EST) fixes the crash?
  2. Share data/external/multiple_timezones.csv with us?

BTW, thanks for reporting the warning. It is not producing erroneous data but we just opened an issue to remove it.

zacharyfried commented 1 year ago

@JulioV Thank you for looking into this.

  1. Using a single time zone (EST) fixes the ValueErrors that occurred with two of our participants, including P01.
  2. I have attached the multiple_timezones.csv, though I have randomized the device_id to avoid sharing sensitive data. The device_ids for the participants that were causing the value errors are device_766239 (P01) and device_405669. The rest of the devices/participants do not cause the error. modified_multiple_timezones.csv

I also noted that while the ValueError no longer occurs when using a single time zone, the run still hangs as before when using the most recent readr version (2.1.3). I had to use a single time zone and readr 1.4.0 in order to complete this run successfully.

Lastly, these two participants seem to cause two different errors when attempting to compute location features using either the Doryab or Barnett providers with multiple time zones. As with battery above, these errors do not occur when using a single time zone. Either participant can cause either error, depending on which provider is being computed and which participant is computed first when running RAPIDS.

The error that occurs during rule phone_locations_barnett_daily_features with participants like P01:

[Tue Jan 17 19:49:25 2023]
rule phone_locations_barnett_daily_features:
    input: data/interim/P01/phone_locations_processed_with_datetime.csv, data/interim/time_segments/P01_time_segments_labels.csv
    output: data/interim/P01/phone_locations_barnett_daily.csv
    jobid: 11
    wildcards: pid=P01
    resources: tmpdir=/tmp

Error in max(location$timestamp) - min(location$timestamp) :
  non-numeric argument to binary operator
Calls: barnett_daily_features
Execution halted
[Tue Jan 17 19:49:26 2023]
Error in rule phone_locations_barnett_daily_features:
    jobid: 11
    output: data/interim/P01/phone_locations_barnett_daily.csv

RuleException:
CalledProcessError in line 411 of /data/rapids_covid_data/clean_rapids/rapids/rules/features.smk:
Command 'set -euo pipefail;  Rscript --vanilla /data/rapids_covid_data/clean_rapids/rapids/.snakemake/scripts/tmp3zhigygn.daily_features.R' returned non-zero exit status 1.
  File "/data/rapids_covid_data/clean_rapids/rapids/rules/features.smk", line 411, in __rule_phone_locations_barnett_daily_features
  File "/data/mamba/mambaforge/envs/rapids_r4_0/lib/python3.7/concurrent/futures/thread.py", line 57, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-01-17T194905.195831.snakemake.log

The error that occurs during rule phone_locations_add_doryab_extra_columns with participants like P01:

[Tue Jan 17 20:19:33 2023]
rule phone_locations_add_doryab_extra_columns:
    input: data/interim/P01/phone_locations_processed_with_datetime.csv
    output: data/interim/P01/phone_locations_processed_with_datetime_with_doryab_columns_episodes.csv
    jobid: 1
    wildcards: pid=P01
    resources: tmpdir=/tmp

Traceback (most recent call last):
  File "/data/rapids_covid_data/clean_rapids/rapids/.snakemake/scripts/tmphrhpilal.add_doryab_extra_columns.py", line 125, in <module>
    location_data["duration_in_seconds"] = -1 * location_data.timestamp.diff(-1) / 1000
  File "/data/mamba/mambaforge/envs/rapids_r4_0/lib/python3.7/site-packages/pandas/core/series.py", line 2648, in diff
    result = algorithms.diff(self._values, periods)
  File "/data/mamba/mambaforge/envs/rapids_r4_0/lib/python3.7/site-packages/pandas/core/algorithms.py", line 1711, in diff
    out_arr[res_indexer] = op(arr[res_indexer], arr[lag_indexer])
TypeError: unsupported operand type(s) for -: 'str' and 'str'
[Tue Jan 17 20:19:34 2023]
Error in rule phone_locations_add_doryab_extra_columns:
    jobid: 1
    output: data/interim/P01/phone_locations_processed_with_datetime_with_doryab_columns_episodes.csv

RuleException:
CalledProcessError in line 387 of /data/rapids_covid_data/clean_rapids/rapids/rules/features.smk:
Command 'set -euo pipefail;  /data/mamba/mambaforge/envs/rapids_r4_0/bin/python3.7 /data/rapids_covid_data/clean_rapids/rapids/.snakemake/scripts/tmphrhpilal.add_doryab_extra_columns.py' returned non-zero exit status 1.
  File "/data/rapids_covid_data/clean_rapids/rapids/rules/features.smk", line 387, in __rule_phone_locations_add_doryab_extra_columns
  File "/data/mamba/mambaforge/envs/rapids_r4_0/lib/python3.7/concurrent/futures/thread.py", line 57, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-01-17T201911.140808.snakemake.log

From what I can tell, the multiple_timezones.csv file seems normal for these participants, though it does seem to be causing the error, given that using a single time zone fixes it. Please let me know if there is anything else I can provide. Thank you very much.
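
For what it's worth, the Doryab traceback above reproduces with any timestamp column that was parsed as strings (which would happen if corrupted rows forced the column to object dtype); a minimal sketch with made-up values:

```python
import pandas as pd

# A timestamp column that came in as strings instead of integers.
timestamps = pd.Series(["1610070099848", "1610070100848"])

# Series.diff() attempts elementwise subtraction, so a string-typed column
# raises the same TypeError seen in the traceback above:
# unsupported operand type(s) for -: 'str' and 'str'
timestamps.diff(-1)
```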

zacharyfried commented 1 year ago

There was some loss of precision in the timestamp column of that CSV upload. The file with the correct, full-precision timestamps is below: multi_timezones.csv

JulioV commented 1 year ago

Good news, @zacharyfried! We found the bug. The problem was that we were creating data with columns in the wrong order whenever some sensor data could not be assigned to a time zone in TZCODES_FILE. In your case this happened because P01's timezone entry started at 1613606414017, but your battery data started earlier, at 1610070099848.

You can wait for this PR to be merged, or update your timezone's timestamp for P01. If your participant actually was in a different timezone before 1613606414017, you can instead add a line before the current one in TZCODES_FILE with its timestamp set to 0 or 1610070099847 (note the -1 compared to the first battery timestamp).
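
A sketch of the check this diagnosis implies: every device's earliest data point must be covered by its first timezone entry. This assumes the TZCODES_FILE columns device_id/tzcode/timestamp, and the raw battery file path is just an example:

```python
import pandas as pd

# First timezone entry per device vs. first battery sample per device.
tzcodes = pd.read_csv("data/external/multiple_timezones.csv")
battery = pd.read_csv("phone_battery_raw.csv")

first_tz = tzcodes.groupby("device_id")["timestamp"].min()
first_sample = battery.groupby("device_id")["timestamp"].min()

# Devices with no timezone entry at all are treated as uncovered (inf).
# Any device printed here has sensor data that predates its first timezone
# entry, which is the condition that triggered this bug.
covered_from = first_tz.reindex(first_sample.index, fill_value=float("inf"))
print(first_sample[first_sample < covered_from])
```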

Let us know if we can close this bug report.

zacharyfried commented 1 year ago

This fixed it, thank you very much! I updated the timestamp for P01 and was able to compute battery and locations. I think this bug report can be closed.