carissalow / rapids

Reproducible Analysis Pipeline for Data Streams
http://www.rapids.science/
GNU Affero General Public License v3.0

UnicodeDecodeError in rule phone_applications_foreground_python_features #195

Closed jenniferfedor closed 1 year ago

jenniferfedor commented 1 year ago

Hi @JulioV and @Meng6, when trying to process phone applications foreground features for some participants, I get the following error:

[Wed Oct 19 17:39:19 2022]  
rule phone_applications_foreground_python_features:  
    input: data/raw/1023/phone_applications_foreground_with_datetime_with_categories.csv, data/interim/1023/phone_app_episodes_resampled_with_datetime.csv, data/interim/time_segments/1023_time_segments_labels.csv  
    output: data/interim/1023/phone_applications_foreground_features/phone_applications_foreground_python_rapids.csv  
    jobid: 11  
    wildcards: pid=1023, provider_key=rapids  

RAPIDS: Processing phone_applications_foreground rapids hourly0000  
Traceback (most recent call last):  
  File "pandas/_libs/parsers.pyx", line 1119, in pandas._libs.parsers.TextReader._convert_tokens  
  File "pandas/_libs/parsers.pyx", line 1244, in pandas._libs.parsers.TextReader._convert_with_dtype  
  File "pandas/_libs/parsers.pyx", line 1259, in pandas._libs.parsers.TextReader._string_convert  
  File "pandas/_libs/parsers.pyx", line 1450, in pandas._libs.parsers._string_box_utf8  
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte  

During handling of the above exception, another exception occurred:

Traceback (most recent call last):  
  File "/rapids/.snakemake/scripts/tmpesfg1fnp.entry.py", line 21, in <module>  
    sensor_features = fetch_provider_features(provider, provider_key, sensor_key, sensor_data_files, time_segments_file)  
  File "/rapids/src/features/utils/utils.py", line 109, in fetch_provider_features  
    features = feature_function(sensor_data_files, time_segment, provider, filter_data_by_segment=filter_data_by_segment, chunk_episodes=chunk_episodes)  
  File "src/features/phone_applications_foreground/rapids/main.py", line 116, in rapids_features  
    apps_events_data = pd.read_csv(sensor_data_files["sensor_data"])  
  File "/opt/anaconda3/envs/rapids/lib/python3.7/site-packages/pandas/io/parsers.py", line 688, in read_csv  
    return _read(filepath_or_buffer, kwds)  
  File "/opt/anaconda3/envs/rapids/lib/python3.7/site-packages/pandas/io/parsers.py", line 460, in _read  
    data = parser.read(nrows)  
  File "/opt/anaconda3/envs/rapids/lib/python3.7/site-packages/pandas/io/parsers.py", line 1198, in read  
    ret = self._engine.read(nrows)  
  File "/opt/anaconda3/envs/rapids/lib/python3.7/site-packages/pandas/io/parsers.py", line 2157, in read  
    data = self._reader.read(nrows)  
  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read  
  File "pandas/_libs/parsers.pyx", line 862, in pandas._libs.parsers.TextReader._read_low_memory  
  File "pandas/_libs/parsers.pyx", line 941, in pandas._libs.parsers.TextReader._read_rows  
  File "pandas/_libs/parsers.pyx", line 1073, in pandas._libs.parsers.TextReader._convert_column_data  
  File "pandas/_libs/parsers.pyx", line 1126, in pandas._libs.parsers.TextReader._convert_tokens  
  File "pandas/_libs/parsers.pyx", line 1244, in pandas._libs.parsers.TextReader._convert_with_dtype  
  File "pandas/_libs/parsers.pyx", line 1259, in pandas._libs.parsers.TextReader._string_convert  
  File "pandas/_libs/parsers.pyx", line 1450, in pandas._libs.parsers._string_box_utf8  
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte 

After investigating, it seems that the error occurs with pd.read_csv() on both lines 116 and 124 in src/features/phone_applications_foreground/rapids/main.py and is related to a special character (specifically, "é") in the name of one of the apps used by this participant. Data for participants who did not use this app are processed successfully. Explicitly specifying an encoding (e.g., encoding = "ISO-8859-1") in pd.read_csv() seems to work as a temporary fix.
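The byte reported in the traceback is consistent with this diagnosis: 0xE9 is "é" in ISO-8859-1, whereas in UTF-8 it can only start a multi-byte sequence, so an ordinary ASCII character right after it is rejected. A minimal sketch of that mismatch, using a made-up app name (an assumption for illustration, not taken from the participant's data):

```python
# Hypothetical app name containing "é", stored as ISO-8859-1 (Latin-1) bytes.
raw = "Café Finder".encode("iso-8859-1")  # the "é" becomes the single byte 0xE9

# Decoding those bytes as UTF-8 fails at the 0xE9 byte.
try:
    raw.decode("utf-8")
    error_message = None
except UnicodeDecodeError as err:
    error_message = str(err)

print(error_message)

# Decoding with the encoding the bytes were written in succeeds.
decoded = raw.decode("iso-8859-1")
print(decoded)
```

Passing the same encoding name to pd.read_csv() through its encoding argument is the temporary fix described above. One caveat: ISO-8859-1 maps every possible byte to a character, so it never raises an error, and a file that is really UTF-8 would be read without complaint but with garbled accented characters.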

We've encountered this issue running the latest version of RAPIDS (commit d255f2de) on two machines (macOS Monterey version 12.4 and Ubuntu version 20.04.1).

JulioV commented 1 year ago

@Meng6, a good place to sanitize the input would be the beginning of the main script for this provider, no?

Or we could add the file encoding as a provider parameter.
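A rough sketch of the provider-parameter idea, assuming the provider settings arrive as a dictionary. The FILE_ENCODING key, the read_sensor_rows helper, and the sample row are all hypothetical and not part of RAPIDS; the standard csv module stands in for pd.read_csv() to keep the example self-contained:

```python
import csv
import io


def read_sensor_rows(raw_bytes, provider):
    """Parse CSV bytes using an encoding named in the provider settings."""
    # "FILE_ENCODING" is a hypothetical key; fall back to UTF-8 when it is absent.
    encoding = provider.get("FILE_ENCODING", "utf-8")
    text = raw_bytes.decode(encoding)
    return list(csv.DictReader(io.StringIO(text)))


# Made-up provider settings and a made-up one-row file written as ISO-8859-1.
provider = {"FILE_ENCODING": "iso-8859-1"}
raw_bytes = "timestamp,application_name\n1666200000000,Café Finder\n".encode("iso-8859-1")

rows = read_sensor_rows(raw_bytes, provider)
print(rows[0]["application_name"])
```

With this shape, datasets that are already valid UTF-8 need no configuration change, and only the affected dataset has to name a different encoding.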

Meng6 commented 1 year ago

Hi @JulioV, besides the locations you mentioned above, is it possible to handle it in an R script, for example in the pull_phone_data rule?

JulioV commented 1 year ago

Hard-coding the encoding is OK for processing this particular dataset.

The medium-term solution is to use a mutation script that forces the problematic columns of the app foreground provider into UTF-8 with stringi::stri_enc_toutf8. We can publish this fix.
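The mutation script itself would be written in R. As a rough Python analogue of the same idea (not a port of stringi::stri_enc_toutf8, whose exact behavior is not reproduced here), the sketch below keeps values that are already valid UTF-8 and re-decodes the rest with a fallback encoding. The helper name, the ISO-8859-1 fallback, and the sample app names are assumptions:

```python
def to_utf8_text(raw, fallback="iso-8859-1"):
    """Decode bytes as UTF-8, falling back to another encoding if they are invalid."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: values that are not valid UTF-8 were written as ISO-8859-1.
        return raw.decode(fallback)


# Made-up application_name values: valid UTF-8, ISO-8859-1, and plain ASCII.
names = [b"Caf\xc3\xa9 Finder", b"Caf\xe9 Finder", b"Maps"]

cleaned = [to_utf8_text(name) for name in names]
utf8_bytes = [text.encode("utf-8") for text in cleaned]
print(cleaned)
```

Trying UTF-8 first matters here: decoding everything as ISO-8859-1 would also succeed, but it would turn correctly encoded UTF-8 names into mojibake.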

The long-term solution is to move away from CSV to Feather or Parquet files, but this will take more work and time.

jenniferfedor commented 1 year ago

Thank you both! @JulioV, for the medium-term solution you suggested, would I create a new mutation script in src/data/streams/mutations/phone/aware and update the aware_*/format.yaml files in src/data/streams? Also, I know we don't currently have providers for phone applications crashes or notifications features, but given that those sensors also have the problematic application_name column, should we implement this fix for them as well in addition to phone applications foreground?

JulioV commented 1 year ago

Yeah, that's the right location for the script and the edits to the format.yaml files. We don't need to implement this fix for the other application sensors for now. When we create providers for them, we can always update the streams.

jenniferfedor commented 1 year ago

Sounds good. Thank you, @JulioV!