Closed jenniferfedor closed 1 year ago
@Meng6 a good place to sanitize the input would be the beginning of the main script for this provider, no?
Or we could add the file encoding as a provider parameter
Hi @JulioV, besides the locations you mentioned above, is it possible to handle it in an R script, for example in the pull_phone_data rule?
Hard-coding the encoding is OK for processing this particular dataset.
The medium-term solution is to use a mutation script to force the problematic columns of the app foreground provider into UTF-8 with stringi::stri_enc_toutf8. We can publish this fix.
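To illustrate the idea behind forcing a column into UTF-8, here is a minimal Python sketch. It is not the R mutation script itself (that would use stringi::stri_enc_toutf8); the helper name to_utf8_text, the sample app name, and the ISO-8859-1 fallback are all assumptions made for illustration.

```python
def to_utf8_text(raw: bytes, fallback: str = "ISO-8859-1") -> str:
    """Decode bytes as UTF-8; if that fails, decode with the fallback encoding.

    Hypothetical helper: the fallback encoding is an assumption about how
    the problematic values were originally written.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


# Hypothetical application_name values in mixed encodings.
names = [
    "Café Finder".encode("ISO-8859-1"),  # legacy single-byte encoding
    "Café Finder".encode("utf-8"),       # already valid UTF-8
    b"Maps",                             # plain ASCII
]

# After cleaning, every value is a regular Python string that can be
# written back out as UTF-8.
cleaned = [to_utf8_text(n) for n in names]
```

Trying UTF-8 first means values that are already valid are left untouched, and only the ones that fail to decode go through the fallback.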
The long-term solution is to move away from CSV to Feather or Parquet files, but this will take more work and time.
Thank you both! @JulioV, for the medium-term solution you suggested, would I create a new mutation script in src/data/streams/mutations/phone/aware and update the aware_*/format.yaml files in src/data/streams? Also, I know we don't currently have providers for phone applications crashes or notifications features, but given that those sensors also have the problematic application_name column, should we implement this fix for them as well, in addition to phone applications foreground?
Yeah, that's the right location for the script and the edits to the format.yaml files. We don't need to implement this fix for the other application sensors for now. When we create providers for them, we can always update the streams.
Sounds good. Thank you, @JulioV!
Hi @JulioV and @Meng6, when trying to process phone applications foreground features for some participants, I get the following error:
After investigating, it seems that the error occurs with pd.read_csv() on both lines 116 and 124 in src/features/phone_applications_foreground/rapids/main.py and is related to a special character (specifically, "é") in the name of one of the apps used by this participant. Data for participants who did not use this app are processed successfully. Explicitly specifying an encoding (e.g., encoding = "ISO-8859-1") in pd.read_csv() seems to work as a temporary fix.
We've encountered this issue running the latest version of RAPIDS (commit d255f2de) on two machines (macOS Monterey version 12.4 and Ubuntu version 20.04.1).
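A minimal sketch of why this kind of failure can happen, using only the Python standard library so it runs without the participant data. The sample rows, the app name, and the read_rows helper are hypothetical, and the sketch assumes the file was written as ISO-8859-1; pd.read_csv() takes the same encoding names through its encoding argument.

```python
import csv
import io

# Hypothetical export: the app name contains "é" and the bytes are
# ISO-8859-1, not UTF-8 (an assumption made for this sketch).
raw = "timestamp,application_name\n1654000000000,Café Finder\n".encode("ISO-8859-1")


def read_rows(data: bytes, encoding: str) -> list:
    """Decode CSV bytes with the given encoding and return rows as dicts."""
    text = data.decode(encoding)
    return list(csv.DictReader(io.StringIO(text)))


try:
    read_rows(raw, "utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    # In ISO-8859-1, "é" is the single byte 0xE9, which does not form a
    # valid UTF-8 sequence when followed by an ASCII character.
    utf8_failed = True

# Decoding with the encoding the file was actually written in succeeds.
rows = read_rows(raw, "ISO-8859-1")
```

This would also be consistent with participants who never used the app being unaffected: without the "é" byte, the data is plain ASCII and decodes identically under either encoding.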