carissalow / rapids

Reproducible Analysis Pipeline for Data Streams
http://www.rapids.science/
GNU Affero General Public License v3.0

Optimize memory usage in readable_datetime.R script #181

Closed: Meng6 closed this 2 years ago

Meng6 commented 2 years ago

Hi @JulioV, I updated our readable_datetime.R script to reduce memory usage. The previous script required a very large amount of memory when processing a large table (e.g., accelerometer data). Instead of loading the whole raw CSV file into memory, the updated script loads and processes the file in chunks (10k rows at a time).

Besides reading the file in chunks, there are two other main updates in the readable_datetime.R script:

  1. Sorting by the timestamp column is dropped, since it is already done while pulling phone/wearable data.
  2. A flag file is added to the output. Since a file is processed (read, add datetime, write) in chunks, the _with_datetime.csv file is created with the first chunk and new rows are appended to it for the remaining chunks.
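
For illustration, here is a minimal sketch of that chunked read-and-append pattern, assuming readr's read_csv_chunked, a millisecond timestamp column, and hypothetical file paths and timezone (the actual script derives these from the pipeline configuration and handles timezones more carefully):

```r
library(readr)
library(lubridate)

# Hypothetical paths and chunk size for illustration only
input_csv  <- "data/raw/p01/phone_accelerometer_raw.csv"
output_csv <- "data/raw/p01/phone_accelerometer_raw_with_datetime.csv"
chunk_size <- 10000  # rows processed per chunk

add_readable_datetime <- function(chunk, pos) {
  # Convert the millisecond Unix timestamp to a readable local datetime string
  chunk$local_date_time <- format(
    as_datetime(chunk$timestamp / 1000, tz = "America/New_York"),
    "%Y-%m-%d %H:%M:%S")
  # The first chunk (pos == 1) creates the output file with a header;
  # every later chunk appends rows without repeating the header
  write_csv(chunk, output_csv, append = (pos != 1), col_names = (pos == 1))
}

read_csv_chunked(input_csv,
                 callback = SideEffectChunkCallback$new(add_readable_datetime),
                 chunk_size = chunk_size)
```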
JulioV commented 2 years ago

Thanks Meng! I have a couple of comments in my review

Meng6 commented 2 years ago

Hi @JulioV, thanks so much for your suggestions and comments! I updated the readable_datetime.R script and replied to some of your comments in the review (see above).

Can we also check that we order by timestamp when we pull the data independently of the data stream?

I am afraid not. Since we read the file in chunks, we can only check that it is ordered by timestamp within each chunk; we cannot be sure about the order across chunks.
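
For example, a per-chunk check could only assert the chunk's internal order (a hypothetical sketch, not part of the current script):

```r
# Raises an error if the rows inside this chunk are not in timestamp order;
# it cannot detect a chunk whose timestamps precede those of an earlier chunk
check_chunk_sorted <- function(chunk, pos) {
  if (is.unsorted(chunk$timestamp)) {
    stop("Rows ", pos, "-", pos + nrow(chunk) - 1, " are not sorted by timestamp")
  }
}
```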

JulioV commented 2 years ago

I'm not sure we can rely on sorted chunks only. Some of our feature code assumes the input is fully sorted, doesn't it? Also, all the feature code and the data yield functions load the entire input CSV into memory; is the idea to change that code too?

JulioV commented 2 years ago

I just remembered that we sort the entire sensor data CSV in memory when we pull it. It should be OK to merge this, but we should keep in mind that we might need to switch from CSV to SQLite or Spark if we want to fully support larger-than-memory datasets throughout the entire pipeline.

Meng6 commented 2 years ago

Hi @JulioV, sorry for the late response; I missed your message. There are four types of input for the readable_datetime.R script, and all of them are sorted:

  1. raw sensor data: we sort the entire sensor data CSV in memory when we pull it.
  2. phone_yielded_timestamps.csv: this table is sorted by the timestamp column.
  3. phone_locations_processed.csv: this table is sorted by the timestamp column.
  4. resampled data: rows are duplicated based on the nrow column, which preserves the order of the episodes data (sketched below).
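
As a hypothetical illustration of point 4, duplicating rows with a per-row count keeps the original row order, e.g. with tidyr::uncount (the column names here are assumptions, not the actual schema):

```r
library(tidyr)
library(tibble)

# Toy episodes table: one row per episode plus an nrow count column
episodes <- tibble(episode_id = c(1, 2),
                   timestamp  = c(100, 200),
                   nrow       = c(3, 2))

# uncount() repeats each row `nrow` times and preserves the input row order,
# so the resampled output stays sorted as long as the episodes data is sorted
resampled <- uncount(episodes, weights = nrow)
```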

I think we can merge for now and switch from CSV to SQLite or Spark in the future, as you suggested.

JulioV commented 2 years ago

Sounds good