Thanks, Meng! I have a couple of comments in my review.
Hi @JulioV, thanks so much for your suggestions and comments! I updated the readable_datetime.R script and replied to some of your comments in the review (see above).
Can we also check that the data is ordered by timestamp when we pull it, independently of the data stream?
I am afraid not. Since we read the file in chunks, we can only verify that rows are ordered by timestamp within each chunk; we cannot guarantee the order across chunks. A toy illustration of this follows.
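To make the limitation concrete, here is a toy R example (not from the pipeline): two chunks that each pass a per-chunk ordering check can still be unsorted once concatenated, so per-chunk checks cannot prove the whole file is ordered.

```r
# Toy example: each chunk passes an internal ordering check, yet the
# concatenated data is not globally sorted by timestamp.
chunk1 <- c(100, 200, 300)                    # sorted within the chunk
chunk2 <- c(150, 250, 350)                    # also sorted within the chunk
!is.unsorted(chunk1) && !is.unsorted(chunk2)  # TRUE: both per-chunk checks pass
is.unsorted(c(chunk1, chunk2))                # TRUE: the combined data is unsorted
```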
I'm not sure we can rely on sorted chunks alone. Some of our feature code assumes the input is fully sorted, doesn't it? Also, all the feature code and the data yield functions load the entire input CSV into memory; is the idea to change that code too?
I just remembered that we sort the entire sensor data CSV in memory when we pull it (a rough sketch of such a sort is below). It should be OK to merge this, but we should keep in mind that we might need to switch from CSV to SQLite or Spark if we want to fully support larger-than-memory datasets throughout the entire pipeline.
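For reference, a minimal sketch of what that in-memory sort might look like, assuming a data.table-based load, a `timestamp` column, and placeholder file names; the pipeline's actual pull code may differ:

```r
# Minimal sketch (assumed, not the pipeline's actual pull code): load the
# entire sensor CSV into memory, sort it by timestamp, and write it back out.
library(data.table)

sensor <- fread("sensor_data.csv")          # reads the whole file into memory
setorder(sensor, timestamp)                 # in-place ascending sort by timestamp
fwrite(sensor, "sensor_data_sorted.csv")
```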
Hi @JulioV, sorry for the late response; I missed your message. There are four types of input for the readable_datetime.R script, and all of them are sorted:
I think we can merge for now and switch from CSV to SQLite or Spark in the future, as you suggested.
Sounds good
Hi @JulioV, I updated our readable_datetime.R script to reduce memory usage. The previous script required a very large amount of memory when processing a large table (e.g., accelerometer data). Instead of loading the whole raw CSV file into memory, the updated script loads and processes the file in chunks (10k rows at a time); a rough sketch of this approach follows.
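As an illustration of the chunked approach, here is a minimal sketch assuming readr's chunked reader, a millisecond `timestamp` column, and hypothetical input/output file names; the actual readable_datetime.R logic is more involved:

```r
# Sketch only: stream the raw CSV in 10k-row chunks, derive a readable
# datetime per chunk, and append results so memory use stays bounded.
library(readr)

process_chunk <- function(chunk, pos) {
  # Assumes "timestamp" holds Unix epoch milliseconds.
  chunk$readable_datetime <- as.POSIXct(chunk$timestamp / 1000,
                                        origin = "1970-01-01", tz = "UTC")
  # Write the header only for the first chunk (pos == 1), append afterwards.
  write_csv(chunk, "sensor_data_with_datetime.csv", append = pos > 1)
}

read_csv_chunked(
  "sensor_data.csv",
  callback = SideEffectChunkCallback$new(process_chunk),
  chunk_size = 10000   # process 10k rows at a time, as described above
)
```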
Besides reading the file in chunks, there are two main updates in the readable_datetime.R script: