NREL / Wattile

Deep Learning-based Forecasting of Building Energy Consumption
BSD 3-Clause "New" or "Revised" License

synthetic data to include more incompleteness #258

Open JanghyunJK opened 1 year ago

stephen-frank commented 1 year ago

Would it be useful for me (or @haneslinger) to semi-randomize the Synthetic Data data set to address this issue? What I am thinking is...

  1. Copy the Synthetic Data data set
  2. Keep only 1 day
  3. For each predictor, and the target, create a vector.
  4. For each vector, randomize as follows:
     a. 20% of data: remove
     b. 20% of data: leave as is
     c. 60% of data: modify the timestamp by a random number of seconds in the interval (-60, 60)
  5. Resort vectors by time
  6. Recombine into a data frame sorted by time with time index
  7. Export as CSV to replace the original predictor and target CSV files

I could do this in SkySpark pretty easily, or I imagine Hannah or you could do it in Python pretty easily. Either way, I imagine we would do it once and then persist the result as "Messy Synthetic Data" or something like that.
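A rough Python sketch of the steps above, assuming a pandas DataFrame with a DatetimeIndex; the column names, random seed, and output filename are all made up for illustration, not Wattile code:

```python
# Rough sketch of steps 1-7 above, assuming a pandas DataFrame with a
# DatetimeIndex; column names and the output filename are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Stand-in for the Synthetic Data set, already trimmed to 1 day (steps 1-2).
idx = pd.date_range("2023-01-01", periods=1440, freq="1min")
df = pd.DataFrame({"predictor_1": rng.normal(size=1440),
                   "target": rng.normal(size=1440)}, index=idx)

messy_cols = {}
for col in df.columns:                      # step 3: one vector per column
    s = df[col]
    draw = rng.random(len(s))               # step 4: partition the rows
    keep = s[draw >= 0.2].copy()            # 4a: drop ~20%
    jitter = draw[draw >= 0.2] >= 0.4       # 4b: ~20% untouched; 4c: ~60% jittered
    offsets = rng.integers(-60, 61, size=int(jitter.sum()))
    new_index = keep.index.to_numpy().copy()
    new_index[jitter] = (keep.index[jitter]
                         + pd.to_timedelta(offsets, unit="s")).to_numpy()
    keep.index = pd.DatetimeIndex(new_index)
    keep = keep[~keep.index.duplicated()]   # jitter can collide on a 1-min grid
    messy_cols[col] = keep.sort_index()     # step 5: resort by time

# Step 6: recombine on the union of the now-irregular timestamps.
messy = pd.concat(messy_cols, axis=1).sort_index()
messy.to_csv("messy_synthetic_data.csv")    # step 7: hypothetical filename
```

Because each column is thinned and jittered independently, the recombined frame naturally ends up with NaNs and irregular timestamps, which is the point of the exercise.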

JanghyunJK commented 1 year ago

Something I noticed while working on the data edge processing is that it would be useful to have data where processing with different combinations of left/right closed and left/right label actually makes a difference in the output. So, to me, it seemed easier to have manual/customized data (or a CSV) for the edge testing. I haven't looked in detail yet, but I'm assuming what you added in the recent PR reflects that type of data. The random data I quickly made while trying changes to the data edge processing is here: https://github.com/NREL/Wattile/blob/48c97e8c378cfaad270b38fb46e95a03402ae427/notebooks/exploratory/create_random_timeseries_data.ipynb
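As a concrete illustration of that edge sensitivity (generic pandas, not Wattile's processing code): the same series resampled with different `closed`/`label` settings produces different bins, including a different number of bins at the edges.

```python
# Same data, different closed/label settings -> different bins at the edges.
import pandas as pd

s = pd.Series([1, 2, 3, 4],
              index=pd.date_range("2023-01-01 00:00", periods=4, freq="15min"))

# Intervals [00:00, 00:30) and [00:30, 01:00), labeled by their left edge.
left = s.resample("30min", closed="left", label="left").sum()

# Intervals (23:30, 00:00], (00:00, 00:30], (00:30, 01:00],
# labeled by their right edge -- note the extra edge bin.
right = s.resample("30min", closed="right", label="right").sum()

print(left)   # two bins: 3, 7
print(right)  # three bins: 1, 5, 4
```

The left-closed version yields two bins while the right-closed version yields three, which is exactly the kind of difference edge-case test data needs to expose.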

stephen-frank commented 1 year ago

Ok, yeah, my small test data set is designed to do that, so #263 may be a duplicate of this. My thought with the "messy" synthetic data set described above was more to test the whole workflow on a data set with missing data and irregular timestamps, but I don't know whether that is useful or not.
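If it helps decide, here is a quick, hypothetical check (the function name and thresholds are made up) for the two properties the messy data set would exercise: irregular spacing and data missing relative to an expected grid.

```python
# Hypothetical helper to characterize a time index: is it evenly spaced,
# and how much data is missing relative to an expected regular grid?
import pandas as pd

def describe_irregularity(index: pd.DatetimeIndex, freq: str = "1min"):
    """Return (is_regular, fraction_missing) vs. an expected grid."""
    deltas = index.to_series().diff().dropna()
    is_regular = deltas.nunique() == 1          # one unique spacing = regular
    expected = pd.date_range(index.min(), index.max(), freq=freq)
    fraction_missing = 1 - len(index.unique()) / len(expected)
    return is_regular, fraction_missing

regular = pd.date_range("2023-01-01", periods=5, freq="1min")
irregular = pd.DatetimeIndex(["2023-01-01 00:00:10",
                              "2023-01-01 00:01:00",
                              "2023-01-01 00:03:00"])

print(describe_irregularity(regular))
print(describe_irregularity(irregular))
```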