NREL / Wattile

Deep Learning-based Forecasting of Building Energy Consumption
BSD 3-Clause "New" or "Revised" License

synthetic data to include more incompleteness #258

Open JanghyunJK opened 1 year ago

stephen-frank commented 1 year ago

Would it be useful for me (or @haneslinger) to semi-randomize the Synthetic Data data set to address this issue? What I am thinking is...

  1. Copy the Synthetic Data data set
  2. Keep only 1 day
  3. For each predictor, and the target, create a vector.
  4. For each vector, randomize as follows:
     a. 20% of data: remove
     b. 20% of data: leave as is
     c. 60% of data: modify the timestamp by a random number of seconds in the interval (-60, 60)
  5. Resort vectors by time
  6. Recombine into a data frame sorted by time with time index
  7. Export as CSV to replace the original predictor and target CSV files

I could do this in SkySpark pretty easily, or I imagine Hannah or you could do it in Python pretty easily. Either way, I imagine we would do it once and then persist the result as "Messy Synthetic Data" or something like that.
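A rough Python sketch of the steps above, assuming a pandas DataFrame with a DatetimeIndex; the column names, random seed, and output filename are all made up for illustration, not Wattile code:

```python
# Rough sketch of steps 1-7 above, assuming a pandas DataFrame with a
# DatetimeIndex; column names and the output filename are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Stand-in for the Synthetic Data set, already trimmed to 1 day (steps 1-2).
idx = pd.date_range("2023-01-01", periods=1440, freq="1min")
df = pd.DataFrame({"predictor_1": rng.normal(size=1440),
                   "target": rng.normal(size=1440)}, index=idx)

messy_cols = {}
for col in df.columns:                      # step 3: one vector per column
    s = df[col]
    draw = rng.random(len(s))               # step 4: partition the rows
    keep = s[draw >= 0.2].copy()            # 4a: drop ~20%
    jitter = draw[draw >= 0.2] >= 0.4       # 4b: ~20% untouched; 4c: ~60% jittered
    offsets = rng.integers(-60, 61, size=int(jitter.sum()))
    new_index = keep.index.to_numpy().copy()
    new_index[jitter] = (keep.index[jitter]
                         + pd.to_timedelta(offsets, unit="s")).to_numpy()
    keep.index = pd.DatetimeIndex(new_index)
    keep = keep[~keep.index.duplicated()]   # jitter can collide on a 1-min grid
    messy_cols[col] = keep.sort_index()     # step 5: resort by time

# Step 6: recombine on the union of the now-irregular timestamps.
messy = pd.concat(messy_cols, axis=1).sort_index()
messy.to_csv("messy_synthetic_data.csv")    # step 7: hypothetical filename
```

Because each column is thinned and jittered independently, the recombined frame naturally ends up with NaNs and irregular timestamps, which is the point of the exercise.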

JanghyunJK commented 1 year ago

Something I noticed while working on the data edge processing is that it would be useful to have data where processing with different combinations of left/right closed and left/right label actually makes a difference in the output. So, to me, it seemed easier to have manual/customized data (or a CSV) for the edge testing. I haven't looked in detail yet, but I'm assuming what you added in the recent PR reflects that type of data. The random data I quickly made while trying changes to the data edge processing is here: https://github.com/NREL/Wattile/blob/48c97e8c378cfaad270b38fb46e95a03402ae427/notebooks/exploratory/create_random_timeseries_data.ipynb
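As a concrete illustration of that edge sensitivity (generic pandas, not Wattile's processing code): the same series resampled with different `closed`/`label` settings produces different bins, including a different number of bins at the edges.

```python
# Same data, different closed/label settings -> different bins at the edges.
import pandas as pd

s = pd.Series([1, 2, 3, 4],
              index=pd.date_range("2023-01-01 00:00", periods=4, freq="15min"))

# Intervals [00:00, 00:30) and [00:30, 01:00), labeled by their left edge.
left = s.resample("30min", closed="left", label="left").sum()

# Intervals (23:30, 00:00], (00:00, 00:30], (00:30, 01:00],
# labeled by their right edge -- note the extra edge bin.
right = s.resample("30min", closed="right", label="right").sum()

print(left)   # two bins: 3, 7
print(right)  # three bins: 1, 5, 4
```

The left-closed version yields two bins while the right-closed version yields three, which is exactly the kind of difference edge-case test data needs to expose.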

stephen-frank commented 1 year ago

Ok, yeah, my small test data set is designed to do that, so #263 may be a duplicate of this. My thought with the "messy" synthetic data set described above was more to test the whole workflow on a data set with missing data and irregular timestamps, but I don't know whether that is useful or not.
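If it helps decide, here is a quick, hypothetical check (the function name and thresholds are made up) for the two properties the messy data set would exercise: irregular spacing and data missing relative to an expected grid.

```python
# Hypothetical helper to characterize a time index: is it evenly spaced,
# and how much data is missing relative to an expected regular grid?
import pandas as pd

def describe_irregularity(index: pd.DatetimeIndex, freq: str = "1min"):
    """Return (is_regular, fraction_missing) vs. an expected grid."""
    deltas = index.to_series().diff().dropna()
    is_regular = deltas.nunique() == 1          # one unique spacing = regular
    expected = pd.date_range(index.min(), index.max(), freq=freq)
    fraction_missing = 1 - len(index.unique()) / len(expected)
    return is_regular, fraction_missing

regular = pd.date_range("2023-01-01", periods=5, freq="1min")
irregular = pd.DatetimeIndex(["2023-01-01 00:00:10",
                              "2023-01-01 00:01:00",
                              "2023-01-01 00:03:00"])

print(describe_irregularity(regular))
print(describe_irregularity(irregular))
```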