JanghyunJK opened this issue 1 year ago
Would it be useful for me (or @haneslinger) to semi-randomize the synthetic data set to address this issue? What I am thinking is...
I could do this in SkySpark pretty easily, or I imagine Hannah or you could do it in Python pretty easily. Either way, I imagine we would do it once and then persist the result as "Messy Synthetic Data" or something like that.
Something I noticed while working on data edge processing is that it would be useful to have data where combinations of left/right `closed` and left/right `label` actually make a difference in the output. So it seemed easier to me to have manually customized data (or a CSV) for the edge testing. I haven't looked in detail yet, but I'm assuming the data you added in the recent PR reflects that type of data. Some random data I quickly made while trying changes to the data edge processing is here: https://github.com/NREL/Wattile/blob/48c97e8c378cfaad270b38fb46e95a03402ae427/notebooks/exploratory/create_random_timeseries_data.ipynb
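For what it's worth, here is a minimal sketch (not taken from that notebook) of why the test data needs to expose edge handling: with pandas `resample`, the same series aggregates to different bins and labels depending on the `closed`/`label` combination, so a useful test set is one where these outputs actually differ.

```python
import pandas as pd

idx = pd.date_range("2019-01-01 09:00:00", periods=4, freq="1min")
s = pd.Series([1, 2, 3, 4], index=idx)

# Bins are (start, end] and labeled by the right edge.
right = s.resample("2min", closed="right", label="right").sum()

# Bins are [start, end) and labeled by the left edge (pandas default).
left = s.resample("2min", closed="left", label="left").sum()

print(right)  # 09:00 -> 1, 09:02 -> 5, 09:04 -> 4
print(left)   # 09:00 -> 3, 09:02 -> 7
```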
Ok, yeah, my small test data set is designed to do that, so #263 may be a duplicate of this. My thought with the "messy" synthetic data set described above was more to test the whole workflow on a data set with missing data and irregular timestamps, but I don't know whether that is useful or not.
Noticing that most of the code development leverages the current synthetic data (or dummy data), which might not capture some important aspects of data processing, especially data edge considerations (e.g., right-closed, right label, backward window, etc.).
I thought it'd be much better if our synthetic data were more incomplete, so that it includes:
- timestamps varying down to seconds
- irregular measurement intervals
So, something like going from:

```
2019-01-01 09:00:00 value1
2019-01-01 09:01:00 value2
2019-01-01 09:02:00 value3
2019-01-01 09:03:00 value4
```

to:

```
2019-01-01 09:01:32 value1
2019-01-01 09:03:03 value2
2019-01-01 09:04:13 value4
```
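As a rough illustration of how we might generate and persist that messy set once, here is a minimal sketch. It assumes we start from a clean 1-minute series like the "from" example above; the column name, jitter range, drop rate, and output file name are all illustrative, not taken from the repo.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the messy set is reproducible

# Clean, regular 1-minute synthetic series (stand-in for the current dummy data).
idx = pd.date_range("2019-01-01 09:00:00", periods=100, freq="1min")
df = pd.DataFrame({"value": rng.normal(size=len(idx))}, index=idx)

# 1. Jitter each timestamp by up to +/-45 seconds so times vary down to seconds.
jitter = pd.to_timedelta(rng.integers(-45, 46, size=len(df)), unit="s")
df.index = df.index + jitter
df = df.sort_index()

# 2. Drop a random ~10% of rows to create irregular intervals and gaps.
keep = rng.random(len(df)) > 0.10
df = df[keep]

# Persist once, e.g. as the proposed "Messy Synthetic Data" set.
df.to_csv("messy_synthetic_data.csv")
```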