Closed marcopritoni closed 6 years ago
I'm pretty sure this is because some of the data points were written twice while we were developing the Green Button data ingester. You should be able to drop the duplicates when you load the data into a DataFrame (in which case https://github.com/SoftwareDefinedBuildings/XBOS/issues/55 should fix it).
Sum is mean times count
"Sum is mean times count"
Not if you have duplicated data points (same index, same data). E.g.

2017-01-01 00:00:00-08:00    1120.0
2017-01-01 00:00:00-08:00    1120.0
... (100 times)

MDAL mean (15min): 1120.0 kWh
MDAL count (15min): 100
pandas sum (15min): 112,000 kWh (not correct)

No way of knowing from pandas which points are duplicated.
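To make the arithmetic above concrete, here is a minimal pandas sketch. It assumes the raw stream arrives as a Series indexed by timestamp and reuses the illustrative numbers from this comment (one 1120.0 kWh reading stored 100 times); it does not call MDAL at all.

```python
import pandas as pd

# One 15-minute reading that was written 100 times (illustrative numbers from the comment)
ts = pd.Timestamp("2017-01-01 00:00:00-08:00")
raw = pd.Series([1120.0] * 100, index=pd.DatetimeIndex([ts] * 100))

window = raw.resample("15min")
mean_15 = window.mean().iloc[0]    # unaffected by identical duplicates
count_15 = window.count().iloc[0]  # counts every stored copy
sum_15 = window.sum().iloc[0]      # adds every stored copy

print(mean_15, count_15, sum_15)
```

The mean stays at 1120.0 because the copies are identical, but both the sum and mean times count come out 100 times too large for an interval that really holds a single reading.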
Downloading the raw data with MDAL and doing all this in pandas has produced another issue that we are looking into.
You can definitely drop duplicated rows with the same index in pandas. In our scenario, we aren't going to have different values for the same index (timestamp), so this strategy should work.
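A minimal sketch of that strategy, assuming each stream is a pandas Series indexed by timestamp. The helper name `drop_duplicate_timestamps` is made up for illustration, and the guard against conflicting values is an extra safety check rather than something this thread requires.

```python
import pandas as pd

def drop_duplicate_timestamps(series: pd.Series) -> pd.Series:
    """Keep the first reading per timestamp (hypothetical helper)."""
    # The strategy assumes duplicates carry the same value; fail loudly if not.
    conflicting = series.groupby(level=0).nunique() > 1
    if conflicting.any():
        raise ValueError("same timestamp with different values")
    # Index.duplicated marks every repeat of a timestamp after its first occurrence
    return series[~series.index.duplicated(keep="first")]

ts = pd.Timestamp("2017-01-01 00:00:00-08:00")
raw = pd.Series([1120.0, 1120.0, 1120.0], index=pd.DatetimeIndex([ts] * 3))
clean = drop_duplicate_timestamps(raw)
print(len(raw), len(clean))
```

Filtering on `index.duplicated` only looks at the timestamp, which is why the value check runs first: it confirms that dropping the repeats cannot discard a different reading.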
@marcopritoni you should also make a note of which streams have duplicate points so we can clean them up later.
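One way to build that list, sketched under the assumption that the raw download is a dict mapping stream UUID to a timestamp-indexed pandas Series (the shape shown in the original report). The function name, the placeholder keys `stream-a` and `stream-b`, and the values are all invented for the example.

```python
import pandas as pd

def count_duplicate_timestamps(streams: dict) -> dict:
    """Map each affected stream key to its number of repeated timestamps (hypothetical helper)."""
    report = {}
    for key, series in streams.items():
        n_dupes = int(series.index.duplicated(keep="first").sum())
        if n_dupes:
            report[key] = n_dupes
    return report

# Placeholder data: "stream-a" has every point stored twice, "stream-b" is clean
dup_idx = pd.to_datetime(["2017-01-01 00:00:00-08:00"] * 2 + ["2017-01-01 00:15:00-08:00"] * 2)
ok_idx = pd.to_datetime(["2017-01-01 00:00:00-08:00", "2017-01-01 00:15:00-08:00"])
streams = {
    "stream-a": pd.Series([1.0, 1.0, 2.0, 2.0], index=dup_idx),
    "stream-b": pd.Series([1.0, 2.0], index=ok_idx),
}
report = count_duplicate_timestamps(streams)
print(report)
```

The resulting dict could be written out with `pd.Series(report).to_csv(...)` to seed a spreadsheet of streams to clean up.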
Sure I can make a list. Do you want me to add it here or keep it offline?
Offline would be better; thanks! It's pretty easy to fix the streams to remove duplicates (I already have 90% of the script done). Maybe you could make a spreadsheet and add the streams there so I can mark them off when they're done? Shoot me an email.
Just noticed that Green Button data downloaded from XBOS as mdal.RAW has multiple (all?) points with the same timestamp (and value). Not sure if the issue is in BTrDB or MDAL. I would imagine we do not want to have the same timestamp for multiple points.
Example:

{'4d95d5ce-de62-3449-bd58-4dcad75b526d':
2017-01-01 00:00:00-08:00    1.6395
2017-01-01 00:00:00-08:00    1.6395
2017-01-01 00:15:00-08:00    0.9959
2017-01-01 00:15:00-08:00    0.9959
2017-01-01 00:30:00-08:00    1.6222
2017-01-01 00:30:00-08:00    1.6222
2017-01-01 00:45:00-08:00    1.6374
2017-01-01 00:45:00-08:00    1.6374
...
}

I need to download this as raw because it's energy (kWh), not power: each reading should be summed, and the existing stats aggregation functions (mean, max, min, count) do not support that.
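For reference, a minimal pandas sketch of the summing step on the eight rows shown above (the elided rest of the stream is not included). It assumes the raw stream has already been loaded into a Series; no MDAL call is shown, and the variable names are illustrative.

```python
import pandas as pd

# The eight raw rows from the example: each 15-minute reading appears twice
stamps = [
    "2017-01-01 00:00:00-08:00", "2017-01-01 00:00:00-08:00",
    "2017-01-01 00:15:00-08:00", "2017-01-01 00:15:00-08:00",
    "2017-01-01 00:30:00-08:00", "2017-01-01 00:30:00-08:00",
    "2017-01-01 00:45:00-08:00", "2017-01-01 00:45:00-08:00",
]
values = [1.6395, 1.6395, 0.9959, 0.9959, 1.6222, 1.6222, 1.6374, 1.6374]
raw = pd.Series(values, index=pd.to_datetime(stamps))

# Summing the raw points double-counts every interval
hourly_raw = raw.resample("60min").sum().iloc[0]

# Keeping one point per timestamp before summing gives the energy for the hour
deduped = raw[~raw.index.duplicated(keep="first")]
hourly_kwh = deduped.resample("60min").sum().iloc[0]

print(hourly_raw, hourly_kwh)
```

On these rows the raw hourly sum is twice the deduplicated one, which is the kind of inflation that mean, max, min, and count would not reveal when the copies are identical.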