BTrDB duplicate datapoints with same index/MDAL issue ?

SoftwareDefinedBuildings / XBOS

The eXtensible Building Operating System

BSD 2-Clause "Simplified" License


BTrDB duplicate datapoints with same index/MDAL issue ? #56

Closed marcopritoni closed 6 years ago

marcopritoni commented 6 years ago

I just noticed that Green Button data from XBOS, downloaded as mdal.RAW, has multiple (all?) points with the same timestamp (and value). I'm not sure whether the issue is in BTrDB or MDAL. I would imagine we do not want multiple points with the same timestamp.

Example:

    {'4d95d5ce-de62-3449-bd58-4dcad75b526d':
    2017-01-01 00:00:00-08:00    1.6395
    2017-01-01 00:00:00-08:00    1.6395
    2017-01-01 00:15:00-08:00    0.9959
    2017-01-01 00:15:00-08:00    0.9959
    2017-01-01 00:30:00-08:00    1.6222
    2017-01-01 00:30:00-08:00    1.6222
    2017-01-01 00:45:00-08:00    1.6374
    2017-01-01 00:45:00-08:00    1.6374
    ...
    }

I need to download this as raw because it's energy (kWh), not power: the readings in each window should be summed, and the existing statistical aggregation functions (mean, max, min, count) do not support that.

gtfierro commented 6 years ago

I'm pretty sure this is because some of the data points were written twice while we were developing the Green Button data ingester. You should be able to get rid of the duplicates when you insert the data into a DataFrame (in which case https://github.com/SoftwareDefinedBuildings/XBOS/issues/55 should fix it).

immesys commented 6 years ago

Sum is mean times count

marcopritoni commented 6 years ago

"Sum is mean times count"

Not if you have duplicated data points (same index, same data). E.g.:

    2017-01-01 00:00:00-08:00    1120.0
    2017-01-01 00:00:00-08:00    1120.0
    ... (100 times)

MDAL mean (15min): 1120.0 kWh

MDAL count (15min): 100

pandas sum (15min): 112,000 kWh (not correct)

No way of knowing from pandas which points are duplicated.
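The arithmetic above can be reproduced with a small pandas sketch. It builds a hypothetical stream holding the single 1120.0 kWh reading 100 times (the variable names and the in-memory series are assumptions for illustration, not MDAL output) and shows that mean times count matches the naive sum, so both are inflated by the same factor.

```python
import pandas as pd

# Hypothetical stream: one 15-minute reading of 1120.0 kWh written 100 times.
ts = pd.Timestamp("2017-01-01 00:00:00-08:00")
raw = pd.Series([1120.0] * 100, index=pd.DatetimeIndex([ts] * 100))

window = raw.resample("15min")
mean = window.mean().iloc[0]
count = window.count().iloc[0]
total = window.sum().iloc[0]

# mean * count reproduces the naive sum, so it is inflated in the same way.
naive_sum = mean * count

# Number of distinct timestamps: the count a duplicate-free stream would give.
unique_count = raw.index.nunique()

print(mean, count, total, naive_sum, unique_count)
```

The mean is unaffected because every duplicate carries the same value; only count and sum change.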

Downloading the raw data with MDAL and doing all this in pandas has produced another issue that we are looking into.

gtfierro commented 6 years ago

You can definitely drop duplicated rows with the same index in pandas. In our scenario, we aren't going to have different values for the same index (timestamp), so this strategy should work.
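A minimal sketch of that strategy, using the values from the example earlier in this thread. It assumes the raw stream is already a pandas Series indexed by timestamp and that duplicated timestamps always carry the same value, as stated above; the variable names are illustrative.

```python
import pandas as pd

# Readings copied from the example in this issue; each timestamp appears twice.
idx = pd.to_datetime([
    "2017-01-01 00:00:00-08:00", "2017-01-01 00:00:00-08:00",
    "2017-01-01 00:15:00-08:00", "2017-01-01 00:15:00-08:00",
    "2017-01-01 00:30:00-08:00", "2017-01-01 00:30:00-08:00",
    "2017-01-01 00:45:00-08:00", "2017-01-01 00:45:00-08:00",
])
raw = pd.Series(
    [1.6395, 1.6395, 0.9959, 0.9959, 1.6222, 1.6222, 1.6374, 1.6374],
    index=idx,
)

# Keep only the first row for each timestamp.
deduped = raw[~raw.index.duplicated(keep="first")]

# Energy (kWh) is summed per window; here one 60-minute window.
hourly_kwh = deduped.resample("60min").sum()
naive_hourly_kwh = raw.resample("60min").sum()

print(hourly_kwh.iloc[0], naive_hourly_kwh.iloc[0])
```

If a stream could hold different values under one timestamp, `keep="first"` would silently pick one of them, so that case would need a separate check before dropping rows.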

@marcopritoni you should also make a note of which streams have duplicate points so we can clean them up later.
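One way to build that list could look like the sketch below. It assumes the downloaded data is available as a dict mapping stream UUID to a timestamp-indexed pandas Series, which is a guess based on the example printed earlier in this thread; `streams_with_duplicates` and the second stream key are hypothetical names.

```python
import pandas as pd

def streams_with_duplicates(streams):
    """Return {uuid: number of repeated timestamps} for streams that have any."""
    report = {}
    for uuid, series in streams.items():
        # duplicated() marks every occurrence of a timestamp after the first.
        n_dupes = int(series.index.duplicated(keep="first").sum())
        if n_dupes:
            report[uuid] = n_dupes
    return report

# Illustrative input: one stream with a repeated timestamp, one without.
streams = {
    "4d95d5ce-de62-3449-bd58-4dcad75b526d": pd.Series(
        [1.6395, 1.6395, 0.9959],
        index=pd.to_datetime([
            "2017-01-01 00:00:00-08:00",
            "2017-01-01 00:00:00-08:00",
            "2017-01-01 00:15:00-08:00",
        ]),
    ),
    "hypothetical-clean-stream": pd.Series(
        [1.0, 2.0],
        index=pd.to_datetime([
            "2017-01-01 00:00:00-08:00",
            "2017-01-01 00:15:00-08:00",
        ]),
    ),
}

report = streams_with_duplicates(streams)
print(report)
```

The resulting dict can be written out with `pd.Series(report).to_csv(...)` to produce a list for tracking cleanup.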

marcopritoni commented 6 years ago

Sure, I can make a list. Do you want me to add it here or keep it offline?

gtfierro commented 6 years ago

Offline would be better; thanks! It's pretty easy to fix the streams to remove duplicates (I already have 90% of the script done). Maybe you could make a spreadsheet and add the streams there so I can mark them off when they're done? Shoot me an email.

gtfierro commented 6 years ago

https://github.com/SoftwareDefinedBuildings/XBOS/issues/58 should help