Closed corradio closed 4 years ago
Can you post an example to let us know how inconsistent the data is? Maybe there's a bug in one of the parsers?
I don't know for sure that the data is different, but from looking at the code, it uses two different URLs so potentially the data is different. I created the issue to get some certainty around this.
The following comparison of live vs historical data relies on live data spanning the timeframe `2020-06-10 04:00:00` to `2020-06-18 06:10:00` (a bit short, but as the parsers were down, that's the best we can do right now) and on historical data spanning the timeframe `2020-05-27 00:00:00` to `2020-06-13 14:00:00` (roughly spanning the missing data in the DB).
It is first important to note that the historical data reports production for `wind`, `solar`, `geothermal`, `hydro` and `unknown` (a mix of fossil sources), while the live data only reports `wind`, `solar` and `unknown`. Therefore, if both APIs are kept, the production types must be standardised between them.
[Figure: box plots, historical data vs live data]
[Figure: time series, historical data vs live data]
The overall production mean is shown in black. The inconsistency in the leftmost part of the live data time series is due to a change of time scale (from one data point every 10 minutes to one every hour), but I don't know why this change is present in the first place.
Basic Moments

Historical Data
Production Type | Mean | Std |
---|---|---|
All | 8743.18 | 856.93 |
Wind | 512.44 | 211.90 |
Solar | 578.69 | 722.71 |
Unknown | 7652.05 | 774.66 |
Live Data
Production Type | Mean | Std |
---|---|---|
All | 8461.54 | 940.50 |
Wind | 790.85 | 266.34 |
Solar | 413.46 | 583.30 |
Unknown | 7257.22 | 917.03 |
So my conclusion would be that the sources are consistent if hydro and geothermal are aggregated into unknown for the historical data.
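The aggregation proposed above can be sketched as follows. This is only a minimal illustration, not the parser's actual code: the function name `lump_into_unknown`, the constant `MODES_TO_LUMP`, and the shape of the production dictionary are assumptions made for the example.

```python
from typing import Dict

# Assumption: these are the production types that only the historical feed reports.
MODES_TO_LUMP = ("hydro", "geothermal")


def lump_into_unknown(production: Dict[str, float]) -> Dict[str, float]:
    """Return a copy of `production` with hydro and geothermal folded into 'unknown'."""
    # Keep every production type that is not being lumped.
    result = {mode: value for mode, value in production.items()
              if mode not in MODES_TO_LUMP}
    # Add the lumped production on top of whatever 'unknown' already holds.
    lumped = sum(production.get(mode, 0.0) for mode in MODES_TO_LUMP)
    result["unknown"] = result.get("unknown", 0.0) + lumped
    return result


# Hypothetical values, for illustration only.
historical_point = {"wind": 500.0, "solar": 600.0, "hydro": 1700.0,
                    "geothermal": 20.0, "unknown": 5900.0}
print(lump_into_unknown(historical_point))
```

After this step, a historical data point has the same three keys (`wind`, `solar`, `unknown`) as a live one, so the two feeds can be stored and compared on the same schema.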
For reference, here is the box plot for all historical production types.
It might be worth investigating this shift in time scale, though I would assume it is not really a problem for the forecasting, since the data is interpolated. Is that right, @corradio?
Indeed I don't see any problems as the data pipeline makes sure everything is aligned on an hourly basis. I think the solution to lump together hydro and geothermal into unknown seems acceptable. However, the emission factor for unknown assumes fossil fuel (700g/kWh if I remember correctly). Do you know how big the hydro and geothermal part is compared to the other part of the unknown (probably thermal load)? We might need to readjust the emission factor of unknown for Chile based on an average breakdown in case hydro+geo is substantial compared to the other thermal unknown.
OK, if we base our re-evaluation of the emission factors on averages, I think we should base our computation on `capacidad_y_generación_2020_5 (1).xlsx` (that I found here), which gives yearly (month-by-month) averages of production by type.
The production averages are for 2019 + 2020 up to April.

Type | Avg | % Total | % unknown |
---|---|---|---|
Hydro | 1709 | 26.4 | 30.9 |
Other Fossil Fuel | 190 | 2.9 | 3.4 |
Oil | 28 | 0.4 | 0.5 |
Coal | 2361 | 36.4 | 42.7 |
Natural Gas | 1227 | 18.9 | 22.2 |
Wind | 396 | 6.1 | - |
Solar | 548 | 8.5 | - |
Geothermal | 18 | 0.3 | 0.3 |
Consequently, I suggest using the following emission factors (in g/kWh): other fossil: 650, oil: 650, coal: 820, gas: 490, geothermal: 38, hydro: 24. Weighting each factor by its share of the unknown production gives an emission factor of 491g/kWh for unknown.
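For anyone who wants to check the number, here is a small sketch of the weighted average, using the average production values from the table and the emission factors suggested above. The names `AVG_PRODUCTION`, `EMISSION_FACTORS` and `weighted_emission_factor` are made up for this example and do not come from the codebase.

```python
# Average production per type (from the table above), restricted to the
# types that end up in 'unknown' (wind and solar are reported separately).
AVG_PRODUCTION = {
    "hydro": 1709,
    "other fossil": 190,
    "oil": 28,
    "coal": 2361,
    "gas": 1227,
    "geothermal": 18,
}

# Suggested emission factors, assumed to be in g/kWh.
EMISSION_FACTORS = {
    "hydro": 24,
    "other fossil": 650,
    "oil": 650,
    "coal": 820,
    "gas": 490,
    "geothermal": 38,
}


def weighted_emission_factor(production, factors):
    """Production-weighted mean of the emission factors."""
    total = sum(production.values())
    return sum(production[mode] * factors[mode] for mode in production) / total


unknown_factor = weighted_emission_factor(AVG_PRODUCTION, EMISSION_FACTORS)
print(f"{unknown_factor:.1f} g/kWh")  # about 491.7, in line with the 491g/kWh quoted above
```

Weighting by the raw averages rather than by the rounded "% unknown" column avoids accumulating rounding error, but both approaches land on roughly the same value.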
I will submit a PR that solves this issue soon.
Our systems use historical data to train forecasts. When run in production, those forecasts require live data. Therefore, the two data feeds need to be consistent. I think the Chile parser uses two different data feeds that aren't consistent, and thus our forecasts might be inconsistent. If in doubt, we should only implement the real-time feed to avoid mixing data from both feeds in our database.