electricitymaps / electricitymaps-contrib

A real-time visualisation of the CO2 emissions of electricity consumption
https://app.electricitymaps.com
GNU Affero General Public License v3.0

Chile inconsistency between live and historical data acquisition #2539

Closed corradio closed 4 years ago

corradio commented 4 years ago

Our systems use historical data to train forecasts. When run in production, those forecasts require live data. Therefore, the two datafeeds need to be consistent. I think the Chile parser uses two different datafeeds that aren't consistent, and thus our forecasts might be inconsistent. If in doubt, we should only implement the real-time feed to avoid our database having both data mixed.

jarek commented 4 years ago

Can you post an example to let us know how inconsistent the data is? Maybe there's a bug in one of the parsers?

corradio commented 4 years ago

> Can you post an example to let us know how inconsistent the data is? Maybe there's a bug in one of the parsers?

I don't know for sure that the data is different, but from looking at the code, it uses two different URLs so potentially the data is different. I created the issue to get some certainty around this.

pierresegonne commented 4 years ago

The following comparison of live and historical data uses live data from 2020-06-10 04:00:00 to 2020-06-18 06:10:00 (a bit short, but as the parsers were down, that's the best we can do right now) and historical data from 2020-05-27 00:00:00 to 2020-06-13 14:00:00 (roughly covering the data missing from the DB).

It is important to note first that the historical data reports production for wind, solar, geothermal, hydro and unknown (a mix of fossil sources), while the live data only reports wind, solar and unknown. Therefore, if both APIs are kept, their production types must be standardised.

Box Plots
Historical Data: [figure: hist_bp_new]
Live Data: [figure: live_bp]

Time Series
Historical Data: [figure: hist_ts_new]
Live Data: [figure: live_ts]

The overall production mean is shown in black. The inconsistency in the leftmost part of the live data time series is due to a change in time scale (from one data point every 10 minutes to one every hour), but I don't know why this change is present in the first place.

Basic Moments

Historical Data

| Production Type | Mean | Std |
| --- | --- | --- |
| All | 8743.18 | 856.93 |
| Wind | 512.44 | 211.90 |
| Solar | 578.69 | 722.71 |
| Unknown | 7652.05 | 774.66 |

Live Data

| Production Type | Mean | Std |
| --- | --- | --- |
| All | 8461.54 | 940.50 |
| Wind | 790.85 | 266.34 |
| Solar | 413.46 | 583.30 |
| Unknown | 7257.22 | 917.03 |

So my conclusion would be that the sources are consistent if hydro and geothermal are aggregated into unknown for the historical data.
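As a rough illustration of that aggregation, here is a minimal sketch. The helper name `harmonise_production`, the `LIVE_MODES` set and the sample values are hypothetical; this is not the Chile parser's actual code.

```python
from typing import Dict

# Modes reported by the live feed, per the comparison above (assumed naming).
LIVE_MODES = {"wind", "solar", "unknown"}


def harmonise_production(production: Dict[str, float]) -> Dict[str, float]:
    """Fold every mode the live feed lacks (e.g. hydro, geothermal) into 'unknown'."""
    harmonised = {mode: 0.0 for mode in LIVE_MODES}
    for mode, value in production.items():
        target = mode if mode in LIVE_MODES else "unknown"
        harmonised[target] += value
    return harmonised


# Illustrative values only, not taken from either feed.
historical_point = {
    "wind": 500.0,
    "solar": 600.0,
    "hydro": 1700.0,
    "geothermal": 20.0,
    "unknown": 5900.0,
}
print(harmonise_production(historical_point))
```

With this shape, both feeds would expose the same three keys, so the forecasts would see one consistent set of production types.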

For reference, here is the box plot for all historical production types: [figure: hist_bp]

It might be worth investigating this change in time scale, though I would assume it is not really a problem for the forecasting, since interpolation is applied. Is that right, @corradio?

corradio commented 4 years ago

Indeed, I don't see any problems, as the data pipeline makes sure everything is aligned on an hourly basis. I think the solution of lumping hydro and geothermal into unknown seems acceptable. However, the emission factor for unknown assumes fossil fuel (700 g/kWh if I remember correctly). Do you know how big the hydro and geothermal part is compared to the rest of unknown (probably thermal load)? If hydro+geo is substantial compared to the thermal part, we might need to readjust the emission factor of unknown for Chile based on an average breakdown.
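To make the hourly alignment concrete, here is a minimal sketch that averages readings within each clock hour. It is an assumption for illustration only: the function name `to_hourly_means` and the sample readings are hypothetical, and the actual pipeline may align data differently (for example by interpolation).

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean
from typing import Dict, List, Tuple


def to_hourly_means(points: List[Tuple[datetime, float]]) -> Dict[datetime, float]:
    """Average all readings that fall within the same clock hour."""
    buckets: Dict[datetime, List[float]] = defaultdict(list)
    for timestamp, value in points:
        # Truncate the timestamp to the start of its hour.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


# Illustrative readings only: six 10-minute points, then a single hourly point.
readings = [(datetime(2020, 6, 10, 4, m), 100.0 + m) for m in range(0, 60, 10)]
readings.append((datetime(2020, 6, 10, 5, 0), 200.0))
hourly = to_hourly_means(readings)
print(hourly)
```

Under this scheme, a feed that switches from 10-minute to hourly resolution still yields one value per hour.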

pierresegonne commented 4 years ago

OK, if we base our re-evaluation of the emission factors on averages, I think we should base our computation on capacidad_y_generación_2020_5 (1).xlsx (that I found here), which gives month-by-month averages of production by type.

The production averages are for 2019 + 2020 up to April:

| Type | Avg | % Total | % unknown |
| --- | --- | --- | --- |
| Hydro | 1709 | 26.4 | 30.9 |
| Other Fossil Fuel | 190 | 2.9 | 3.4 |
| Oil | 28 | 0.4 | 0.5 |
| Coal | 2361 | 36.4 | 42.7 |
| Natural Gas | 1227 | 18.9 | 22.2 |
| Wind | 396 | 6.1 | - |
| Solar | 548 | 8.5 | - |
| Geothermal | 18 | 0.3 | 0.3 |

Consequently, I suggest using the following emission factors (in g/kWh):

- other fossil: 650
- oil: 650
- coal: 820
- gas: 490
- geothermal: 38
- hydro: 24

and computing the emission factor of unknown as their average, weighted by each type's share of unknown, which gives 491 g/kWh.
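For anyone who wants to check the arithmetic, here is a minimal sketch of the weighted average, using the production averages and the suggested factors from this comment. The variable names are illustrative.

```python
# Average production by type (2019 + 2020 up to April), from the table above.
# Wind and solar are excluded because they are not part of "unknown".
avg_production = {
    "hydro": 1709,
    "other fossil": 190,
    "oil": 28,
    "coal": 2361,
    "gas": 1227,
    "geothermal": 18,
}

# Suggested emission factors in g/kWh.
emission_factors = {
    "hydro": 24,
    "other fossil": 650,
    "oil": 650,
    "coal": 820,
    "gas": 490,
    "geothermal": 38,
}

total_unknown = sum(avg_production.values())

# Share of each type within "unknown", in percent (the "% unknown" column).
shares = {mode: 100 * value / total_unknown for mode, value in avg_production.items()}

# Production-weighted average emission factor for "unknown".
unknown_factor = (
    sum(avg_production[mode] * emission_factors[mode] for mode in avg_production)
    / total_unknown
)

print({mode: round(share, 1) for mode, share in shares.items()})
print(round(unknown_factor, 1))  # 491.7
```

Weighting by the average production directly is equivalent to weighting by the "% unknown" column, but avoids the rounding in the percentages.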

I will submit the PR that solves this issue soon.