carbonfirst / CarbonCast

A system to predict hourly carbon intensity in the electrical grids using machine learning. CarbonCast provides average carbon intensity forecasts for up to 96 hours.
Apache License 2.0

Data Collection #6

Closed KonstantinosChasiotis closed 4 days ago

KonstantinosChasiotis commented 2 months ago

Hello, I have a query regarding the data collection process for this project. Specifically, when examining files like DE_lifecycle_emissions.csv, I noticed that you have recorded electricity generation per production type for every hour since 2020. However, upon reviewing the data you've collected and those available on the ENTSOE platform, disparities emerged. For instance, on January 1, 2020, at 00:00:00 UTC for Germany, the biomass electricity generation in your CSV is noted as 22,597, whereas the aggregated figure on the ENTSOE platform totals 19,555 (4,896 + 4,877 + 4,886 + 4,896) for that hour.


After reading your paper and the GitHub documentation, I could not work out how these data were gathered. I am therefore curious whether any processing steps were applied to arrive at values such as 22,597, given the discrepancies with the values recorded on the ENTSOE platform. I observed similar differences for all the other production types, so I was wondering if you could clarify how to reproduce the same values as you have.

Kind regards, Konstantinos Chasiotis

diptyaroop commented 3 weeks ago

Hi Konstantinos,

Apologies for the late reply.

For biomass, we aggregate the numbers in the "biomass" and "waste" columns. Hence, it is around 22k & not 19k as you have observed. In general, we follow this file to aggregate the sources (lines 96-107). So, biomass = biomass + waste, wind = wind offshore + wind onshore, etc.
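The aggregation rule described above (biomass = biomass + waste, wind = wind offshore + wind onshore) could be sketched roughly as follows. This is only an illustration and not the project's actual script: the mapping name `AGGREGATION`, the helper `aggregate_sources`, the column labels, and the sample numbers are all assumptions made for the example.

```python
# Hypothetical mapping from raw ENTSOE production types to aggregated sources.
# Only the biomass and wind groupings are taken from the reply above;
# the exact column labels are illustrative assumptions.
AGGREGATION = {
    "biomass": ["biomass", "waste"],
    "wind": ["wind offshore", "wind onshore"],
}


def aggregate_sources(row, mapping=AGGREGATION):
    """Sum the raw production-type values of one hourly row into aggregated sources."""
    return {
        target: sum(row.get(column, 0.0) for column in columns)
        for target, columns in mapping.items()
    }


# Made-up values for illustration only (not real ENTSOE data).
example_row = {
    "biomass": 100.0,
    "waste": 20.0,
    "wind offshore": 30.0,
    "wind onshore": 50.0,
}
aggregated = aggregate_sources(example_row)
print(aggregated)
```

Missing columns are treated as zero here, which is a design choice for the sketch; a real pipeline may prefer to flag missing production types instead.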

Even after that, there are still minor discrepancies (for example, now that I calculate, I see that the numbers add up to 22819 & not 22597). I think the following may be the reason: For our initial version of CarbonCast (v2.1), we actually modified this script from Electricity Maps, which fetches the data from ENTSOE. Either the data has been slightly updated since we fetched it, or there may have been some minor bug in our aggregation script that caused this discrepancy. However, the discrepancy is very small (e.g., 22597 vs 22819).
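To put the size of that gap in perspective, the two biomass figures quoted above can be compared directly. The snippet below is a small sketch; the helper name `relative_difference` is hypothetical, and only the two numbers come from this thread.

```python
def relative_difference(stored, recomputed):
    """Return the absolute gap between two values as a fraction of the recomputed one."""
    return abs(recomputed - stored) / recomputed


stored_value = 22597      # value in DE_lifecycle_emissions.csv, as quoted in this thread
recomputed_value = 22819  # biomass + waste recomputed from the ENTSOE platform

absolute_gap = recomputed_value - stored_value
relative_gap = relative_difference(stored_value, recomputed_value)
print(f"absolute gap: {absolute_gap}, relative gap: {relative_gap:.2%}")
```

The gap works out to 222 units, a little under 1% of the recomputed value.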

If you look at our newer branches (e.g., v3.1), we have shifted from using that script to using our own parser for collecting data from ENTSOE (refer to src/entsoeParser.py). The collected data is in the EU_DATA/ folder, & it is consistent with what's currently available on the platform (e.g., biomass is 22819 for 1st Jan 2020). The performance of the models is similar to what we reported in our original paper.

I hope this clarifies your doubt. If you have any other questions, please feel free to email me at dmaji@cs.umass.edu

diptyaroop commented 4 days ago

Issues resolved.