Add Marktstammdatenregister (MaStR)

lkstrp commented 3 months ago

Closes #16

Change proposed in this Pull Request

Adds Marktstammdatenregister via open-MaStR.

There are a few issues:

open-mastr provides a bulk download of all the cleaned datasets on zenodo. But as a .zip, so we have to download everything. We could use the API instead, but then the user has to pass a token.
These datasets are huge with many small power plants. I have now filtered out all plants with a capacity of less than 1 MW. Otherwise powerplant.aggregate_units() takes too long. Solar and wind are also currently not included.
- Performance can probably be improved, but I think the main bottleneck is Duke rather than our side.
Validation is not done yet; I am waiting for the ENTSO-E token to run compare-with-entsoe-stats.py, but below is a first plot.

Dataset file name	Number of entries	Entries with less than 1 MW capacity
_biomass.csv	22284	21240 (95.32%)
_combustion.csv	85424	81776 (95.73%)
_nuclear.csv	6	0 (0.00%)
_hydro.csv	8657	7859 (90.78%)
_wind.csv	34798	6729 (19.34%)
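
The shares in the table come down to a simple capacity filter. A minimal sketch, assuming capacities are already expressed in MW; the helper names and the `capacity_mw` key are hypothetical and not taken from open-MaStR or powerplantmatching:

```python
def small_unit_share(capacities_mw, threshold_mw=1.0):
    """Return (number of units below the threshold, their share in percent)."""
    capacities = list(capacities_mw)
    below = sum(1 for c in capacities if c < threshold_mw)
    share = 100.0 * below / len(capacities) if capacities else 0.0
    return below, share


def drop_small_units(rows, capacity_key="capacity_mw", threshold_mw=1.0):
    """Keep only rows (dicts) whose capacity is at least the threshold."""
    return [row for row in rows if float(row[capacity_key]) >= threshold_mw]


# Toy data for illustration, not real MaStR entries.
units = [
    {"name": "a", "capacity_mw": 0.03},
    {"name": "b", "capacity_mw": 0.5},
    {"name": "c", "capacity_mw": 1.0},
    {"name": "d", "capacity_mw": 250.0},
]
below, share = small_unit_share(u["capacity_mw"] for u in units)
kept = drop_small_units(units)
print(below, share, [u["name"] for u in kept])
```

Filtering before aggregation keeps the number of units passed on to the matching step small, which is the motivation given above for the 1 MW cut-off.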

[Image: output]

Type of change

[x] New feature (non-breaking change which adds functionality)

Checklist

[ ] I have added a note to release notes doc/release_notes.rst.
[ ] I have used pre-commit run --all to lint/format/check my contribution
[ ] I have documented the effects of my code changes in the documentation doc/.
[ ] I have adjusted the docstrings in the code appropriately.

FlorianK13 commented 1 month ago

Hi @lkstrp and other devs from powerplantmatching, I'm one of the developers of open-mastr. I like your work on harmonizing different sources into one European dataset. If there are issues on your side that are relevant for open-mastr development, I'm happy to discuss them.

One remark on your comment above: "We could use the API instead, but then the user has to pass a token." This is not really a good idea. With the API you are limited to a small number of requests per day, so retrieving large amounts of data with it takes a long time. You could, however, run the bulk download to get an sqlite or postgres database and extract the relevant information from there.

from open_mastr import Mastr

db = Mastr()
db.download()
# if you want csv files then also run
db.to_csv()

lkstrp commented 1 month ago

Hey @FlorianK13, Thanks for reaching out!

So far the idea was basically to just use the Zenodo download you provide, which takes quite a while to download.

from open_mastr import Mastr

db = Mastr()
db.download()
# if you want csv files then also run
db.to_csv()

Does this approach have any advantages over the Zenodo download? E.g. does it run faster or allow downloading only selected data? The API reference reads like it downloads the same zip in bulk, but allows data selection. Which means it downloads everything and just strips away unselected data?

FlorianK13 commented 1 month ago

When using the Python download method, you will get the most recent data (from the day before). On Zenodo you will get the data from our last update, which is a few months old. However, with Zenodo your code is reproducible, whereas the Python download changes every day because the dataset from BNetzA changes every day. To achieve reproducibility with Python, you would need to specify date="existing" (Reference) after you have downloaded the dataset once, so that you use your existing local dataset from there on.

Both approaches take rather long, as you need to download the whole dataset. Afterwards you can specify which data you are interested in parsing. So you are right with your last sentence 'Which means it downloads everything and just strips away unselected data.'
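
The reproducibility point can be sketched as a small wrapper. This assumes `Mastr.download()` accepts a `data` list of technology names alongside the `date="existing"` option mentioned above; the wrapper name `load_mastr` and the chosen technology list are illustrative only, and the open-mastr documentation should be checked for the exact signature:

```python
def load_mastr(data=("biomass", "combustion", "hydro", "nuclear"), reuse_local=False):
    """Run the open-mastr bulk download, optionally pinned to the local copy."""
    # Imported lazily so this module can be loaded without open-mastr installed.
    from open_mastr import Mastr

    db = Mastr()
    kwargs = {"data": list(data)}  # assumed: restricts which tables are parsed
    if reuse_local:
        # Reuse the bulk download already on disk instead of fetching the
        # newest daily export, so repeated runs see the same data.
        kwargs["date"] = "existing"
    db.download(**kwargs)
    return db
```

A first run would call `load_mastr()` to fetch the current export; later runs would call `load_mastr(reuse_local=True)` to stay on that snapshot. Either way the full archive is downloaded once, and the selection only affects what is parsed afterwards.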

fneum commented 3 weeks ago

open-mastr provides a bulk download of all the cleaned datasets on zenodo. But as a .zip, so we have to download everything. We could use the API instead, but then the user has to pass a token.

Based on the discussion above, let's take the Zenodo releases. If they are updated at least annually, that's fine. I am also not too worried about the large download size, as updating is usually not a frequent action and the download is cached locally as well. @FlorianK13, one option for upcoming releases would be to upload the individual CSV files unzipped to the Zenodo repository, which would allow selective downloads (even though you lose the ZIP compression). This could be in addition to the ZIP.
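
A cached download with selective extraction could look roughly like this. It is a sketch using only the standard library; the function names, the cache path and the URL passed in are placeholders, not the actual Zenodo record or powerplantmatching's own caching code:

```python
import urllib.request
import zipfile
from pathlib import Path


def fetch_cached(url, cache_path):
    """Download url to cache_path unless a cached copy already exists."""
    cache_path = Path(cache_path)
    if not cache_path.exists():
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(cache_path))
    return cache_path


def extract_selected(zip_path, suffixes, out_dir):
    """Extract only archive members whose names end with one of suffixes."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(tuple(suffixes)):
                archive.extract(name, out_dir)
                extracted.append(name)
    return extracted
```

With the file suffixes from the table above, a call such as `extract_selected(fetch_cached(url, "cache/mastr.zip"), ["_biomass.csv", "_hydro.csv"], "cache/csv")` would unpack only those two files. The whole ZIP is still transferred once; only the extraction is selective.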

These datasets are huge with many small power plants. I have now filtered out all plants with a capacity of less than 1 MW. Otherwise powerplant.aggregate_units() takes too long. Solar and wind are also currently not included.

Yes, that's also what Global Energy Monitor does. Perhaps they will also integrate open-MaStR, then we wouldn't have to.

Validation is not done yet; I am waiting for the ENTSO-E token to run compare-with-entsoe-stats.py, but below is a first plot.

I requested one today and got it the same day.

FlorianK13 commented 2 weeks ago

@fneum I created https://github.com/OpenEnergyPlatform/open-MaStR/issues/558 to discuss whether we can upload single files to Zenodo.

PyPSA / powerplantmatching