EnergieID / entsoe-py

Python client for the ENTSO-E API (European Network of Transmission System Operators for Electricity)
MIT License

'year_limited' is skipping (sometimes?) the first row of the frames, leading to a missing row in the final results #363

Open sdementen opened 3 days ago

sdementen commented 3 days ago

To reproduce the bug

import pandas
from entsoe import EntsoePandasClient

client = EntsoePandasClient(api_key="397870bf-afdc-4422-ba9e-2d8ef803fa2a")  # API key from Sebastien de Menten (GFJ138), use with care
client.session.verify = False
df = client.query_day_ahead_prices("FR", start=pandas.Timestamp("2022", tz="CET"), end=pandas.Timestamp("2023/02", tz="CET"))
print(df["2022-12-30T22":"2022-12-31T02"])

outputs

2022-12-30 22:00:00+01:00    4.11
2022-12-30 23:00:00+01:00    4.11
2022-12-31 01:00:00+01:00    0.13
2022-12-31 02:00:00+01:00    0.06

where we are missing 2022-12-31 00:00:00+01:00

This is caused by the filter at https://github.com/EnergieID/entsoe-py/blob/master/entsoe/decorators.py#L139, which may skip the first row of the second and subsequent queries. It could be replaced by the following mask, which explicitly filters out only the index values that would overlap with the previous frame:

                        interval_mask = (
                            (frame.index <= _end)
                            & (frame.index > frames[-1].index[-1])
                        )
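
The difference between the two masks can be reproduced offline with synthetic data. This is a minimal sketch, not the library's code: `filter_and_concat`, the block boundaries and the values are made up for illustration, and the "current" branch assumes the decorator applies `frame.index > _start` to every frame after the first, as described in this thread.

```python
import pandas as pd

TZ = "Europe/Paris"


def filter_and_concat(raw_frames, blocks, use_previous_frame):
    """Hypothetical helper mimicking the per-block filtering, then concatenating.

    raw_frames: list of Series as if returned by the API for each block.
    blocks: list of (_start, _end) timestamps, one pair per block.
    """
    frames = []
    for frame, (_start, _end) in zip(raw_frames, blocks):
        if not frames:
            # First frame: only the upper bound is enforced.
            mask = frame.index <= _end
        elif use_previous_frame:
            # Proposed mask: drop only rows already present in the previous frame.
            mask = (frame.index <= _end) & (frame.index > frames[-1].index[-1])
        else:
            # Assumed current mask: drop everything at or before the block start.
            mask = (frame.index <= _end) & (frame.index > _start)
        frames.append(frame.loc[mask])
    return pd.concat(frames)


boundary = pd.Timestamp("2022-12-31 00:00", tz=TZ)

# Synthetic data: the first block does NOT contain the boundary timestamp,
# the second block starts exactly on it.
first = pd.Series(
    [1.0, 2.0],
    index=pd.DatetimeIndex(["2022-12-30 22:00", "2022-12-30 23:00"], tz=TZ),
)
second = pd.Series(
    [3.0, 4.0, 5.0],
    index=pd.DatetimeIndex(
        ["2022-12-31 00:00", "2022-12-31 01:00", "2022-12-31 02:00"], tz=TZ
    ),
)
blocks = [
    (pd.Timestamp("2022-12-30 22:00", tz=TZ), boundary),
    (boundary, pd.Timestamp("2022-12-31 02:00", tz=TZ)),
]

current = filter_and_concat([first, second], blocks, use_previous_frame=False)
proposed = filter_and_concat([first, second], blocks, use_previous_frame=True)

print(boundary in current.index)   # the boundary row is dropped
print(boundary in proposed.index)  # the boundary row is kept
```

One caveat of the proposed mask: `frames[-1].index[-1]` raises an `IndexError` if the previous frame is empty, so a real patch would probably need to handle that case.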

entsoe.version == "0.6.16"

Tijoxa commented 1 day ago

To clarify, the year_limited decorator tries to enforce the _start and _end timestamps on the result frame. However, this can have other side effects:

To reproduce the bug

import pandas as pd
from entsoe import EntsoePandasClient

client = EntsoePandasClient(api_key="397870bf-afdc-4422-ba9e-2d8ef803fa2a")  # API key from Sebastien de Menten (GFJ138), use with care
client.session.verify = False
df = client.query_installed_generation_capacity(
    "FR",
    start=pd.Timestamp("2017-01-01", tz="Europe/Paris"),
    end=pd.Timestamp("2023-01-01", tz="Europe/Paris"),
)
print(df.index)

outputs

DatetimeIndex(['2017-01-01 00:00:00+01:00'], dtype='datetime64[ns, Europe/Paris]', freq=None)

This happens because, for the first block, the API returns a single row dated '2017-01-01 00:00:00+01:00', and the decorator does not filter it out since it is the first frame. For each of the following blocks, the API returns the first timestamp of that year ('2018-01-01 00:00:00+01:00', '2019-01-01 00:00:00+01:00', …), and those rows are filtered out by the condition frame.index > _start.
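
The yearly case can be simulated without calling the API. The sketch below is only an illustration under stated assumptions: each yearly block is assumed to return exactly one row stamped at the block start, the `capacity` values are placeholders, and the two masks are written as described in this thread rather than copied from the library.

```python
import pandas as pd

TZ = "Europe/Paris"

# One (_start, _end) pair per yearly block, 2017 through 2022.
blocks = [
    (pd.Timestamp(f"{year}-01-01", tz=TZ), pd.Timestamp(f"{year + 1}-01-01", tz=TZ))
    for year in range(2017, 2023)
]

kept_current = []   # frames kept by the assumed current mask
kept_proposed = []  # frames kept by the mask proposed above

for _start, _end in blocks:
    # Assumed API answer: a single row stamped at the start of the block.
    frame = pd.DataFrame({"capacity": [0.0]}, index=pd.DatetimeIndex([_start]))

    if not kept_current:
        mask_current = frame.index <= _end
        mask_proposed = frame.index <= _end
    else:
        # Assumed current behaviour: rows equal to _start are dropped.
        mask_current = (frame.index <= _end) & (frame.index > _start)
        # Proposed behaviour: only rows overlapping the previous frame are dropped.
        mask_proposed = (frame.index <= _end) & (
            frame.index > kept_proposed[-1].index[-1]
        )

    kept_current.append(frame.loc[mask_current])
    kept_proposed.append(frame.loc[mask_proposed])

current_index = [ts for f in kept_current for ts in f.index]
proposed_index = [ts for f in kept_proposed for ts in f.index]

print(len(current_index))   # only the 2017 row survives
print(len(proposed_index))  # one row per year
```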

Your solution handles that well too.