Discarded data in EntsoePandasClient._query_unavailability

alp-yg commented 3 years ago

Hi there,

While looking at the data returned by a call to EntsoePandasClient._query_unavailability, I saw that there was data missing from what we can see on the transparency website.

It turns out that the issue comes from the year_limited decorator, and more precisely from this line (l.839 of entsoe.py in the current version): df = df.loc[~df.index.duplicated(keep='first')]

What is the intended purpose of this line? I'm asking this because at the moment, data returned by the API for doctypes A77/A80 is split by time periods, not by version or something like that. That means that this line only keeps the first period of the entire actual outage, discarding lots of data in the process.
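To illustrate the effect, here is a minimal sketch on made-up data (not a real API response). The column names and the choice of index are assumptions for illustration only; the point is just that rows sharing an index value collapse to the first one.

```python
import pandas as pd

# Hypothetical stand-in for one parsed outage split into three time periods.
# All three rows share the same index value (assumed here to be a document
# creation time); the column names are illustrative only.
created = pd.Timestamp("2021-03-01 08:00", tz="UTC")
df = pd.DataFrame(
    {
        "start": pd.to_datetime(
            ["2021-03-02 00:00", "2021-03-03 00:00", "2021-03-04 00:00"], utc=True
        ),
        "end": pd.to_datetime(
            ["2021-03-03 00:00", "2021-03-04 00:00", "2021-03-05 00:00"], utc=True
        ),
        "avail_qty": [100.0, 50.0, 0.0],
    }
)
df.index = pd.DatetimeIndex([created] * 3)

# The line in question: keep only the first row for each index value.
kept = df.loc[~df.index.duplicated(keep="first")]

print(len(df), "rows before,", len(kept), "row(s) after")
```

Only the first period survives; the other two are dropped even though their contents differ.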

In my opinion, what needs to change is the parsing of outages (function _outage_parser in file parsers.py), so that it concatenates the different time periods and thus returns a dataframe with only one line.

Best, Yannick

fboerman commented 3 years ago

@JrtPec could you perhaps provide more info on why this was designed this way?

JrtPec commented 3 years ago

That sounds like a bug indeed.

I suppose the original design had a simple reason: when combining data of multiple years, an overlap in indices was noticed by someone (me, I suppose), so I added a line that deletes duplicate indices after the merge.
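For context, a rough sketch of that situation, assuming two yearly chunks that both contain the boundary timestamp. The helper name combine_chunks and the data are hypothetical; this is not the library's actual decorator code.

```python
import pandas as pd


def combine_chunks(chunks):
    """Hypothetical helper: concatenate per-year frames, then drop repeated index values."""
    df = pd.concat(chunks).sort_index()
    return df.loc[~df.index.duplicated(keep="first")]


# Two made-up yearly chunks; both include 2021-01-01 00:00 UTC.
year_1 = pd.DataFrame(
    {"value": [1.0, 2.0]},
    index=pd.to_datetime(["2020-12-31 23:00", "2021-01-01 00:00"], utc=True),
)
year_2 = pd.DataFrame(
    {"value": [2.0, 3.0]},
    index=pd.to_datetime(["2021-01-01 00:00", "2021-01-01 01:00"], utc=True),
)

combined = combine_chunks([year_1, year_2])
print(combined)
```

This works well for a plain time series, where each timestamp should appear once. It goes wrong as soon as several legitimate rows share an index value, as in the outage case above.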

andreasbrinch commented 2 years ago

I believe the fix introduces another bug when the data is the same for multiple indices.

I am missing entries with the following code:

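import pandas as pd
from entsoe import EntsoePandasClient

client = EntsoePandasClient(api_key='YOUR_API_KEY')  # placeholder key
start = pd.Timestamp('20220214', tz='Europe/Copenhagen')
end = pd.Timestamp('20220216', tz='Europe/Copenhagen')
country_code = 'DK_1'
df = client.query_day_ahead_prices(country_code, start=start, end=end)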

I guess the following could solve both cases: df = df.loc[~df.duplicated(keep='first') | ~df.index.duplicated(keep='first')]
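A small sketch of how that combined mask behaves compared with either check alone, on made-up data (the values and timestamps are illustrative, not real prices):

```python
import pandas as pd

t0 = pd.Timestamp("2022-02-14 00:00", tz="UTC")
t1 = pd.Timestamp("2022-02-14 01:00", tz="UTC")
t2 = pd.Timestamp("2022-02-14 02:00", tz="UTC")

# Made-up rows covering the cases discussed in this thread:
#   t0/10 and t1/10 : same value at different timestamps (must be kept)
#   t1/10 twice     : exact repeat, same timestamp and same value (should be dropped)
#   t2/20 and t2/30 : same timestamp, different data (must be kept)
df = pd.DataFrame(
    {"value": [10.0, 10.0, 10.0, 20.0, 30.0]},
    index=pd.DatetimeIndex([t0, t1, t1, t2, t2]),
)

by_index = df.loc[~df.index.duplicated(keep="first")]
by_values = df.loc[~df.duplicated(keep="first")]
combined = df.loc[~df.duplicated(keep="first") | ~df.index.duplicated(keep="first")]

print(list(by_index["value"]))
print(list(by_values["value"]))
print(list(combined["value"]))
```

One caveat with this mask: df.duplicated compares column values only, so a row with a repeated timestamp is also dropped if its values happen to match an earlier row at a different timestamp. Comparing index and values together, for example via df.reset_index().duplicated(), would be a stricter alternative, though that is only a suggestion and not what was proposed here.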

fboerman commented 2 years ago

Hi @andreasbrinch, I changed some things and am no longer missing data in query_day_ahead_prices. Could you confirm that with the latest version you are no longer missing data either?

fboerman commented 12 months ago

discussion on duplication issues is being handled in #235

EnergieID / entsoe-py

Discarded data in EntsoePandasClient._query_unavailability #85