More specification for behaviour of target_datetime

jarek commented 6 years ago

Hi,

I was thinking about implementing target_datetime in some of the parsers, and came up with the following questions. If you let me know what you prefer, I can update comments in example parser and README to match :+1:

What is expected behaviour if target_datetime doesn't exactly match available data?
- example: 18:45 is requested, but data is only available in hourly segments. Return closest data point (19:00) or return None? I see that ENTSOE parser has closest_in_time_key logic that would do the former - should that be the guideline?
- example: 18:30 is requested, but data is only available in hourly segments. Round up or down, or return None when ambiguous?
What is expected behaviour when the parser can get some historical data, but not necessarily all? Examples: there might be data for only 2018 and not 2017; or there is data for 2018-03-01 and 2018-03-03 but 2018-03-02 is missing.
- return empty dict, or empty list?
- return None?
- raise NotImplementedError?
Is it worth implementing target_datetime if max time in the past is within 24 hours, or is it sufficient to return 24 hours' worth of history when invoked without a target_datetime parameter?
ENTSOE parser's fetch_exchange always returns a list, even when returning one datapoint via target_datetime parameter. Is this required, or is it fine to return a list normally but a dict when called with target_datetime?

maxbellec commented 6 years ago

Thanks a lot, jarek, for your questions. We haven't finished the specification yet, so your input definitely helps. I'm currently changing the ENTSOE.py code, so it shouldn't be taken as a reference.

In general, since there are many parsers and only one function to launch them all (let's call it launch_parser), as much of the logic as possible should live in launch_parser so that the parsers can be as simple as possible.

Ideally, the parser should return as many values as possible (but keeping it simple, only one query) starting from the target_datetime. If with a single query the parser can get data for a whole day (three days / 5 hours ...), it should return the data for the whole day (three days / 5 hours ...). If the parser only fetches a single datapoint, it should return a single datapoint. We will keep a hard-coded record of the timespan each parser can fetch. The idea is that if we're missing a week, we'll launch the parser once for every day if it can return for a whole day, or once for every hour if it can handle only a single datapoint.
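As a rough illustration of that backfilling idea, here is a minimal sketch that enumerates the target_datetime values needed to cover a gap. The function name `backfill_targets` and the `parser_span` argument are hypothetical, not part of the actual codebase; `parser_span` stands in for the hard-coded record of how much data one query to a given parser returns.

```python
from datetime import datetime, timedelta


def backfill_targets(gap_start, gap_end, parser_span):
    """List the target_datetime values needed to cover [gap_start, gap_end).

    parser_span is an assumed, hard-coded timespan that a single parser
    query returns, e.g. timedelta(days=1) or timedelta(hours=1).
    """
    if parser_span <= timedelta(0):
        raise ValueError("parser_span must be positive")
    targets = []
    current = gap_start
    while current < gap_end:
        targets.append(current)
        current += parser_span
    return targets


# A missing week: one launch per day for a parser that returns a whole day,
# one launch per hour for a parser that returns a single hourly datapoint.
week_start = datetime(2018, 3, 1)
week_end = week_start + timedelta(days=7)
daily = backfill_targets(week_start, week_end, timedelta(days=1))
hourly = backfill_targets(week_start, week_end, timedelta(hours=1))
```

Keeping this loop on the launcher side means a parser never needs to know how large the gap is; it only answers one query per call.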

If the parser cannot fetch for the required datetime (whatever the reason), it should return None.

So, regarding your questions more specifically:

- The parser should return all datapoints within the range it can return (whether there is data every 5 minutes, every hour or every 6 hours).
- If it returns a single datapoint, return the datapoint closest in time to the requested target_datetime, whatever the time difference between the two (launch_parser will throw away values that are too far from the requested datetime). If two datapoints are equally close, return either or both. When returning the datapoint, the datetime value should correspond to the datapoint's own datetime, not the requested datetime.
- Yes, it is worth implementing target_datetime even if you can only get data for the last 24 hours. In that case, always returning data for the last 24 hours, whatever the (non-None) target_datetime, is perfectly valid (launch_parser will throw away the values too far from the target_datetime).
- If fetching a single datapoint, returning a single-item list or a single element are both valid.
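For the single-datapoint case, the "closest in time" rule could look something like the sketch below. The helper name `closest_datapoint` and the `value` key are made up for illustration, and the datapoints are assumed to be dicts carrying a `datetime` key.

```python
from datetime import datetime


def closest_datapoint(datapoints, target_datetime):
    """Return the datapoint nearest in time to target_datetime, or None.

    The returned dict keeps its own 'datetime', not the requested one.
    On a tie, min() keeps the first candidate, which is acceptable since
    either datapoint may be returned.
    """
    if not datapoints:
        return None
    return min(datapoints, key=lambda dp: abs(dp["datetime"] - target_datetime))


# Hourly data, as in the 18:45 and 18:30 examples from the question.
hourly = [
    {"datetime": datetime(2018, 3, 1, 18, 0), "value": 100},
    {"datetime": datetime(2018, 3, 1, 19, 0), "value": 120},
]
nearest = closest_datapoint(hourly, datetime(2018, 3, 1, 18, 45))
tied = closest_datapoint(hourly, datetime(2018, 3, 1, 18, 30))
```

Here `nearest` is the 19:00 datapoint, and `tied` happens to be the 18:00 one only because it comes first in the list.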

This may not be super clear; don't hesitate to ask if something's unclear or if you feel we can do something easier / better.

maxbellec commented 6 years ago

@corradio I believe it's what we talked about, don't hesitate to react if it's not or if something wasn't clear

corradio commented 6 years ago

I think that pretty much sums it up. Thanks @maxbellec ! I might add that in this iteration we're trying to stay as agile as possible and so we're optimising for simplicity rather than future scalability. With that in mind, we might want to add more information to parsers themselves in the future to optimise things further - but for now, we're keeping it simple.

jarek commented 6 years ago

Okay, thanks!

To try to summarize:

- parsers can return as much data as feasible, provided it is close to target_datetime, with the guideline being the amount of data returned in one HTTP request by the source API
- timestamp datapoints correctly, as indicated by the source API
- don't worry too much about how close is close enough, as the backend will reject datapoints it doesn't like
- when no matching data is available, return None
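To make the division of labour concrete, here is a minimal sketch of what the backend-side rejection might look like. The function name `drop_distant_datapoints` and the `tolerance` parameter are assumptions for illustration; the real backend logic and its thresholds are not described in this thread.

```python
from datetime import datetime, timedelta


def drop_distant_datapoints(result, target_datetime, tolerance):
    """Keep only datapoints within `tolerance` of target_datetime.

    Accepts what a parser may return: None, a single dict, or a list of
    dicts, each assumed to carry a 'datetime' key.
    """
    if result is None:
        return []
    if isinstance(result, dict):
        result = [result]
    return [
        dp for dp in result
        if abs(dp["datetime"] - target_datetime) <= tolerance
    ]


target = datetime(2018, 3, 2, 12, 0)
parser_output = [
    {"datetime": datetime(2018, 3, 2, 11, 0)},
    {"datetime": datetime(2018, 3, 2, 12, 0)},
    {"datetime": datetime(2018, 3, 5, 12, 0)},  # three days off
]
kept = drop_distant_datapoints(parser_output, target, timedelta(hours=2))
```

With a filter like this on the launcher side, a parser can return whatever one request yields without deciding for itself how close is close enough.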

I think that makes sense - certainly it does for now.

maxbellec commented 6 years ago

We talked about it again with @corradio. @jarek I'll steal your summary and add:

- target_datetime means the datetime of the latest data the parser will return. So if the parser returns data for 24 hours, it should return data from 24 hours before target_datetime until target_datetime. The idea is that live data can be handled by simply passing target_datetime=datetime.datetime.now()
- return as much data as feasible, provided it is close to target_datetime, with the guideline being the amount of data returned in one HTTP request by the source API
- timestamp datapoints correctly, as indicated by the source API
- don't worry too much about how close is close enough, as the backend will reject datapoints it doesn't like
- when no matching data is available, return None
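A parser following these rules might be shaped roughly like the sketch below. It is not taken from example.py: `fetch_window`, `source_rows` and `span` are hypothetical names, with `source_rows` standing in for whatever a single HTTP request to the source API returns. Naive datetimes are used only to keep the example short.

```python
from datetime import datetime, timedelta


def fetch_window(source_rows, target_datetime=None, span=timedelta(hours=24)):
    """Return datapoints from (target_datetime - span) up to target_datetime.

    source_rows is an assumed list of dicts with a 'datetime' key, as if
    already parsed from one response. Returns None when nothing matches.
    """
    if target_datetime is None:
        # Live data is just the window ending now.
        target_datetime = datetime.now()
    window_start = target_datetime - span
    selected = [
        row for row in source_rows
        if window_start <= row["datetime"] <= target_datetime
    ]
    return selected or None


rows = [{"datetime": datetime(2018, 3, 1) + timedelta(hours=h)} for h in range(72)]
target = datetime(2018, 3, 3, 0, 0)
window = fetch_window(rows, target_datetime=target)
missing = fetch_window(rows, target_datetime=datetime(2017, 3, 3))
```

In this example `window` covers 2018-03-02 00:00 through 2018-03-03 00:00 inclusive, each row keeps the timestamp given by the source, and `missing` is None because the source has nothing for 2017.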

I'll adapt example.py as a consequence

jarek commented 6 years ago

This looks fine now after #1237, I'll close it. Thanks!

electricitymaps / electricitymaps-contrib

More specification for behaviour of target_datetime #1203