earthobservations / wetterdienst

Open weather data for humans.
https://wetterdienst.readthedocs.io/
MIT License

Aligning radar data upstream timestamps to straight/floored interval marks #193

Closed amotl closed 4 years ago

amotl commented 4 years ago

Introduction

Within #190, we are trying to cover all possibilities to acquire radar data from the DWD data repository. We found different anomalies there. The most prominent one is the 5 minute mark alignment problem, which revolves around whether specific files can be addressed by timestamp. So, I would like to start a discussion about that within a separate issue (here).

cc @meteoDaniel, @kmuehlbauer

amotl commented 4 years ago

Remark: Within this post, I've deliberately reformatted timestamps extracted from filenames in order to improve readability.

Timestamps of BINARY files within /radar/sites

When looking at the data again in more depth, I just found that only data for dx, pf, px, px250 and sweep_pcp within [1] have timestamps within their filenames aligned to straight intervals of 5 minute marks. Data within other folders deviate on the minute part.

Let's use PE_ECHO_TOP as an example. Some samples from there [2]:

[1] https://opendata.dwd.de/weather/radar/sites/
[2] https://opendata.dwd.de/weather/radar/sites/pe/boo/

Timestamps of HDF5 files within /radar/sites

Similar to what can be observed for the BINARY files, the timestamps embedded into HDF5 file names show the same behavior.

Let's use sweep_vol_v as an example. Some samples from there [3]:

Here, timestamps for the different elevations are not aligned with each other on the same 5 minute marks.

[3] https://opendata.dwd.de/weather/radar/sites/sweep_vol_v/boo/hdf5/filter_polarimetric/

amotl commented 4 years ago

Thoughts

While we will probably be safe to ignore the seconds part on all timestamps, all these timestamps will still deviate in their minutes parts.

Until we get this straight, the approach to address files by specific timestamps through the corresponding fileindex will not be feasible. The average user will probably expect the timestamps to be properly aligned to 5 minute marks like :00, :05, :10, etc.

kmuehlbauer commented 4 years ago

@amotl Thanks for raising this. I'll try to shed some light on this.

The DWD radars run a 5 minute schedule where a precip scan (one sweep) is recorded and after that a volume scan (10 sweeps). The filename timestamp refers to the sweep start time. The filename timestamps of the two moments (Z, V) are the same for the same sweep, since they are actually derived from the same measurement.

For the precip scan I would expect to get the file inside the 5 minute span (eg :00 to :05).

For the volume scan I would expect to get all selected sweeps (elevations) in the requested 5 minute span.

Does that make sense?

amotl commented 4 years ago

Dear Kai,

thanks for your explanations. So, everything will probably be just fine when we query the respective data by timestamp ranges? Sorry that I might have missed this detail, as I made all those tests within #190 just invoke the machinery with single point-in-time timestamps.

I am also lacking the domain knowledge here which you and @meteoDaniel have. I am just asking how to provide a convenient interface to the user (DWIM). Will she always have to provide two timestamps (start_date and end_date) to reach out to those data?

Or would a heuristic make sense to let the user provide a single timestamp and let the machinery deduce the appropriate time range from the surrounding 5 minute marks?

With kind regards, Andreas.

kmuehlbauer commented 4 years ago

@amotl There are two possibilities. First, if the user requests the 15:00 volume, she expects the data from 14:55 to 15:00 or second, if the user requests 15:00, she expects 15:00 to 15:05. In our group we use the latter, not only because it aligns with the day.

That said, if given only a start_date I would expect data from that 5 minutes timerange aligned to full 5 minutes.

amotl commented 4 years ago

Thanks again. Your insights about best practices absolutely make sense.

That said, if given only a start_date I would expect data from that 5 minutes timerange aligned to full 5 minutes.

To be more precise on that, if the user says start_date=15:00, the machinery should automatically expand that to end_date=15:04:59 before querying the fileindex? I am asking that because I fear the process could catch files already belonging to the next volume when naively expanding to end_date=15:05:00 - of course depending on how this constraint will be applied (less vs. less-or-equals).
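
The less vs. less-or-equals concern can be illustrated with a minimal sketch. It assumes a pandas DataFrame with a hypothetical `datetime` column standing in for the fileindex; the column names and sample values are made up for illustration and are not taken from wetterdienst.

```python
import pandas as pd

# Hypothetical file index: one row per file, timestamps as found in filenames.
file_index = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            [
                "2020-09-30 15:00:31",
                "2020-09-30 15:03:12",
                "2020-09-30 15:05:00",  # already belongs to the next volume
            ]
        ),
        "filename": ["sweep_a", "sweep_b", "sweep_c"],
    }
)

start = pd.Timestamp("2020-09-30 15:00:00")
end = start + pd.Timedelta(minutes=5)

# Half-open interval [start, end): the upper bound is excluded.
half_open = file_index[(file_index["datetime"] >= start) & (file_index["datetime"] < end)]

# Closed interval [start, end]: also catches the file stamped 15:05:00.
closed = file_index[(file_index["datetime"] >= start) & (file_index["datetime"] <= end)]

half_open_files = half_open["filename"].tolist()
closed_files = closed["filename"].tolist()
```

With a strict "less than" on the upper bound, expanding to end_date=15:05:00 would not catch files of the next volume, so the 15:04:59 workaround would only be needed for a less-or-equals comparison.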

As you know, the devil is always in the details and I am aiming to get everything both correct and DWIM here.

kmuehlbauer commented 4 years ago

@amotl Absolutely!

amotl commented 4 years ago

Thanks again. When trying to start implementing this, I just found the fileindex query logic can not do timerange-based queries yet. Its machinery prepares a set of discrete datetime values within __build_date_times() [1]. While there is a pd.date_range() involved when computing the datetime values, it does not do the right thing in the end when the query is actually applied using an equality match like

# Filter by date.
fi = file_index[file_index[DWDMetaColumns.DATETIME.value] == date_time]

within _collect_radar_data() or create_filepath_for_radolan(). That will never be capable of invoking a timerange query, bummer.

We will have to adjust that properly before trying to implement the behavior we are aiming at here.
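
To make the limitation concrete, here is a small sketch under stated assumptions: the DataFrame, its `datetime` column and the sample timestamps are hypothetical stand-ins for the fileindex. It compares matching against discrete, well-aligned values with a range condition.

```python
import pandas as pd

# Hypothetical file index with timestamps that are not aligned to 5 minute marks.
file_index = pd.DataFrame(
    {
        "datetime": pd.to_datetime(["2020-09-30 15:01:07", "2020-09-30 15:06:09"]),
        "filename": ["file_1", "file_2"],
    }
)

# Discrete, well-aligned candidates as produced by pd.date_range().
date_times = pd.date_range("2020-09-30 15:00", "2020-09-30 15:10", freq="5min")

# Equality matching: none of the aligned candidates equals an actual timestamp.
by_equality = file_index[file_index["datetime"].isin(date_times)]

# Range matching: everything between the first and last candidate is found.
by_range = file_index[
    (file_index["datetime"] >= date_times[0]) & (file_index["datetime"] < date_times[-1])
]
```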

[1] https://github.com/earthobservations/wetterdienst/blob/radar-more/wetterdienst/dwd/radar/api.py
[2] https://github.com/earthobservations/wetterdienst/blob/radar-more/wetterdienst/dwd/radar/access.py

gutzbenj commented 4 years ago

# Filter by date.
fi = file_index[file_index[DWDMetaColumns.DATETIME.value] == date_time]

This should be possible with

file_index = file_index.set_index("DATETIME")
# Add the searched datetimes to the index, then fill the new rows from the next entry.
file_index = file_index.reindex(file_index.index.union(searched_datetimes))
file_index = file_index.bfill()

files = file_index.loc[searched_datetimes, "FILENAME"]

where searched_datetimes is the entered datetimes series.
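
A runnable sketch of this reindex-and-backfill idea on made-up data follows. The sample timestamps and filenames are hypothetical, and the sketch uses `Index.union()` and `DataFrame.bfill()`, which are assumed here to be the current pandas spellings of the same steps.

```python
import pandas as pd

# Hypothetical file index, indexed by the timestamps found in the filenames.
file_index = pd.DataFrame(
    {
        "DATETIME": pd.to_datetime(["2020-09-30 15:01:07", "2020-09-30 15:06:09"]),
        "FILENAME": ["file_1", "file_2"],
    }
).set_index("DATETIME")

searched_datetimes = pd.DatetimeIndex(["2020-09-30 15:00:00", "2020-09-30 15:05:00"])

# Insert the searched timestamps as empty rows, then fill each one with the
# next available file entry ("backfill" copies values from later rows).
expanded = file_index.reindex(file_index.index.union(searched_datetimes))
expanded = expanded.bfill()

files = expanded.loc[searched_datetimes, "FILENAME"].tolist()
```

Note that backfilling picks the next file regardless of how far away it is in time, so a gap in the upstream data would silently map a searched timestamp to a much later file.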

meteoDaniel commented 4 years ago

I have a rather different view of that process: if I run a script at 15:03, I expect the data for 15:00, because my main goal is to download data as soon as possible. For this case, the latest option is a great fit, so it would be good to keep this feature. Apart from that, I am fine with proceeding as @kmuehlbauer suggested for selecting datetimes.

kmuehlbauer commented 4 years ago

@meteoDaniel Your workflow does not necessarily contradict my suggestion, but you made a point here. A latest option, as I suggested it, would refer to the current 5 minute timespan. In your case (15:03) it would retrieve all files which are available from 15:00:00 to 15:04:59. But what if the latest file is from 14:58:30? It looks like I forgot that possibility and yes, I would use a latest option in that context, too. Any thoughts?

Please bear with me while I'm trying to understand more of the internals of wetterdienst.

meteoDaniel commented 4 years ago

@kmuehlbauer now I understand your point. The main idea of wetterdienst is to define what you want to have and receive it. E.g. to receive temperature data, you pass Parameter.TEMPERATURE, TimeResolution.HOURLY and PeriodType.RECENT for a given set of station ids. The function returns a DataFrame with the data. Acquire radar data has to treat different due to a time schedule orientated publishing of the data. So it is quite new to wetterdienst.

One possible solution could be to list all file names and use a regex to floor all minutes and seconds.

    reformat_mapping = {'33': '30', '03': '00'}  # and so on

    data[FILE_NAME].replace(reformat_mapping, regex=True)

Afterwards we are able to derive 5 minute interval timestamps from the directory. This output can be added to the file_index
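
As an alternative to a regex mapping, swapped in here purely for illustration, the timestamp can be parsed out of each filename and floored with pandas. This is a minimal sketch: the filename pattern below is hypothetical and only assumes that a 14-digit YYYYmmddHHMMSS timestamp is embedded somewhere in the name.

```python
import pandas as pd

# Hypothetical filenames; only the embedded 14-digit timestamp matters here.
filenames = pd.Series(
    [
        "sweep_vol_v_0-20200930150331_10132_hd5",
        "sweep_vol_v_1-20200930150358_10132_hd5",
        "sweep_vol_v_2-20200930150712_10132_hd5",
    ]
)

# Extract the first run of 14 digits and parse it as a timestamp.
stamps = pd.to_datetime(filenames.str.extract(r"(\d{14})")[0], format="%Y%m%d%H%M%S")

# Floor each timestamp to the 5 minute mark at or before it.
floored = stamps.dt.floor("5min")
marks = floored.dt.strftime("%H:%M").tolist()
```

Flooring parsed timestamps avoids having to enumerate every minute value in a replacement mapping, and the floored column could be added to the file index next to the original timestamps.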

@ kai: Do you know more about the scan schedule at DWD?

kmuehlbauer commented 4 years ago

@meteoDaniel Thanks for making this clear.

Acquire radar data has to treat different due to a time schedule orientated publishing of the data. So it is quite new to wetterdienst.

I very much appreciate the efforts to include this functionality into wetterdienst.

amotl commented 4 years ago

Hi again,

thanks for this discussion, appreciate it! Thanks also for all your suggestions. My thoughts on this are to just use this basic idiom for querying the existing fileindex by datetime ranges as we already do within the subsystem handling "observations" data:

https://github.com/earthobservations/wetterdienst/blob/85e72ccdbd00f0e8285e1ba24800dfafb81ccd63/wetterdienst/dwd/observations/api.py#L207-L212

In that way, we would not have to tweak the fileindex itself, like @gutzbenj and @meteoDaniel suggested in different ways. I believe this is the most straightforward thing to do here instead of either applying backfilling or replacement operations on the datetime column of the fileindex, without amending the original data.

In order to implement this in a DWIM-style manner from the perspective of the user, I'd suggest the following:

from datetime import timedelta

import pandas as pd

if not end_date:

    # Align "start_date" to the 5 minute mark at or before it.
    # https://stackoverflow.com/a/3464000
    start_date_input = pd.to_datetime(start_date)
    start_date_aligned = start_date_input - timedelta(
        minutes=start_date_input.minute % 5,
        seconds=start_date_input.second,
        microseconds=start_date_input.microsecond,
    )

    # Expand "end_date" to the end of the 5 minute interval.
    start_date = start_date_aligned
    end_date = start_date_aligned + timedelta(minutes=5)

This solution will enable the user to specify a single arbitrary point in time which will expand to an interval of 5 minutes with well-aligned boundaries designating the volume to query the data. On top of that, synthesizing a "latest" retrieval option is easy by just using start_date=now().

With kind regards, Andreas.

amotl commented 4 years ago

Currently, the date_times parameter is the single parameter designating the time-part when querying data. https://github.com/earthobservations/wetterdienst/blob/f34ef6001cb1bc1f138b1dbad23076eff5706c05/wetterdienst/dwd/radar/api.py#L52-L57

When alternatively using the start_date and end_date parameters, date_times will be derived from them using __build_date_times().

Then, it is passed down to collect_data(): https://github.com/earthobservations/wetterdienst/blob/f34ef6001cb1bc1f138b1dbad23076eff5706c05/wetterdienst/dwd/radar/api.py#L145-L160

However, all of that doesn't fit our bill here. We'd really have to pass down start_date and end_date into the machinery in order to perform real range queries on the dataframe as outlined in my previous comment. Using a list of discrete timestamps embedded within date_times just would not work.

Please correct me if I'm wrong on any of these details.

amotl commented 4 years ago

Dear Daniel,

the reason why I am elaborating this in detail is that I recognized you had introduced the date_times parameter the other day in order to retrieve RADOLAN data appropriately, which is aligned to HH:50 interval boundaries. While I've tried to keep that parameter during my work on #190, I believe it might make sense to get rid of it completely now and just pass start_date and end_date around, optionally extrapolating end_date like outlined above.

My rationale on this is that the need to compute a list / series of date_times upfront puts too much burden on the user.

Because I don't want to step on your toes here with respect to the intended style of access to the data, I am humbly asking if a different style comprised of start_date and end_date parameters will also fit your requirements for accessing both RADOLAN and other radar data?

With kind regards, Andreas.

meteoDaniel commented 4 years ago

Dear @amotl, I am fine with your outline. It is reasonable to go this way, and that is what matters in the end. I am also pretty sure that, from a performance perspective, it does not matter how this is handled. So I think that on a point like this, it is up to the contributor to decide how to deal with the problem. Thanks a lot and have a great day!

amotl commented 4 years ago

Hi Daniel,

thanks for acknowledging what I am planning here with respect to the appropriate alignment of requests to respective 5 minute marks, which is applicable to all non-RADOLAN radar data.

On the other hand, in order to protect the alignment of RADOLAN requests to straight HH:50 interval marks, I've just added two additional tests through 134504fd0a.

These tests aim to verify the date_times tweaking logic https://github.com/earthobservations/wetterdienst/blob/134504fd0add323936cf322ae3b73945f92d96dc/wetterdienst/dwd/radar/api.py#L123-L125

Flooring 00:53:53 to 00:50:00 works perfectly, see https://github.com/earthobservations/wetterdienst/blob/134504fd0add323936cf322ae3b73945f92d96dc/tests/dwd/radar/test_api_historic.py#L28-L45

However, flooring 00:42:42 to 23:50 on the previous day https://github.com/earthobservations/wetterdienst/blob/134504fd0add323936cf322ae3b73945f92d96dc/tests/dwd/radar/test_api_historic.py#L48-L65

does not work as expected yet:

E       AssertionError: assert [Timestamp('2019-08-08 00:50:00')] == [datetime.datetime(2019, 8, 7, 23, 50)]
E         At index 0 diff: Timestamp('2019-08-08 00:50:00') != datetime.datetime(2019, 8, 7, 23, 50)
E         Full diff:
E         - [datetime.datetime(2019, 8, 7, 23, 50)]
E         + [Timestamp('2019-08-08 00:50:00')]

While continuing my work on #190, I will try to also care about that detail if you also believe that going back to the most recent RADOLAN timestamp mark would be the right decision to make here.

With kind regards, Andreas.

meteoDaniel commented 4 years ago

Change

+ pd.Timedelta( 
     minutes=50 
 ) 

to

- pd.Timedelta( 
     minutes=10 
 )

Expected behaviour: Asking at 12:32 for the latest file would return -> 11:50

amotl commented 4 years ago

Hi Daniel,

using - pd.Timedelta(minutes=10) would also respond to 12:54 with 11:50. Sad but true.

Never mind, I've already introduced respective helper functions to solve that, see https://github.com/earthobservations/wetterdienst/blob/968fede64a0181692895a5a149643bca2516ad1c/wetterdienst/util/datetime.py#L4-L16 and https://github.com/earthobservations/wetterdienst/blob/968fede64a0181692895a5a149643bca2516ad1c/wetterdienst/util/datetime.py#L19-L38

Both functions are covered by corresponding tests, so anyone is free to optimize them.
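
For readers who do not want to follow the links, the intended flooring behavior can be sketched as below. This is not the code from wetterdienst/util/datetime.py; `floor_to_hh50` is a hypothetical helper written only to illustrate the cases discussed in this thread.

```python
from datetime import datetime, timedelta


def floor_to_hh50(dt: datetime) -> datetime:
    """Return the most recent HH:50 mark at or before dt (hypothetical helper)."""
    mark = dt.replace(minute=50, second=0, microsecond=0)
    if dt.minute < 50:
        # Before HH:50, the most recent mark lies in the previous hour,
        # which may also be on the previous day.
        mark -= timedelta(hours=1)
    return mark
```

With this logic, 12:32 maps to 11:50, 12:54 maps to 12:50, and 00:42:42 maps to 23:50 on the previous day.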

Cheers, Andreas.

amotl commented 4 years ago

Hi there,

I've introduced the RadarDate.MOST_RECENT option related to RadarParameter.SWEEP_* data in HDF5 format in order to solve a problem deterministically which I stumbled upon when working on the strong test suite which covers all the different flavors.

I have described the background about it within https://github.com/earthobservations/wetterdienst/pull/190#issuecomment-701452354. I hope you are fine with that and I will be happy to answer any questions about it within this discussion.

Cheers, Andreas.

kmuehlbauer commented 4 years ago

@amotl Thanks! Great work! :100: :rocket:

I've created a gist here using the current feature branch. There are a few problems with proper naming. I've added explanations on what I expect, open for discussion on that.

Unfortunately the REPR outputs of xarray are somewhat broken in the notebook rendering, but if you try locally it will just work.

Also you'll see current wradlib CfRadial/ODIM_H5 implementation in action.

MOST_RECENT works nice for the precip scan (only one sweep) in most cases. But if you request these files before they are acquired and put on the server (say at 15:00:30) then nothing will be fetched or only partially. I'm currently thinking how to tackle this, but have not yet an idea. Maybe it's the users responsibility.

Same is true for the volume scan where only already acquired sweeps (in the current 5 min span) are fetched. So if you want to get the latest volume complete, you would need to start your request at 15:04:XX (after the last sweep is acquired). Here too, I do not yet have a suggestion how to make this work.

What came to mind would be LATEST_FULL (or something along that line) which would retrieve the latest full five minutes (request 15:03, retrieve 14:55-14:59:59). Also a time span request (start_date/end_date) would be nice. With that the user could build up its own machinery of multiple requests.

Anyway, the retrieval worked smoothly. Do you have a hint for me how to avoid the creation of tempfiles? I was trying to wrap the buffer like a file-like object, but wasn't successful so far.

kmuehlbauer commented 4 years ago

Anyway, the retrieval worked smoothly. Do you have a hint for me how to avoid the creation of tempfiles? I was trying to wrap the buffer like a file-like object, but wasn't successful so far.

Never mind, it works just fine using the buffer, ~I'll have to make wradlib use the buffer (not implemented yet)~.

Update: This works out of the box in wradlib. I was just using the buffer incorrectly. That's even more impressive, since we can now transfer the data directly into an in-memory xarray dataset. No hassle with files. Very, very nice and convenient...

amotl commented 4 years ago

Dear Kai,

thanks a bunch for your feedback on this. Great that it already works reasonably for you. Also thanks for creating the gist which shows everything in action and also great to see that you already used the new RadarResult.url attribute coming from the most recent e4057c3 to apply some ad hoc filtering by subset=filter_simple on the HDF5 data.

I also found the RadarParameter.SWEEP_VOL_PRECIPITATION_V etc. constants to be a bit misnamed, so thanks for suggesting variants for appropriate renaming. I will use:

  • SWEEP_PCP_VELOCITY_H and SWEEP_PCP_REFLECTIVITY_H
  • SWEEP_VOL_VELOCITY_H and SWEEP_VOL_REFLECTIVITY_H


But if you request these files before they are acquired and put on the server (say at 15:00:30) then nothing will be fetched or only partially. I'm currently thinking how to tackle this, but have not yet an idea. Maybe it's the users responsibility.

This.

If you want to get the latest volume complete [...]. What came to mind would be LATEST_FULL (or something along that line) which would retrieve the latest full five minutes (request 15:03, retrieve 14:55-14:59:59).

That is exactly what I tried to do with MOST_RECENT, see https://github.com/earthobservations/wetterdienst/pull/190#issuecomment-701452354. I will be happy to rename this to LATEST_FULL or MOST_RECENT_FULL:

# HDF5 folders do not have "-latest-" files, so we will have to synthesize them 
# appropriately by going back to the second last volume of 5 minute intervals.
# The reason for this is that when requesting sweep data in HDF5 format at 
# e.g. HH:12:00, not all files will be available on the DWD data repository 
# for the whole volume (e.g. covering all elevation levels) [...]

I will adjust this inline comment to be a bit more precise by borrowing from your phrasing:

Using start_date=MOST_RECENT will make the machinery retrieve the latest full five minutes by addressing the previous volume of 5 minute intervals (request 15:03, retrieve 14:55-14:59:59).


Also a time span request (start_date/end_date) would be nice. With that the user could build up its own machinery of multiple requests.

That should well be possible. start_date and end_date will happily accept timestamps in str and datetime formats.


Do you have a hint for me how to avoid the creation of tempfiles? I was trying to wrap the buffer like a file-like object, but wasn't successful so far.

I used wrl.io.read_opera_hdf5() which only accepts filenames: https://github.com/earthobservations/wetterdienst/blob/4de076df7e95264fcbcce92369c86f50b0973c37/example/radar/radar_sweep_hdf5.py#L74-L76

Same here with wrl.io.read_dx(): https://github.com/earthobservations/wetterdienst/blob/4de076df7e95264fcbcce92369c86f50b0973c37/example/radar/radar_site_dx.py#L78-L80

On the other hand, wrl.io.read_radolan_composite(item.data) will happily accept the BytesIO object coming back from the acquisition machinery: https://github.com/earthobservations/wetterdienst/blob/4de076df7e95264fcbcce92369c86f50b0973c37/example/radar/radar_radolan_cdc.py#L137-L141

Never mind, it works just fine using the buffer.

Can you show me how to make that work for HDF5 data?

For DX data, it would be nice to improve get_radolan_filehandle(fname) like:

if isinstance(fname, BytesIO):
    return fname

With kind regards, Andreas.

kmuehlbauer commented 4 years ago

@amotl Thanks a bunch. This will really fit nicely into our workflows.

variants for appropriate renaming. I will use:

  • SWEEP_PCP_VELOCITY_H and SWEEP_PCP_REFLECTIVITY_H
  • SWEEP_VOL_VELOCITY_H and SWEEP_VOL_REFLECTIVITY_H

very well chosen. I'm still hoping DWD will add more radar moments. With this wetterdienst is on the safe-side, even if DWD adds SWEEP_PCP_VELOCITY_V (vertical velocity). JFTR, not to be misinterpreted as U/V winds, but radial velocity from H/V-pol channels.

That is exactly what I tried to do with MOST_RECENT, see #190 (comment). I will be happy to rename this to LATEST_FULL or MOST_RECENT_FULL:

Thanks for clarifying again. MOST_RECENT is just fine if we consider a volume request. But I'm still thinking about the naming here, so here are some thoughts I had.

The following would be nice to have from a user perspective, if it could somehow be aligned with your naming convention:

  • RECENT_10 - will fetch last 10 full volumes (15:03 will fetch 14:10 to 14:59:59), 10 could be any reasonable number
  • RECENT_HOUR - will fetch last hour

It's not necessary since the user can easily request what he wants by setting start and end time.

I used wrl.io.read_opera_hdf5() which only accepts filenames:

see https://github.com/wradlib/wradlib/issues/460

Unfortunately read_odim can currently only consume file-like objects for the initial phase of loading. The moment retrieval is done via xarray backend machinery. The calling needs fixing in wradlib. So for the moment we have to go the way via tempfiles.

amotl commented 4 years ago

Dear Kai,

thanks again for sharing your thoughts. I will see what I can do about it. 7162e75 already implements the renaming of the RadarParameter.SWEEP_ constants as you suggested.

Cheers, Andreas.

amotl commented 4 years ago

JFTR

As outlined within https://github.com/earthobservations/wetterdienst/pull/190#issuecomment-701427913, some tests will still fail because simple vs. polarimetric subsets are not taken into account yet. Nothing new. The expected outcome is:

FAILED tests/dwd/radar/test_api_most_recent.py::test_radar_request_site_most_recent_sweep_vol_v_hdf5 - assert 20 == 10

However, we just received this outcome which indicates that at the time of request, the volume we addressed was not completely available yet.

FAILED tests/dwd/radar/test_api_most_recent.py::test_radar_request_site_most_recent_sweep_vol_v_hdf5 - assert 13 == 10

That is exactly what I tried to do with MOST_RECENT, see #190 (comment).

So, we might want to check the implementation of MOST_RECENT in this context again and/or investigate why the data has not been available on the DWD data repository in time, even when going back to the second latest 5 minute interval as discussed.

amotl commented 4 years ago

Now, spotted this when running the same test case at ~14:05:08:

FAILED tests/dwd/radar/test_api_most_recent.py::test_radar_request_site_most_recent_sweep_vol_v_hdf5 - assert 0 == 10

The reason for this flakiness was some result caching we applied to the file index introduced by #169 the other day. Fixed that by removing caching completely there within ad932262d. Unfortunately, this increases the runtime of the radar test suite from ~30s to ~1m30s.

time pytest -vvvvv -k test_radar

Conclusion

There are only two hard things in Computer Science: cache invalidation and naming things.

-- Phil Karlton

As we've probably all been aware of ;].

Edit

Re-enabled caching conditionally with 4d9a2ad0e again. While I am not sure about this in the long run as it will still be able to produce flaky behavior, I dearly need it right now to keep the test suite runtime duration reasonably low while still working on this.
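
One way to bound the staleness that can cause such flaky results is a time-to-live cache. The sketch below is an assumption for illustration only and does not describe what the commits above do; `TTLCache` and its method names are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Tiny time-bounded cache (illustrative only)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: Dict[str, Tuple[float, object]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], object]) -> object:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            # Entry is still fresh: reuse the cached value.
            return entry[1]
        # Entry is missing or expired: recompute and remember the time.
        value = compute()
        self._entries[key] = (now, value)
        return value
```

With a time-to-live well below the 5 minute publishing schedule, a cached file listing would expire before the next volume is expected, while repeated requests within a short test run could still share one listing.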

amotl commented 4 years ago

I've created a gist here using the current feature branch.

-- https://github.com/earthobservations/wetterdienst/issues/193#issuecomment-701919828

While working on #190, I've just added both examples in non-notebook variants through 26ce5a5. Thanks again!

amotl commented 4 years ago

Dear @kmuehlbauer,

I am writing in response to your comment https://github.com/earthobservations/wetterdienst/issues/193#issuecomment-702013231 here in order to summarize the current state of the implementation wrt. the start_date argument.

RadarDate.MOST_RECENT

  • MOST_RECENT - will retrieve most recent full volume (15:03 will fetch 14:55 to 14:59:59)

Exactly.

This option is available for site/SWEEP data

https://github.com/earthobservations/wetterdienst/blob/f43ef6e3e9d4fd0d4d24d6157a639ea985c95438/tests/dwd/radar/test_api_most_recent.py#L18-L24

and RADOLAN_CDC data:

https://github.com/earthobservations/wetterdienst/blob/f43ef6e3e9d4fd0d4d24d6157a639ea985c95438/tests/dwd/radar/test_api_most_recent.py#L104-L108

RadarDate.CURRENT

  • LATEST - will retrieve all sweeps from current 5 minutes span (15:03 will fetch 15:00 and newer) (even if there is no -latest- file)

I've strictly reserved LATEST for retrieving physical *-latest-* files. However, I've now introduced RadarDate.CURRENT, which should do what you are describing here.

This option is available for site/SWEEP data

https://github.com/earthobservations/wetterdienst/blob/f43ef6e3e9d4fd0d4d24d6157a639ea985c95438/tests/dwd/radar/test_api_current.py#L16-L22

and RADOLAN_CDC data:

https://github.com/earthobservations/wetterdienst/blob/f43ef6e3e9d4fd0d4d24d6157a639ea985c95438/tests/dwd/radar/test_api_current.py#L105-L109

RadarDate.RECENT_

  • RECENT_10 - will fetch last 10 full volumes (15:03 will fetch 14:10 to 14:59:59), 10 could be any reasonable number
  • RECENT_HOUR - will fetch last hour

It's not [absolutely] necessary since the user can easily request what he wants by setting start and end time.

I second that. Instead of introducing yet more parameters, I'd also recommend using start_date and end_date, e.g. like:

https://github.com/earthobservations/wetterdienst/blob/f43ef6e3e9d4fd0d4d24d6157a639ea985c95438/tests/dwd/radar/test_api_recent.py#L17-L24

https://github.com/earthobservations/wetterdienst/blob/f43ef6e3e9d4fd0d4d24d6157a639ea985c95438/tests/dwd/radar/test_api_recent.py#L58-L65
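
As a rough sketch of how such start_date and end_date values could be computed on the user side, the hypothetical helper `recent_volumes` below derives the boundaries of the last N completed 5 minute volumes. It is not part of wetterdienst and assumes that the interval length divides 60 evenly.

```python
from datetime import datetime, timedelta
from typing import Tuple


def recent_volumes(now: datetime, count: int, minutes: int = 5) -> Tuple[datetime, datetime]:
    """Return (start, end) covering the last `count` completed intervals before `now`."""
    # Start of the interval that is currently in progress.
    current_start = now - timedelta(
        minutes=now.minute % minutes,
        seconds=now.second,
        microseconds=now.microsecond,
    )
    start = current_start - timedelta(minutes=count * minutes)
    # End one second before the current interval begins.
    end = current_start - timedelta(seconds=1)
    return start, end
```

For a request at 15:03, `count=1` yields 14:55 to 14:59:59, which corresponds to the MOST_RECENT behavior described above, and `count=10` yields 14:10 to 14:59:59 as in the RECENT_10 suggestion.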

With kind regards, Andreas.

kmuehlbauer commented 4 years ago

I am writing in response to your comment #193 (comment) here in order to summarize the current state of the implementation wrt. the start_date argument.

@amotl This is just great work! I'm completely fine with the current state of the implementation. With that, users are able to request whatever data they like. I'll try to test asap.

amotl commented 4 years ago

I believe we want to close this as the outcome is already released with wetterdienst 0.9.0. Thanks again, everybody!

gutzbenj commented 4 years ago

Great work, Andreas!