earthobservations / wetterdienst

Open weather data for humans.
https://wetterdienst.readthedocs.io/
MIT License

Long runtime and wrong number of observations #257

Closed: MuellerSeb closed this issue 3 years ago

MuellerSeb commented 3 years ago

Describe the bug

I experience very long execution times for the following script and get the wrong number of observations:

import numpy as np

from wetterdienst.dwd import observations as obs

sites = obs.DWDObservationSites(
    parameter_set=obs.DWDObservationParameterSet.TEMPERATURE_AIR,
    resolution=obs.DWDObservationResolution.HOURLY,
    period=obs.DWDObservationPeriod.RECENT,
    start_date="2020-06-09 12:00:00",
    end_date="2020-06-09 12:00:00",
)
df = sites.all()
ids, lat, lon = map(np.array, [df.STATION_ID, df.LAT, df.LON])
observations = obs.DWDObservationData(
    station_ids=ids,
    parameters=obs.DWDObservationParameter.HOURLY.TEMPERATURE_AIR_200,
    resolution=obs.DWDObservationResolution.HOURLY,
    start_date="2020-06-09 12:00:00",
    end_date="2020-06-09 12:00:00",
)
temp = np.array(observations.collect_safe().VALUE, dtype=float)
head = "id, lat, lon, temp"
np.savetxt("temp_obs.txt", np.array([ids, lat, lon, temp]).T, header=head)

After ~15 min I get an error, because there are 496 stations but only 494 temperature values, so numpy can't save this to a text file.

Expected behavior

When truncating the station array, it works fine:

ids, lat, lon = ids[:10], lat[:10], lon[:10]

Am I doing something wrong?

Cheers, Sebastian

gutzbenj commented 3 years ago

Dear @MuellerSeb ,

thank you for coming back to us with this.

First of all, several discussions about possible performance issues exist within #256, #226 and #225. The runtime will probably not improve drastically until we figure out a detailed publishing plan for the data that DWD packs every once in a while (hourly data is updated roughly once a day, but when exactly?).

Concerning the number of values: the filter that is applied by the date given in DWDObservationSites() is based on what DWD reports as available data for a specific station and parameter set. This refers to the whole dataset, not only air temperature 2 m above ground! Besides this, the API currently does not return anything if no values are found. We will probably change this to return NaNs instead in a future iteration to get consistent output.
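For illustration, the NaN-filling idea could look like this at the pandas level (a minimal sketch with made-up placeholder names and values, not the wetterdienst API):

import pandas as pd

# Hypothetical illustration: requested_ids are the stations passed to the
# request, df_values is the frame the API returned, lacking the stations
# without data.
requested_ids = ["00044", "00073", "00175"]
df_values = pd.DataFrame(
    {"STATION_ID": ["00044", "00175"], "VALUE": [21.3, 19.8]}
)

# Reindex on the requested ids so stations without data appear as NaN.
df_full = (
    df_values.set_index("STATION_ID")
    .reindex(requested_ids)
    .reset_index()
)
print(df_full)  # station "00073" now carries VALUE = NaN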

Does this roughly answer your question?

gutzbenj commented 3 years ago

Actually I have to answer myself here.

Originally, when I implemented the possibility to request a single parameter instead of a whole parameter set, I assumed that the time series of a parameter that exists in multiple parameter sets would be equal. Unfortunately, when checking the data of station 175 for both the air temperature and vapor pressure datasets (both include the temperature 2 m above ground), I noticed the following:

data from vapor pressure dataset:

STATIONS_ID;MESS_DATUM;QN_8; TT; TD;eor
175;1955010103; 1; -3.0; -8.0;eor
175;1955010106; 1; -3.0; -8.0;eor
175;1955010109; 1; -3.0; -6.0;eor

data from the air temperature dataset:

STATIONS_ID;MESS_DATUM;QN_9;TT_TU;RF_TU;eor
175;1955010103; 5; -2.7; 70.0;eor
175;1955010104; 5; -2.6; 72.0;eor
175;1955010105; 5; -2.6; 73.0;eor
175;1955010106; 5; -2.6; 73.0;eor
175;1955010107; 5; -2.6; 76.0;eor
175;1955010108; 5; -2.9; 78.0;eor
175;1955010109; 5; -2.9; 79.0;eor

Following this, we should actually create stronger relations between a parameter and a set, e.g. air temperature 2 m above ground should be taken from the air temperature dataset.
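Such a relation could be encoded as a simple preference table, sketched here with the enums used elsewhere in this thread (the table itself is hypothetical, not part of the library):

from wetterdienst.dwd import observations as obs

# Hypothetical preference table: each single parameter is pinned to the
# dataset it should be read from, so ambiguous duplicates like TT in the
# vapor pressure set are avoided.
PREFERRED_SET = {
    obs.DWDObservationParameter.HOURLY.TEMPERATURE_AIR_200:
        obs.DWDObservationParameterSet.TEMPERATURE_AIR,
}


def resolve_parameter_set(parameter):
    """Return the parameter set a single parameter should be sourced from."""
    return PREFERRED_SET[parameter]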

amotl commented 3 years ago

Dear @MuellerSeb,

thanks for writing in. Regardless of the details Benjamin was already referring to, can I also ask you to use collect_data() instead of collect_safe()? The latter will load all data into memory in order to return a single DataFrame and might well choke on that when loading data for all stations.

This is also what @wetterfrosch has been running into when trying to munge all historical data of the 10-minute resolution into InfluxDB, so #256 is kind of related in this regard.

collect_data() is the optimized variant and will yield multiple DataFrames which can be processed consecutively. That said, we should probably rethink the naming of these main workhorse methods and adjust the documentation accordingly in order to make this clearer to the user.

In order to consecutively save data into a single CSV file using numpy.savetxt, I want to humbly direct you towards [1] or [2], which discuss the possibility of appending data.

So, a rough sketch could be:


import numpy as np

# Get a handle to the data.
observations = obs.DWDObservationData(...)

# Acquire output file in "append" mode.
with open("observations.txt", "ab") as fh:

    # Acquire data and process consecutively, one DataFrame at a time.
    for df in observations.collect_data():

        # Prepare the data as a NumPy array; note that ids, lat and lon
        # must be aligned with the stations contained in this chunk.
        temp = np.array(df.VALUE, dtype=float)
        head = "id, lat, lon, temp"
        np.savetxt(fh, np.array([ids, lat, lon, temp]).T, header=head)

        # Append a newline after each block?
        fh.write(b"\n")

Please let us know if this helps already.

With kind regards, Andreas.

[1] https://stackoverflow.com/questions/54361557/appending-to-file-using-savetxt/54361619
[2] https://stackoverflow.com/questions/27786868/python3-numpy-appending-to-a-file-using-numpy-savetxt

amotl commented 3 years ago

@gutzbenj: Regarding the "number of observations" anomaly - maybe we should completely untangle the bundling of multiple parameters into one DWDObservationParameterSet again? Or do you see another way to get a symmetric number of results from both DWDObservationSites and DWDObservationData in a deterministic manner?

MuellerSeb commented 3 years ago

Thanks for the detailed help! I will try to use the collect_data method.

For the missing observations, I would at least want to get a warning and some information about which stations have missing data. Returning NaN would be most convenient, since that is what could be used for masking. For now I have to compare the output station ID array with the input stations to determine which stations should be neglected. That's a pain!
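For the record, that comparison can be done in one NumPy call (a sketch with illustrative stand-in arrays; in the real script they come from the sites and observations results):

import numpy as np

# Illustrative stand-ins for the requested and returned station ids.
ids = np.array([44, 73, 175, 198])
returned_ids = np.array([44, 175, 198])

# Stations that were requested but returned no data:
missing = np.setdiff1d(ids, returned_ids)
print(missing)  # -> [73]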

Nevertheless: this is a very nice package! I'll watch you grow :wink:

Cheers, Sebastian

gutzbenj commented 3 years ago

Dear Sebastian,

we have just figured out the changes required to make this work with more ease. You'll definitely see some changes towards a more consistent behavior in the next iterations. At the moment we have a couple of things on our minds, as you may recognize from the number of issues šŸ˜šŸ˜€ You can be sure that feedback like yours is very welcome, as we are only working in the direction we guess is best. If you have some spare time we'd also like to welcome you on board, but looking at the number of repositories you are working on, I guess you are already quite busy...

MuellerSeb commented 3 years ago

That's what I call good service :+1: Yeah, I am quite busy, so more support than bug reports is not possible for now. I am working on a kriging example for GSTools, since the next release will support lat-lon data, and I was looking for real measurements. Eventually I stumbled upon this nice package. I can send you a link when it's finally in the documentation, if you want :wink:

Is there a plan for the next release of wetterdienst?

PS: using collect_data didn't improve runtime. :cry:

amotl commented 3 years ago

Hi Sebastian,

thanks for considering Wetterdienst for data acquisition within an example for GSTools. We are already honored that @kmuehlbauer did the same thing within a tutorial [1] for the wradlib package.

The next release [of GSTools] will support lat-lon data, and I was looking for real measurements. Is there a plan for the next release of wetterdienst?

I see. I will be happy to collaborate on how we can bring both worlds together in an optimal manner.

PS: Using collect_data didn't improve runtime.

It won't improve runtime, but it should save memory. If there is more data to process than your machine has memory available, it won't choke on that.

With kind regards, Andreas.

[1] https://docs.wradlib.org/en/stable/notebooks/fileio/wradlib_load_DWD_opendata_volumes.html

gutzbenj commented 3 years ago

Dear @MuellerSeb ,

I'm still working on another issue at the moment but will try to make your example work and create the next release by Sunday.

Meanwhile, as soon as your release is done, we would also like to include the interpolation scheme in our example here [1]. BTW, for the purpose of speed you may want to filter for a state, e.g. Saxony (Sachsen); this will cut the number of stations down a lot while still giving a meaningful result!

sites = obs.DWDObservationSites(
    parameter_set=obs.DWDObservationParameterSet.TEMPERATURE_AIR,
    resolution=obs.DWDObservationResolution.HOURLY,
    period=obs.DWDObservationPeriod.RECENT,
    start_date="2020-06-09 12:00:00",
    end_date="2020-06-09 12:00:00",
)
df = sites.all()
df = df[df.STATE == "Sachsen"]  # or any other known state of Germany

[1] https://github.com/earthobservations/wetterdienst/blob/master/example/climate_observations.ipynb

MuellerSeb commented 3 years ago

This is the example in its current state: https://geostat-framework.readthedocs.io/projects/gstools/en/latlon_support/examples/08_geo_coordinates/01_dwd_krige.html#sphx-glr-examples-08-geo-coordinates-01-dwd-krige-py

If there is an update within the next week, I can still modify the downloading routines.

Current form of the downloading routine (for the record):

import numpy as np


def get_dwd_temperature():
    """Get air temperature from German weather stations on 2020-06-09 12:00."""
    from wetterdienst.dwd import observations as obs  # version 0.10.1

    sites = obs.DWDObservationSites(
        parameter_set=obs.DWDObservationParameterSet.TEMPERATURE_AIR,
        resolution=obs.DWDObservationResolution.HOURLY,
        period=obs.DWDObservationPeriod.RECENT,
        start_date="2020-06-09 12:00:00",
        end_date="2020-06-09 12:00:00",
    )
    df0 = sites.all()
    ids, lat, lon = map(np.array, [df0.STATION_ID, df0.LAT, df0.LON])
    observations = obs.DWDObservationData(
        station_ids=ids,
        parameters=obs.DWDObservationParameter.HOURLY.TEMPERATURE_AIR_200,
        resolution=obs.DWDObservationResolution.HOURLY,
        start_date="2020-06-09 12:00:00",
        end_date="2020-06-09 12:00:00",
    )
    df1 = observations.collect_safe()
    temp, ids1 = map(np.array, [df1.VALUE, df1.STATION_ID])
    select = np.isfinite(temp)  # care about missing values
    sorter = np.argsort(ids)  # care about missing stations
    sort = sorter[np.searchsorted(ids, ids1[select], sorter=sorter)]
    ids, lat, lon, temp = ids[sort], lat[sort], lon[sort], temp[select]
    head = "id, lat, lon, temp"  # add a header to the file
    np.savetxt("temp_obs.txt", np.array([ids, lat, lon, temp]).T, header=head)

amotl commented 3 years ago

Dear Sebastian,

thanks for pushing this further and for sharing the respective GSTools tutorial. So, these additional lines of NumPy mangling fulfill the goal of compensating for missing stations or readings?

temp, ids1 = map(np.array, [df1.VALUE, df1.STATION_ID])
select = np.isfinite(temp)  # care about missing values
sorter = np.argsort(ids)  # care about missing stations
sort = sorter[np.searchsorted(ids, ids1[select], sorter=sorter)]
ids, lat, lon, temp = ids[sort], lat[sort], lon[sort], temp[select]

Indeed, it would be nice if Wetterdienst could include an appropriate compensation at the Pandas level in order to yield the same results - right?

With kind regards, Andreas.

MuellerSeb commented 3 years ago

That's correct.

select = np.isfinite(temp)  # care about missing values

should actually be enough to filter out missing data. But now I have to look up ids1 in the original station ID array, and that always looks ugly with NumPy (it is totally non-obvious what is happening with this argsort-sorter stuff).

Edit: To be precise, I would like ids1 and ids to always be identical!
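A pandas merge on STATION_ID would express the same alignment without the argsort gymnastics (a sketch with miniature stand-ins for the two frames from the script above):

import numpy as np
import pandas as pd

# Miniature stand-ins: df0 as from sites.all(), df1 as from
# observations.collect_safe(); station 73 has no finite value.
df0 = pd.DataFrame({"STATION_ID": [44, 73, 175],
                    "LAT": [52.9, 48.6, 53.0],
                    "LON": [8.2, 13.1, 7.9]})
df1 = pd.DataFrame({"STATION_ID": [44, 73, 175],
                    "VALUE": [21.3, np.nan, 19.8]})

# An inner merge aligned by id keeps only stations with data and
# replaces the searchsorted/argsort bookkeeping entirely.
out = df0.merge(df1.dropna(subset=["VALUE"]), on="STATION_ID")
print(out)  # stations 44 and 175 with LAT, LON and VALUE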

gutzbenj commented 3 years ago

Dear @MuellerSeb ,

with the latest release you will now get a Result that matches your requested stations.

#264

MuellerSeb commented 3 years ago

@gutzbenj cool! Going to test it :wink:

MuellerSeb commented 3 years ago

@gutzbenj could you give me a hint how to properly rewrite the script I posted above?

It's not working out of the box, and with the renaming to "DWDObservationStations" I now run into a numpy TypeError when using "savetxt"... I guess the output was changed, so I got a shape mismatch which results in dtype=object. It is also still taking 15 min to execute, but I think that is still more of a downloading problem.

gutzbenj commented 3 years ago

I'd write it this way:


import numpy as np

from wetterdienst.dwd.observations import DWDObservationStations, DWDObservationData, \
    DWDObservationParameterSet, DWDObservationParameter, DWDObservationResolution, \
    DWDObservationPeriod

stations = DWDObservationStations(
    parameter_set=DWDObservationParameterSet.TEMPERATURE_AIR,
    resolution=DWDObservationResolution.HOURLY,
    period=DWDObservationPeriod.RECENT,
    start_date="2020-06-09 12:00:00",
    end_date="2020-06-09 12:00:00",
)
df = stations.all()
ids, lat, lon = df.loc[:, ["STATION_ID", "LAT", "LON"]].values.T

observations = DWDObservationData(
    station_ids=ids,
    parameters=DWDObservationParameter.HOURLY.TEMPERATURE_AIR_200,
    resolution=DWDObservationResolution.HOURLY,
    start_date="2020-06-09 12:00:00",
    end_date="2020-06-09 12:00:00",
)
temp = observations.all().VALUE.values
ids = np.array(ids, dtype=int)  # convert back
head = "id, lat, lon, temp"
np.savetxt("temp_obs.txt", np.array([ids, lat, lon, temp]).T, header=head)

The TypeError is caused by STATION_ID, which was changed to a string (STATION_ID is now generally mapped as a string instead of the integer it was before).
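The effect is easy to reproduce in isolation (a minimal sketch, independent of wetterdienst): mixing the string ids with floats makes np.array promote everything to strings, which np.savetxt cannot format with its default float specifier.

import numpy as np

ids = np.array(["00044", "00175"])  # STATION_ID comes back as strings now
temp = np.array([21.3, 19.8])

data = np.array([ids, temp]).T
print(data.dtype)  # '<U32': the floats were promoted to strings
# np.savetxt("out.txt", data)  # -> TypeError: dtype/format mismatch

# Converting the ids back to numbers restores a plain float array:
data = np.array([ids.astype(int), temp], dtype=float).T
np.savetxt("out.txt", data, header="id, temp")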

I am not sure that we can drastically improve the download time, as the request has to download hourly data of ~500 stations for the RECENT period (the last 500 days) due to the requested date. This totals roughly 6,000,000 values for temperature alone, while we still have to download the entire parameter set and extract the single parameter. It will also download historical data, since the requested date lies within the last year. You can improve the speed by additionally filtering by period, setting it to RECENT, as long as your requested date lies in that time span (June 2020 is within the 500-day span), e.g.:


observations = DWDObservationData(
    station_ids=ids,
    parameters=DWDObservationParameter.HOURLY.TEMPERATURE_AIR_200,
    resolution=DWDObservationResolution.HOURLY,
    periods=DWDObservationPeriod.RECENT,
    start_date="2020-06-09 12:00:00",
    end_date="2020-06-09 12:00:00",
)

This took about 1:30 minutes on my machine.

MuellerSeb commented 3 years ago

Ahh cool! Thanks!

PS: I wanted to create a settings dict for both calls and stumbled over periods vs. period in DWDObservationData/DWDObservationStations.
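For the record, a sketch of the workaround using the names from the snippets above: one shared dict, with the period passed under the singular or plural keyword each class expects.

from wetterdienst.dwd.observations import DWDObservationStations, \
    DWDObservationData, DWDObservationParameterSet, DWDObservationParameter, \
    DWDObservationResolution, DWDObservationPeriod

# Settings shared by both calls; the period cannot go in here because
# DWDObservationStations wants "period" while DWDObservationData wants
# "periods".
settings = {
    "resolution": DWDObservationResolution.HOURLY,
    "start_date": "2020-06-09 12:00:00",
    "end_date": "2020-06-09 12:00:00",
}

stations = DWDObservationStations(
    parameter_set=DWDObservationParameterSet.TEMPERATURE_AIR,
    period=DWDObservationPeriod.RECENT,
    **settings,
)
observations = DWDObservationData(
    station_ids=stations.all().STATION_ID.values,
    parameters=DWDObservationParameter.HOURLY.TEMPERATURE_AIR_200,
    periods=DWDObservationPeriod.RECENT,
    **settings,
)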

gutzbenj commented 3 years ago

We are aware of this and plan to use only singular argument names in the future! Thanks again for the feedback!