Closed MuellerSeb closed 3 years ago
Dear @MuellerSeb ,
thank you for coming back to us with this.
First of all, several discussions about possible performance issues already exist within #256, #226 and #225. The runtime will probably not improve drastically until we figure out the detailed publishing schedule of the data that DWD packs every once in a while (hourly data is updated roughly once a day, but when exactly?).
Concerning the number of values: the filter applied via the date given in DWDObservationSites() is based on what DWD reports as available data for a specific station and parameter set. This refers to the whole dataset, not only air temperature at 2 m above ground! Besides this, the API currently does not return anything if no values are found. We will probably change this in a future iteration to return NaNs instead, to get consistent output.
Does this roughly answer your question?
Actually I have to answer myself here.
Originally, when I implemented the possibility to request a single parameter instead of a whole parameter set, I assumed that the time series of a parameter that exists in multiple parameter sets would be equal. Unfortunately, when checking the data of station 175 for both the air temperature and vapor pressure datasets (both include the temperature at 2 m above ground), I noticed the following:
data from vapor pressure dataset:
```
STATIONS_ID;MESS_DATUM;QN_8; TT; TD;eor
175;1955010103; 1; -3.0; -8.0;eor
175;1955010106; 1; -3.0; -8.0;eor
175;1955010109; 1; -3.0; -6.0;eor
```
data from the air temperature dataset:
```
STATIONS_ID;MESS_DATUM;QN_9;TT_TU;RF_TU;eor
175;1955010103; 5; -2.7; 70.0;eor
175;1955010104; 5; -2.6; 72.0;eor
175;1955010105; 5; -2.6; 73.0;eor
175;1955010106; 5; -2.6; 73.0;eor
175;1955010107; 5; -2.6; 76.0;eor
175;1955010108; 5; -2.9; 78.0;eor
175;1955010109; 5; -2.9; 79.0;eor
```
Following this, we should create stronger relations between a parameter and a dataset, e.g. air temperature at 2 m should always be taken from the air temperature dataset.
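Such a relation could, for instance, be expressed as a simple preference mapping from parameter to canonical dataset. The names below are purely hypothetical and only sketch the idea, not the actual wetterdienst internals:

```python
# Hypothetical mapping from a single parameter to the dataset
# its time series should preferably be taken from.
PREFERRED_DATASET = {
    "TEMPERATURE_AIR_200": "TEMPERATURE_AIR",
    "PRESSURE_VAPOR": "PRESSURE",
}

def resolve_dataset(parameter: str) -> str:
    """Return the canonical dataset for a single parameter."""
    return PREFERRED_DATASET[parameter]

assert resolve_dataset("TEMPERATURE_AIR_200") == "TEMPERATURE_AIR"
```

With such a mapping, requesting a single parameter would deterministically resolve to one dataset instead of whichever set happens to contain it.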
Dear @MuellerSeb,
thanks for writing in. Regardless of the details Benjamin was already referring to, may I also ask you to use collect_data() instead of collect_safe()? The latter will load all data into memory in order to return a single DataFrame and might well choke on that when loading data for all stations.
This is also what @wetterfrosch has been running into when trying to munge all historical data of the 10-minute resolution into InfluxDB, so #256 is kind of related in this regard.
collect_data() is the optimized variant and will yield multiple DataFrames which can be processed consecutively. That said, we should probably rethink the naming of these main workhorse methods and adjust the documentation accordingly in order to make this clearer to the user.
In order to consecutively save data into a single CSV file using numpy.savetxt, I want to humbly direct you towards [1] or [2], which discuss the possibility to append data.
So, a rough sketch could be:
```python
# Get a handle to the data.
observations = obs.DWDObservationData(...)

# Acquire output file in "append" mode.
with open("observations.txt", "ab") as fh:

    # Acquire data and process consecutively.
    for df in observations.collect_data():

        # Prepare data in NumPy variant.
        temp = np.array(df.VALUE, dtype=float)
        head = "id, lat, lon, temp"
        np.savetxt(fh, np.array([ids, lat, lon, temp]).T, header=head)

        # Append a newline after each block?
        fh.write(b"\n")
```
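That np.savetxt happily appends to an already-open binary handle, rather than truncating it on each call, can be verified in isolation with an in-memory buffer standing in for the file:

```python
import io

import numpy as np

buf = io.BytesIO()  # stands in for the file opened in "ab" mode
for block in (np.arange(3.0), np.arange(3.0, 6.0)):
    # Each call writes one more row to the same handle.
    np.savetxt(buf, block.reshape(1, -1))

lines = buf.getvalue().decode().strip().splitlines()
assert len(lines) == 2  # both blocks ended up in the buffer
```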
Please let us know if this helps already.
With kind regards, Andreas.
[1] https://stackoverflow.com/questions/54361557/appending-to-file-using-savetxt/54361619 [2] https://stackoverflow.com/questions/27786868/python3-numpy-appending-to-a-file-using-numpy-savetxt
@gutzbenj: Regarding the "number of observations" anomaly: maybe we should completely untangle the bundling of multiple parameters into one DWDObservationParameterSet again? Or do you see another option how to get a symmetric number of results from both DWDObservationSites and DWDObservationData in a deterministic manner?
Thanks for the detailed help! I will try to use the collect_data method.
For the missing observations, I would at least want to get a warning and some information on which stations have missing data. Returning nan would be most convenient, since that is what could be used for masking. For now, I have to compare the output station ID array with the input stations to determine which stations should be neglected. That's a pain!
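Until such a warning or NaN filling exists, identifying the neglected stations can at least be done compactly with np.isin. A minimal sketch with made-up station IDs:

```python
import numpy as np

# Hypothetical requested stations vs. stations that actually returned data.
requested = np.array([100, 175, 200, 300])
returned = np.array([100, 200])

# np.isin marks which requested IDs appear in the results;
# the complement is the set of stations without data.
missing = requested[~np.isin(requested, returned)]
assert list(missing) == [175, 300]
```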
Nevertheless: This is a very nice package! Will watch you growing :wink:
Cheers, Sebastian
Dear Sebastian,
we have just figured out the required changes to make this work with more ease. You'll definitely see some changes towards a more concise behavior in the next iterations. At the moment we have a couple of things on our minds, as you may recognize from the number of issues. You can be sure that feedback like yours is very welcome, as we are only working in a direction we guess is the best. If you have some spare time we'd also like to welcome you on board, but looking at the number of repositories you are working on, I guess you are already quite busy...
That's what I call good service :+1: Yeah, I am quite busy, so more support than bug reports is not possible for now. I am working on a kriging example for GSTools, since the next release will support lat-lon data, and I was looking for real measurements. Eventually I stumbled over this nice package. I can send you a link when it's finally in the documentation, if you want :wink:
Is there a plan for the next release of wetterdienst?
PS: using collect_data didn't improve runtime. :cry:
Hi Sebastian,
thanks for considering Wetterdienst for data acquisition within an example for GSTools. We are already honored that @kmuehlbauer did the same thing within a tutorial [1] for the wradlib package.
The next release [of GSTools] will support lat-lon data, and I was looking for real measurements. Is there a plan for the next release of wetterdienst?
I see. I will be happy to collaborate on how we can bring both worlds together in an optimal manner.
PS: Using collect_data didn't improve runtime.
It won't improve runtime, but it should save memory. If there is more data to process than your machine has memory available, it won't choke on that.
With kind regards, Andreas.
[1] https://docs.wradlib.org/en/stable/notebooks/fileio/wradlib_load_DWD_opendata_volumes.html
Dear @MuellerSeb ,
I'm still working on another issue at the moment, but will try to make your example work and create the next release by Sunday.
Meanwhile, as soon as your release is done, we would like to also include the interpolation scheme in our example here [1]. BTW, for the purpose of speed you may want to filter for a state, e.g. Saxony (Sachsen); this cuts down the number of stations by a lot while still giving a meaningful result:
```python
sites = obs.DWDObservationSites(
    parameter_set=obs.DWDObservationParameterSet.TEMPERATURE_AIR,
    resolution=obs.DWDObservationResolution.HOURLY,
    period=obs.DWDObservationPeriod.RECENT,
    start_date="2020-06-09 12:00:00",
    end_date="2020-06-09 12:00:00",
)
df = sites.all()
df = df[df.STATE == "Sachsen"]  # or any other known state of Germany
```
[1] https://github.com/earthobservations/wetterdienst/blob/master/example/climate_observations.ipynb
This is the example in its current state: https://geostat-framework.readthedocs.io/projects/gstools/en/latlon_support/examples/08_geo_coordinates/01_dwd_krige.html#sphx-glr-examples-08-geo-coordinates-01-dwd-krige-py
When there is an update within the next week I can still modify the downloading routines.
Current form of the downloading routine (for the record):
```python
def get_dwd_temperature():
    """Get air temperature from German weather stations for 2020-06-09 12:00."""
    from wetterdienst.dwd import observations as obs  # version 0.10.1

    sites = obs.DWDObservationSites(
        parameter_set=obs.DWDObservationParameterSet.TEMPERATURE_AIR,
        resolution=obs.DWDObservationResolution.HOURLY,
        period=obs.DWDObservationPeriod.RECENT,
        start_date="2020-06-09 12:00:00",
        end_date="2020-06-09 12:00:00",
    )
    df0 = sites.all()
    ids, lat, lon = map(np.array, [df0.STATION_ID, df0.LAT, df0.LON])

    observations = obs.DWDObservationData(
        station_ids=ids,
        parameters=obs.DWDObservationParameter.HOURLY.TEMPERATURE_AIR_200,
        resolution=obs.DWDObservationResolution.HOURLY,
        start_date="2020-06-09 12:00:00",
        end_date="2020-06-09 12:00:00",
    )
    df1 = observations.collect_safe()
    temp, ids1 = map(np.array, [df1.VALUE, df1.STATION_ID])

    select = np.isfinite(temp)  # care about missing values
    sorter = np.argsort(ids)  # care about missing stations
    sort = sorter[np.searchsorted(ids, ids1[select], sorter=sorter)]
    ids, lat, lon, temp = ids[sort], lat[sort], lon[sort], temp[select]

    head = "id, lat, lon, temp"  # add a header to the file
    np.savetxt("temp_obs.txt", np.array([ids, lat, lon, temp]).T, header=head)
```
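The argsort/searchsorted idiom in those last lines is admittedly hard to read. A minimal standalone illustration with made-up IDs shows what it does: look up each returned ID's position within the unsorted request array.

```python
import numpy as np

ids = np.array([30, 10, 20])  # requested station IDs (unsorted)
ids1 = np.array([10, 30])     # IDs that actually returned data

sorter = np.argsort(ids)      # indices that would sort `ids`
# Find each returned ID in the sorted view of `ids`,
# then map those positions back to the original (unsorted) order.
sort = sorter[np.searchsorted(ids, ids1, sorter=sorter)]

assert list(sort) == [1, 0]
assert list(ids[sort]) == [10, 30]  # aligned with ids1
```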
Dear Sebastian,
thanks for being able to push this further and for sharing the respective GSTools tutorial. So, these additional lines of code based on NumPy mangling fulfill the goal to compensate for missing stations or readings?
```python
temp, ids1 = map(np.array, [df1.VALUE, df1.STATION_ID])
select = np.isfinite(temp)  # care about missing values
sorter = np.argsort(ids)  # care about missing stations
sort = sorter[np.searchsorted(ids, ids1[select], sorter=np.argsort(ids))]
ids, lat, lon, temp = ids[sort], lat[sort], lon[sort], temp[select]
```
Indeed, it would be nice if Wetterdienst could include an appropriate compensation on the Pandas level in order to yield the same results - right?
With kind regards, Andreas.
That's correct.
select = np.isfinite(temp)  # care about missing values
should actually be enough to filter out missing data. But now I have to look up ids1 in the original station ID array, and that always looks ugly with NumPy (it is totally unobvious what is happening with this argsort-sorter stuff).
Edit: To be precise, I would like ids1 and ids to always be identical!
Dear @MuellerSeb ,
with the latest release you will now get a result that matches your requested stations.
@gutzbenj cool! Going to test it :wink:
@gutzbenj, could you give me a hint how to properly rewrite the script I posted above?
It's not working out of the box, and after the renaming to "DWDObservationStations" I now run into a NumPy TypeError when using "savetxt"... I guess the output was changed, so I get a shape mismatch which results in dtype=object. It is also still taking 15 min to execute, but I think that is still more of a downloading problem.
I'd write it this way:
```python
import numpy as np
from wetterdienst.dwd.observations import (
    DWDObservationStations,
    DWDObservationData,
    DWDObservationParameterSet,
    DWDObservationParameter,
    DWDObservationResolution,
    DWDObservationPeriod,
)

stations = DWDObservationStations(
    parameter_set=DWDObservationParameterSet.TEMPERATURE_AIR,
    resolution=DWDObservationResolution.HOURLY,
    period=DWDObservationPeriod.RECENT,
    start_date="2020-06-09 12:00:00",
    end_date="2020-06-09 12:00:00",
)
df = stations.all()
ids, lat, lon = df.loc[:, ["STATION_ID", "LAT", "LON"]].values.T

observations = DWDObservationData(
    station_ids=ids,
    parameters=DWDObservationParameter.HOURLY.TEMPERATURE_AIR_200,
    resolution=DWDObservationResolution.HOURLY,
    start_date="2020-06-09 12:00:00",
    end_date="2020-06-09 12:00:00",
)
temp = observations.all().VALUE.values
ids = np.array(ids, dtype=int)  # convert back

head = "id, lat, lon, temp"
np.savetxt("temp_obs.txt", np.array([ids, lat, lon, temp]).T, header=head)
```
The TypeError is caused by STATION_ID, which was changed to a string (STATION_ID is now generally mapped as a string instead of the integer it was before).
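That failure mode can be reproduced standalone: stacking string IDs with float values yields a non-numeric array that np.savetxt cannot format with its default float specifier, while converting the IDs back to integers first makes the array homogeneous again. The IDs below are made up:

```python
import io

import numpy as np

ids = np.array(["00100", "00175"])  # STATION_ID now comes back as strings
temp = np.array([12.3, 14.1])

# Mixing strings and floats produces a non-numeric array,
# which np.savetxt rejects with a TypeError.
try:
    np.savetxt(io.BytesIO(), np.array([ids, temp]).T)
    raised = False
except TypeError:
    raised = True
assert raised

# Converting the IDs back to integers restores a numeric array.
np.savetxt(io.BytesIO(), np.array([ids.astype(int), temp]).T)
```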
I am not sure that we can drastically improve the download time, as the request has to download hourly data of ~500 stations for the RECENT period (the last 500 days) due to the requested date. This totals roughly 6,000,000 values for temperature alone, while we still have to download the entire parameter set and extract the single parameter. It will also download historical data, as the requested date lies in the previous year. You can improve the speed by additionally filtering by period, setting it to RECENT, as long as your requested date lies within that time span (June 2020 is within the 500-day span), e.g.:
```python
observations = DWDObservationData(
    station_ids=ids,
    parameters=DWDObservationParameter.HOURLY.TEMPERATURE_AIR_200,
    resolution=DWDObservationResolution.HOURLY,
    periods=DWDObservationPeriod.RECENT,
    start_date="2020-06-09 12:00:00",
    end_date="2020-06-09 12:00:00",
)
```
This took about 1:30 minutes on my machine.
Ahh cool! Thanks!
PS: I wanted to create a settings dict for both calls and stumbled over periods vs. period in DWDObservationData/Station.
We are aware of this and plan to use only singular argument names in the future! Thanks again for the feedback!
Describe the bug
I experience very long execution times for the following script and get the wrong number of observations:
After ~15 min I get an error, because there are 496 stations but only 494 temp values, so NumPy can't save this to a text file.
Expected behavior
When truncating the station array, it works fine:
Am I doing something wrong?
Cheers, Sebastian