earthobservations / wetterdienst

Open weather data for humans.
https://wetterdienst.readthedocs.io/
MIT License
349 stars, 54 forks

Faulty DWD stations metadata when name contains comma #1257

Closed by joshuarrrrr 5 months ago

joshuarrrrr commented 5 months ago

Describe the bug

When fetching the stations metadata for DWD observations, some stations are missing the state value and part of the name. This seems to affect stations with a comma in their name, typically because a prefix has been moved to the end of the name, for example "Harzburg, Bad". The station name is cut off at the comma, so it becomes "Harzburg", and the state is missing completely.
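To illustrate the failure mode, here is a minimal sketch using only the Python standard library. It assumes the truncation comes from a comma-delimited reader being applied to a fixed-width line. The sample line and the helper `first_csv_field` are made up for illustration; this is not the real DWD file layout and not wetterdienst's actual parsing code.

```python
import csv
import io

# Hypothetical fixed-width line, loosely modelled on the station list.
# The spacing is illustrative only, not the real DWD column layout.
LINE = "02039 Harzburg, Bad                  Niedersachsen"


def first_csv_field(line: str) -> str:
    """Return what a comma-delimited reader sees as the first field."""
    reader = csv.reader(io.StringIO(line))
    return next(reader)[0]


# Everything after the comma ends up in a second field, so the first
# field loses both the "Bad" suffix and the state.
truncated = first_csv_field(LINE)
print(repr(truncated))
```

Any logic that afterwards slices only the first field by column position would see a shortened name and nothing at all in the state column, which matches the symptoms reported here.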

To Reproduce

Using version 0.78.0 with the following code:

import polars as pl
from wetterdienst.provider.dwd.observation import DwdObservationRequest

request = DwdObservationRequest(
    parameter="kl",
    resolution="monthly",
    period="recent",
)
stations = request.all()
stations.df.filter(pl.col("state").is_null())

This returns 30 stations with a cut-off name and a null state value.

Expected behavior

The full station name and state, as present in the DWD station description txt file.

Additional context

I loaded the stations using pandas for comparison with the following code:

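import pandas as pd

# `pl` and `stations` come from the reproduction snippet above.
# Re-read the DWD station description file as fixed-width text and join it
# onto the wetterdienst result to compare names and states side by side.
(
    stations.df.join(
        pl.from_pandas(
            pd.read_fwf(
                "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/monthly/kl/recent/KL_Monatswerte_Beschreibung_Stationen.txt",
                encoding="latin-1",
                skiprows=[0, 1],
                names=stations.df.columns,
                dtype={"station_id": "string"},
            )
        ).select(pl.col("station_id", "name", "state")),
        on="station_id",
    )
    .filter(pl.col("state").is_null())
    .select(["station_id", "name", "state", "name_right", "state_right"])
)
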
This shows which stations are affected by the bug and what the name and state values should be:

station_id  name          state  name_right                     state_right
00314       Kubschütz     null   Kubschütz, Kr. Bautzen         Sachsen
00377       Bergzabern    null   Bergzabern, Bad                Rheinland-Pfalz
00379       Berka         null   Berka, Bad (Flugplatz)         Thüringen
00390       Berleburg     null   Berleburg, Bad-Stünzel         Nordrhein-Westfalen
00755       Buchen        null   Buchen, Kr. Neckar-Odenwald    Baden-Württemberg
01072       Dürkheim      null   Dürkheim, Bad                  Rheinland-Pfalz
01207       Elster        null   Elster, Bad-Sohl               Sachsen
01332       Falkenberg    null   Falkenberg,Kr.Rottal-Inn       Bayern
02039       Harzburg      null   Harzburg, Bad                  Niedersachsen
02171       Hersfeld      null   Hersfeld, Bad                  Hessen
02323       Bevern        null   Bevern, Kr. Holzminden         Niedersachsen
02597       Kissingen     null   Kissingen, Bad                 Bayern
02680       Königshofen   null   Königshofen, Bad               Bayern
02708       Kohlgrub      null   Kohlgrub, Bad (Rosshof)        Bayern
02878       Lauchstädt    null   Lauchstädt, Bad                Sachsen-Anhalt
03028       Lippspringe   null   Lippspringe, Bad               Nordrhein-Westfalen
03034       Lobenstein    null   Lobenstein, Bad                Thüringen
03164       Cölbe         null   Cölbe, Kr. Marburg-Biedenkopf  Hessen
03167       Marienberg    null   Marienberg, Bad                Rheinland-Pfalz
03257       Mergentheim   null   Mergentheim, Bad               Baden-Württemberg
03426       Muskau        null   Muskau, Bad                    Sachsen
03442       Nauheim       null   Nauheim, Bad                   Hessen
03490       Neuenahr      null   Neuenahr, Bad-Ahrweiler        Rheinland-Pfalz
04094       Weingarten    null   Weingarten, Kr. Ravensburg     Baden-Württemberg
04189       Altheim       null   Altheim, Kreis Biberach        Baden-Württemberg
04301       Kreuznach     null   Kreuznach, Bad                 Rheinland-Pfalz
04371       Salzuflen     null   Salzuflen, Bad                 Nordrhein-Westfalen
04813       Staffelstein  null   Staffelstein, Bad-Stublang     Bayern
04841       Steinau       null   Steinau, Kr. Cuxhaven          Niedersachsen
06272       Salzungen     null   Salzungen, Bad-Gräfen-Nitzend  Thüringen

gutzbenj commented 5 months ago

Dear @joshuarrrrr ,

thanks for reporting this issue! You're absolutely right! I mistakenly used read_csv here in combination with column specs, not thinking of actual commas in the data... With the latest commit I changed that and the names should now be correct.
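As a rough sketch of why parsing by column position is robust to commas in the data, the snippet below slices a line purely by character offsets. The `COLSPECS` offsets, the sample line, and the helper `parse_fixed_width` are hypothetical and chosen for illustration; they are not the real DWD column layout and not the code from the actual commit.

```python
# Hypothetical (start, end) character offsets per column, for illustration only.
COLSPECS = {
    "station_id": (0, 5),
    "name": (6, 36),
    "state": (36, 60),
}


def parse_fixed_width(line: str, colspecs: dict = COLSPECS) -> dict:
    """Slice a fixed-width line by character position and strip the padding."""
    return {key: line[start:end].strip() for key, (start, end) in colspecs.items()}


# Build a sample line that matches the hypothetical offsets above.
line = f"{'02039':<5} {'Harzburg, Bad':<30}{'Niedersachsen':<24}"

# The comma is treated as ordinary text because no delimiter is involved.
record = parse_fixed_width(line)
print(record)
```

Because the line is never split on a delimiter, names such as "Falkenberg,Kr.Rottal-Inn" would be handled the same way as names without a comma, as long as they fit within the assumed column width.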

joshuarrrrr commented 5 months ago

Thanks for the swift fix! I can confirm it's now working perfectly as expected in v0.79.0.