datonic / datadex

📦 Serverless and local-first Open Data Platform
http://datadex.datonic.io
MIT License
220 stars 14 forks source link

Add new dataset: Spain Water Reservoirs #61

Closed franloza closed 1 month ago

franloza commented 1 month ago

Why?

I think that adding this dataset to the repository is interesting to allow citizens and researchers to have easy access to information about water availability, usage, and management practices.

What?

Currently, the data about water reservoirs is open via Boletín Hidrológico Nacional, provided by MITECO (Ministerio para la Transición Ecológica y el Reto Demográfico).

However, up-to-date data are not available to download. You have these options to consume these data:

  1. Access the Online version, which does not provide a way to download the data.
  2. Visualise the data in an ArcGIS Dashboard. Although this tool is pretty useful to visualise the data, it does not allow to download the data either.
  3. Download historical data in Microsoft Access format (.mdb). This file includes data only up to 2022, which is not useful to know data from the latest years.

How?

Investigating a little about the second option (ArcGIS Dashboard), I discovered an endpoint that returned data in JSON format when passing a series of parameters: GET https://services-eu1.arcgis.com/RvnYk1PBUJ9rrAuT/arcgis/rest/services/Embalses_Mapa/FeatureServer/0/query. With this endpoint, I could extract all the data from this dataset. To know the parameters accepted by this endpoint, I used the following form.

I wrote the pipeline to be "friendly" with the API and divided the requests in chunks of 1 year, although is possible to extract all the data by passing 1=1 in the where clause (Not very nice to do).

In addition, I configured MITECOArcGisAPI resource to accept a dataset name as it's possible to access other datasets in the ArcGIS server. The complete list can be found here. This server allows to enter to any dataset, take a look at the data and visualise the dataset in a map.

Here's for example the map of water reservoirs in Spain with the most recent data:

image

I'm not sure whether this is a misconfiguration on the server, or they really allowed these resources to be publicly accessed 😅

Testing the data

con.query("DESCRIBE main.spain_water_reservoirs_data").df()

                     column_name   column_type null   key default extra
0                     OBJECTID_1        BIGINT  YES  None    None  None
1                     EMBALSE_ID        BIGINT  YES  None    None  None
2                 embalse_nombre       VARCHAR  YES  None    None  None
3                      ambito_id        BIGINT  YES  None    None  None
4                     agua_total        BIGINT  YES  None    None  None
5                  ambito_nombre       VARCHAR  YES  None    None  None
6                    agua_actual        BIGINT  YES  None    None  None
7                          fecha  TIMESTAMP_NS  YES  None    None  None
8                   boletin_anyo        BIGINT  YES  None    None  None
9                    boletin_num        BIGINT  YES  None    None  None
10                           Uso       VARCHAR  YES  None    None  None
11            Porcentaje_Reserva        DOUBLE  YES  None    None  None
12                          Años       VARCHAR  YES  None    None  None
13                  Orden_Semana        BIGINT  YES  None    None  None
14                     Fecha_str       VARCHAR  YES  None    None  None
15             Variacion_Reserva        BIGINT  YES  None    None  None
16                      ID_Unico       VARCHAR  YES  None    None  None
17                   Estado_Porc       VARCHAR  YES  None    None  None
18     Estado_Porcentaje_Energia       VARCHAR  YES  None    None  None
19                energia_actual        BIGINT  YES  None    None  None
20                 energia_total        BIGINT  YES  None    None  None
21          Variacion_Porcentaje        DOUBLE  YES  None    None  None
22             Variacion_Energia        BIGINT  YES  None    None  None
23            Porcentaje_Energia        DOUBLE  YES  None    None  None
24  Variacion_Porcentaje_Energia        DOUBLE  YES  None    None  None
25                  embalse_id_1        BIGINT  YES  None    None  None
26                electrico_flag        BIGINT  YES  None    None  None
27                      OBJECTID        BIGINT  YES  None    None  None
28                      ORIG_FID        BIGINT  YES  None    None  None
con.query("SELECT * FROM main.spain_water_reservoirs_data limit 1").df().iloc[0]

OBJECTID_1                                     8534
EMBALSE_ID                                       94
embalse_nombre                               Cenajo
ambito_id                                         7
agua_total                                      472
ambito_nombre                                Segura
agua_actual                                      35
fecha                           1988-01-05 00:00:00
boletin_anyo                                   1988
boletin_num                                       1
Uso                                      Consuntivo
Porcentaje_Reserva                         7.415254
Años                                    1987 - 1988
Orden_Semana                                     15
Fecha_str                                        01
Variacion_Reserva                                35
ID_Unico                               9405/01/1988
Estado_Porc                                Positivo
Estado_Porcentaje_Energia                  Positivo
energia_actual                                    8
energia_total                                   103
Variacion_Porcentaje                       7.415254
Variacion_Energia                                 8
Porcentaje_Energia                          7.76699
Variacion_Porcentaje_Energia                7.76699
embalse_id_1                                    NaN
electrico_flag                                  NaN
OBJECTID                                        NaN
ORIG_FID                                        NaN
Name: 0, dtype: object
con.query("SELECT ambito_nombre, avg(agua_actual) as avg_water FROM main.spain_water_reservoirs_data group by 1 order by 2 desc limit 10")

┌──────────────────────────────┬────────────────────┐
│        ambito_nombre         │     avg_water      │
│           varchar            │       double       │
├──────────────────────────────┼────────────────────┤
│ Duero                        │ 139.04117934876967 │
│ Guadiana                     │ 131.52262934327538 │
│ Tajo                         │  119.4253619150748 │
│ Guadalete-Barbate            │  99.01707542194093 │
│ Ebro                         │  76.85396440479799 │
│ Guadalquivir                 │  70.91350053469462 │
│ Miño - Sil                   │  68.73068529026627 │
│ Cuencas Internas de Cataluña │  52.60488114104596 │
│ Cuenca Mediterránea Andaluza │  50.49281032964883 │
│ Júcar                        │  49.94803268266265 │
├──────────────────────────────┴────────────────────┤
│ 10 rows                                 2 columns │
└───────────────────────────────────────────────────┘
davidgasquez commented 1 month ago

Woah @franloza, this is awesome! I was actually looking for this kind of data a while back.

Thanks so much for taking the time to write such a great PR description.

I wrote the pipeline to be "friendly" with the API and divided the requests in chunks of 1 year, although is possible to extract all the data by passing 1=1 in the where clause (Not very nice to do).

Nice! I think that should be ok, although API stability varies a lot in these sort of projects (I was just dealing with issues around the AEMET API as it seems be down or erroring and has been like that for the last few days).

I'm not sure whether this is a misconfiguration on the server, or they really allowed these resources to be publicly accessed

I think these must be public but not 100% sure! :see_no_evil:

davidgasquez commented 1 month ago

Dataset now on HuggingFace!

Also did a silly query-as-avg-from-'https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fdavidgasquez%2Fspain_water_reservoirs_data%2Fresolve%2Fmain%2Fdata%2Fmain%2000000%20of%2000001.parquet'-group-by-1-order-by-1-desc-limit-10~) to see that everything worked.

image

Thanks again! Will spend more time with this dataset and perhaps add a dbt model to smooth those variable names.

franloza commented 1 month ago

Anytime! Happy to contribute 😃

NOTE: TIL about DuckDB Shell. What a nice too to share queries on online datasets 💯