INTERSTAT / Statistics-Contextualized

Models for the dissemination of contextualized statistical data

SEP data workflow: air quality data #17

Open FranckCo opened 2 years ago

FranckCo commented 2 years ago

Design and implement data workflow.

francescadag commented 2 years ago

Italian air quality data

Data extraction:

Step 1: Data source website.
Step 2: Select the DATA panel; data are organized in a set of tables.
Step 3: Scroll to the requested table, named "Tabella 1 – PM10. Stazioni di monitoraggio: dati e parametri statistici per la valutazione della qualità dell'aria (2019)" (Table 1 – PM10. Monitoring stations: data and statistical parameters for the assessment of air quality, 2019).
Step 4: Use the download link available at the bottom left, at the end of the table.

These steps were repeated to extract the datasets for the other requested pollutants:

AMBIENT AIR QUALITY: NITROGEN DIOXIDE NO2
AMBIENT AIR QUALITY: TROPOSPHERIC OZONE O3
AMBIENT AIR QUALITY: PARTICULATE PM2.5

Data transformation:

The downloaded file has the following data structure: "Regione", "Provincia", "Comune", "Nome della stazione", "Tipo di zona", "Tipo di stazione", "Giorni di superamento di 50 µg/m3", "Valore medio annuo³ [µg/m³]", "Rendimento [%]", "Rispetta copertura minima", "sufficiente distribuzione temporale nell'anno", "numero_dati_validi", "TIPO DI DATI 4", "Codice zona", "Nome zona".

  1. Data need to be filtered in order to comply with the requested data structure.
  2. The NUTS3 variable has been added through a transformation of the municipality_id variable, using data from the ISTAT LAU archive.
  3. The metadata provided for the NUTS3 transformation need to be downloaded and merged. The metadata are referenced as a time series; the variable for year 2019 has been used in the script.
  4. Metadata regarding pollutant type, data reference time and aggregation type have been added to the data file.

The data transformation phase was applied only to the dataset for the PM10 pollutant. Transformation script in R: processing_ETL_AIR.R.txt
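The transformation steps above can be sketched as follows (the real pipeline is the R script processing_ETL_AIR.R.txt; the column names, the sample row and the LAU-to-NUTS3 lookup below are illustrative assumptions, not the actual file layout):

```python
# Sketch of the transformation steps above. Column names, the sample row and
# the LAU-to-NUTS3 lookup are illustrative assumptions.

# A row as extracted from the ISPRA PM10 table (names simplified/translated).
raw_row = {
    "municipality_id": "058091",        # ISTAT municipality (LAU) code
    "station_name": "Roma - Cinecitta",
    "annual_mean": 27.0,                # "Valore medio annuo [µg/m³]"
    "zone_type": "urban",               # not in the requested structure
}

# Lookup built from the Eurostat NUTS3/LAU metadata, 2019 edition (steps 2-3).
lau_to_nuts3 = {"058091": "ITI43"}

def transform(row, lau_index, pollutant, year):
    """Filter to the requested structure, add NUTS3 and constant metadata."""
    return {
        "lauCode": row["municipality_id"],
        "NUTS3": lau_index.get(row["municipality_id"]),  # step 2: LAU -> NUTS3
        "StationName": row["station_name"],
        "AQValue": row["annual_mean"],
        # Step 4: constants added as metadata columns.
        "Pollutant": pollutant,
        "ReferenceYear": year,
        "AggregationType": "Annual mean / 1 calendar year",
    }
```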

Data loading: the extracted datasets were uploaded to the FTP area of the project:

Possible integration with the French air pollution datasets

The French PM10 dataset, taken from the European Environment Agency and uploaded to the FTP server, contains geographic coordinates in its initial version; it has been enriched with the municipality value through a Java script using the specific service/API. The resulting file is available in the FTP area of the project. Once we receive confirmation that the extracted French dataset is correct, we will extract the datasets for the other pollutants in the same way, and the municipality field can be added to them as well.

FranckCo commented 2 years ago

Here is the direct link to the data for France, 2019 and PM10. Now we have to find a way to automate the "Download CSV" button.

Regarding the Nominatim API for geocoding, the problem is that it does not return the LAU (commune code), only the postal code, which is not the same (see example).

Also, the file on FTP does not seem to be UTF-8.

pafrance commented 2 years ago

Several themes regarding air quality are mixed together on the ISPRA website, and we don't actually know whether ISPRA has data for every wanted pollutant. Here are the download links we could find:

PM10: https://annuario.isprambiente.it/sites/default/files/sys_ind_files/indicatori_ada/448/TABELLA%201_PM10_2019_rev.xlsx
PM2.5: https://annuario.isprambiente.it/sites/default/files/sys_ind_files/indicatori_ada/452/TABELLA%201_PM25_2019_rev_0.xlsx
NO2: https://annuario.isprambiente.it/sites/default/files/sys_ind_files/indicatori_ada/450/TABELLA1_NO2_2019.xlsx
O3: https://annuario.isprambiente.it/sites/default/files/sys_ind_files/indicatori_ada/451/TABELLA%201_O3_SALUTE_2019.xlsx

These URLs can hardly be generated from rules. The only standard part is the prefix https://annuario.isprambiente.it/sites/default/files/sys_ind_files/indicatori_ada/{theme_number}, where theme_number is loosely related to the pollutant and not to any standard classification. The last part of the URL reads loosely as /TABELLA[ ]1_{pollutant}_{referenced_year}[_revision].xlsx. I don't know whether this information can be used to construct a download link on the fly, but it is at least useful for extracting metadata about the dataset, namely the pollutant name and the reference year, which in turn are not present as columns in the dataset.
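As a sketch, the loose filename pattern can at least be used to recover the pollutant token and the reference year from a download URL. The regular expression below is an assumption inferred from the four links above, not an official naming rule:

```python
import re

# The URL tail cannot be generated reliably, but the pollutant token and the
# reference year can be parsed back out of the filename (they are not present
# as columns in the data). Pattern inferred from the four links above.
FILENAME_RE = re.compile(r"TABELLA ?1_?(?P<pollutant>[A-Z0-9_]+?)_(?P<year>\d{4})")

def metadata_from_url(url):
    """Extract the pollutant token and reference year from an ISPRA URL."""
    name = url.rsplit("/", 1)[-1].replace("%20", " ")
    m = FILENAME_RE.search(name)
    if m is None:
        return None
    # Note: the token may carry extra qualifiers, e.g. "O3_SALUTE" for ozone.
    return {"pollutant": m.group("pollutant"), "year": int(m.group("year"))}
```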

Dataset structure also differs from file to file. Mappings will be provided shortly.

NB: If we just wanted to test the pipeline process, we could switch to other, easier-to-manage data sources, such as the same EEA portal used to fetch the French data. However, different data sources like ISPRA give us a better use case for testing the data integration process as well.

pafrance commented 2 years ago

About ISPRA data file harmonization: the dataset structure differs from file to file, so the actual mapping needs a little reworking. The mapping is explained in the attached file Air Pollution meta.xlsx.

There are three sections:

  1. A set of variables of interest to be extracted from the files, when available. The desired column name is stated in the leftmost column, complete with a description; the actual Italian column name is provided for each of the input files.

CSV final column name | Description
-- | --
Region | Region code
Department | Department code
Municipality | Municipality code
StationId | Sensor station id
RegionName | Region name
DepartmentName | Department name
MunicipalityName | Municipality name
StationName | Sensor station name
Aqvalue | Average annual value

  2. A set of variables that can be ignored completely. These columns are described for clarity, but are not important and can be left out of the process.
  3. A last set of values that are constants, given as metadata in the file or in the metadata description, which are not always provided.

CSV final column name | Description
-- | --
Pollutant | Pollutant notation - constant extracted from the filename
PollutantDesc | Pollutant description - extracted from the standard AQ model definition
AggregationType | Aggregation level code - extracted from the dataset
AggregationDesc | Aggregation level definition - extracted from the standard AQ model definition
ReferenceYear | Sampling year - constant extracted from the filename
MeasureUnit | Measure unit - constant extracted from the column description
Country | Country - constant extracted from the web source
NUTS3 | Department level (NUTS) - use the NUTS3_LAU_IT metadata for department-level join and conversion
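A minimal sketch of how the first mapping section can be applied in code. The Italian header names here are illustrative; the authoritative per-file mapping is in the attached Air Pollution meta.xlsx:

```python
# Sketch of the "variables of interest" mapping section applied to a row.
# The Italian headers below are illustrative assumptions; the authoritative
# per-file mapping is in Air Pollution meta.xlsx.
PM10_COLUMN_MAP = {
    "Regione": "RegionName",
    "Provincia": "DepartmentName",
    "Comune": "MunicipalityName",
    "Nome della stazione": "StationName",
    "Valore medio annuo": "Aqvalue",
}

def harmonize(row, column_map):
    """Keep and rename only the variables of interest; drop everything else."""
    return {column_map[k]: v for k, v in row.items() if k in column_map}
```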

francescadag commented 2 years ago

@FranckCo To automate the download, I saw that an API is called to get the resource, with this format: http://aidef.apps.eea.europa.eu/tools/download?download_query=http://aidef.apps.eea.europa.eu/?source={"query":{"bool":{"must":[{"term":{"CountryOrTerritory":"France"}},{"term":{"ReportingYear":"2019"}},{"term":{"Pollutant":"Particulate+matter+<+10+µm+(aerosol)"}}]}}}&download_format=csv. The "source" field is a JSON that can easily be edited to get the other files.
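A sketch of parameterizing that URL by swapping the three "term" values in the "source" JSON. The endpoint and query shape are taken from the link above; the exact percent-encoding the endpoint accepts is an assumption:

```python
import json
from urllib.parse import quote

# Sketch of building the EEA download URL from three parameters. The endpoint
# and query shape come from the observed URL; the encoding is an assumption.
BASE = "http://aidef.apps.eea.europa.eu/tools/download"

def download_url(country, year, pollutant):
    query = {"query": {"bool": {"must": [
        {"term": {"CountryOrTerritory": country}},
        {"term": {"ReportingYear": str(year)}},
        {"term": {"Pollutant": pollutant}},
    ]}}}
    source = ("http://aidef.apps.eea.europa.eu/?source="
              + json.dumps(query, separators=(",", ":")))
    return BASE + "?download_query=" + quote(source, safe="") + "&download_format=csv"
```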

Regarding the geocoding service: in addition to the Municipality field, is the LAU also needed? I could add it; I just need the table of French LAUs, and these are the ones found on the Eurostat website.

I checked the files on the server, converted the non-UTF-8 ones and reorganized them; the pollutant files are now here. The Italian ones are in CSV format; if necessary I will also upload the Excel versions.

pafrance commented 2 years ago

Dear Franck, another option is to use a European source both for the territorial metadata (NUTS3/LAU) and for the air pollution dataset retrieval. We tested Francesca's URL above with "Italy" instead of "France" and it works. This is a huge simplification for testing the automated pipeline. In the meantime we will use ontology integration for the PM10 dataset to link the Italian ISPRA and the French EEA datasets only. We can compare the outcomes later.

pafrance commented 2 years ago

Update regarding territorial representation mismatch in AIR Quality data:

  1. IT: the ISPRA source uses ISTAT codes for administrative territorial unit representation.
  2. FR: the EEA source uses geographic coordinates.

These codes must be converted into NUTS3 + LAU via a custom ETL for each file.

  1. IT: metadata from Eurostat have been provided for administrative code conversion.
  2. FR: a service for translating geographic coordinates into administrative names has been suggested too.

Both files must have two additional columns storing the actual NUTS3 and LAU codes. The ontological framework then maps NUTS3 and LAU codes accordingly.

francescadag commented 2 years ago

Added the LAU and NUTS3 fields to the French air pollution datasets using a Python script (which was sent to Franck). It obtains the datasets using the API described here; the municipality name field is added using the following service, and the corresponding LAU and NUTS3 codes are then derived from it using the Eurostat Excel file (of 2019). The files obtained are on the FTP server here.
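A rough sketch of this enrichment flow (the actual script was sent to Franck; here the geocoding call is stubbed out and the Eurostat lookup rows are illustrative placeholders, not real data):

```python
# Rough sketch of the French enrichment flow: coordinates -> municipality name
# -> (LAU, NUTS3) via the Eurostat table. Geocoding is stubbed; the lookup
# rows are illustrative placeholders, not real Eurostat data.

def reverse_geocode(lat, lon):
    """Stand-in for the geocoding service returning a municipality name."""
    return "Paris"  # a real implementation would call the service with lat/lon

# Illustrative extract of the 2019 Eurostat LAU table: name -> (LAU, NUTS3).
EUROSTAT_LAU_2019 = {"Paris": ("75056", "FR101")}

def enrich(record):
    """Add Municipality, LAU and NUTS3 fields to one observation record."""
    name = reverse_geocode(record["Latitude"], record["Longitude"])
    lau, nuts3 = EUROSTAT_LAU_2019.get(name, (None, None))
    return {**record, "Municipality": name, "LAU": lau, "NUTS3": nuts3}
```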

pafrance commented 2 years ago

SEP process steps

Step 1: Data acquisition

Source files are loaded from the FTP repository into a local MySQL database.

FR source file(s):
  - FR PM10 pollution: (this file has been previously processed, adding LAU and NUTS3 via a dedicated service)
  - FR census file:

IT source file(s):
  - IT PM10 pollution:
  - IT census file: Istat website

Metadata:
  - NUTS3_LAU taken from Eurostat
  - Air pollution codelists taken from the EEA (example)
  - Age classes and sex codelists merged between IT and FR alike

Common data model (table or file structure), as specified in https://github.com/INTERSTAT/Statistics-Contextualized/issues/17#issuecomment-1029201619:

Variable_name | Description | data_type
-- | -- | --
StationLocalId | Sensor station ID | text
pollutant | pollutant code (codelist from the EEA website) | text
COUNTRY | country of reference | text
Reporting_YEAR | year of reference | number
AggregationType | data aggregation (codelist from the EEA website) | text
AQValue | observed value | number
id | observation key | number
lauCode | Municipality code (can be easily linked to territory metadata like NUTS3 and geonames) | text
concentration | codelist of unit of measure | text
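The common data model can also be expressed as a minimal row check, mapping "text" to str and "number" to int/float (field names follow the structure above; the check itself is only an illustration, not part of the pipeline):

```python
# Minimal sketch of the common data model as a row type check ("text" -> str,
# "number" -> int/float). Illustrative only, not part of the pipeline.
COMMON_MODEL = {
    "StationLocalId": (str,),
    "pollutant": (str,),
    "COUNTRY": (str,),
    "Reporting_YEAR": (int,),
    "AggregationType": (str,),
    "AQValue": (int, float),
    "id": (int,),
    "lauCode": (str,),
    "concentration": (str,),
}

def conforms(row):
    """True if every model field is present with an acceptable type."""
    return all(k in row and isinstance(row[k], t) for k, t in COMMON_MODEL.items())
```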

Input files are uploaded from the FTP source as is. Harmonization has been done via SQL union queries wrapped in a single view:

```sql
CREATE ALGORITHM=UNDEFINED DEFINER=root@localhost SQL SECURITY DEFINER
VIEW v_observations AS
SELECT
    obs.StationLocalId AS StationLocalId,
    obs.pollutant AS pollutant,
    obs.COUNTRY AS COUNTRY,
    obs.Reporting_YEAR AS Reporting_YEAR,
    obs.AggregationType AS AggregationType,
    obs.AQValue AS AQValue,
    obs.spId AS spId,
    ROW_NUMBER() OVER (ORDER BY obs.StationLocalId) AS id,
    obs.lauCode AS lauCode,
    obs.concentration AS concentration
FROM (
    -- Italian data, annual mean measure
    SELECT
        SUBSTR(air_pollution_it.MUNICIPALITY_ID, LENGTH(air_pollution_it.province_id) + 1) AS lauCode,
        air_pollution_it.station_code AS StationLocalId,
        'PM10' AS pollutant,
        'IT' AS COUNTRY,
        2019 AS Reporting_YEAR,
        'Annual mean / 1 calendar year' AS AggregationType,
        air_pollution_it.VALORE_MEDIO_ANNUO AS AQValue,
        NULL AS spId,
        'µg/m3' AS concentration
    FROM air_pollution_it
    UNION ALL
    -- Italian data, days-of-exceedance measure
    SELECT
        SUBSTR(air_pollution_it.MUNICIPALITY_ID, LENGTH(air_pollution_it.province_id) + 1) AS lauCode,
        air_pollution_it.station_code AS StationLocalId,
        'PM10' AS pollutant,
        'IT' AS COUNTRY,
        2019 AS Reporting_YEAR,
        '1 day exceed 180' AS AggregationType,
        air_pollution_it.Giorni_di_superamento AS AQValue,
        NULL AS spId,
        NULL AS concentration
    FROM air_pollution_it
    UNION ALL
    -- French data, joined to pollutant and country codelists
    SELECT
        a.LAU AS lauCode,
        a.StationLocalId AS StationLocalId,
        b.Notation AS Pollutant,
        c.cod AS COUNTRY,
        a.ReportingYear AS Reporting_YEAR,
        a.AggregationType AS AggregationType,
        a.AQValue AS AQValue,
        a.SamplingPointLocalId AS spId,
        a.Unit AS concentration
    FROM air_pollution_fr a
    JOIN pollutant b ON a.Pollutant = b.Label
    JOIN country c ON a.CountryOrTerritory = c.description
) obs;
```

This query has three nested subqueries: one for the French data and two for the Italian data, each for a single measure type. The resultset has the form of a data cube. Codelists and measures have been normalized and linked to their corresponding metadata.

```sql
CREATE ALGORITHM=UNDEFINED DEFINER=root@localhost SQL SECURITY DEFINER
VIEW v_cens_obs AS
SELECT
    ROW_NUMBER() OVER (ORDER BY d.nuts3, d.lau, d.age) AS id,
    d.lau AS lau,
    d.nuts3 AS nuts3,
    d.sex AS sex,
    d.age AS age,
    d.population AS population,
    d.country AS country
FROM (
    -- Italian census data (LAU codes left-padded to six digits)
    SELECT
        LPAD(c.lau, 6, 0) AS lau,
        c.nuts3 AS nuts3,
        c.sex AS sex,
        c.age AS age,
        c.population AS population,
        c.country AS country
    FROM cens_it c
    UNION ALL
    -- French census data, joined to NUTS3 via the Eurostat LAU table
    SELECT
        a.lau AS lau,
        b.NUTS_3_CODE AS nuts3,
        a.sex AS sex,
        a.age AS age,
        a.population AS population,
        'FR' AS country
    FROM cens_fr a
    JOIN lau_nuts3 b ON a.lau = b.LAU_CODE
) d;
```

This query integrates the files from the different countries into one single harmonized view.

MySQL is used as the data repository for Monolith, the tool used for the mappings. This tool associates mappings with SQL queries on the MySQL database, so that SPARQL queries can be translated automatically into SQL queries. Monolith can export the mappings in XML format, although this is hardly reusable elsewhere.

The SPARQL resultset can be formatted as CSV, JSON or RDF and sent to the subsequent stages of the pipeline.