INTERSTAT / Statistics-Contextualized

Models for the dissemination of contextualized statistical data
6 stars 3 forks source link

SEP data workflow: Italian Air Pollution datasets #18

Closed pafrance closed 2 years ago

pafrance commented 2 years ago

Data extraction Step1: data source website Step2: Select DATA panel. Data are organized in a set of tables
Step3: Scroll to the requested table, named “Tabella 1 – PM10. Stazioni di monitoraggio: dati e parametri statistici per la valutazione della qualità dell'aria (2019)”
Step4: Download link available on the left bottom at the end of the table . Downloaded data are in xls format The downloaded file is not compliant with the required Data Structure.

Data transformation The downloaded file has the following Data Structure: “Regione”,”Provincia”,”Comune”,”Nome della stazione Tipo di zona”,”Tipo di stazione”,”Giorni di superamento di 50 µg/m3”,”Valore medio annuo³ [µg/m³]”,”Rendimento [%]”,”Rispetta copertura minima”,”sufficiente distribuzione temporale nell'anno”,”numero_dati_validi”,”TIPO DI DATI 4”,”Codice zona”,”Nome zona”

  1. Data need to be filtered in order to be compliant to the requested Data Structure,
  2. NUTS3 variable has been added through a transformation from municipality_id Variable, using data from ISTAT LAU archive
  3. Provided metadata for NUTS3 transformation need to be downloaded and merged.
    Metadata are referenced in a time series and Variable regarding year 2019 has been used in the script.
  4. Metadata regarding pollutant type, data reference time and aggregation type have been added in the datafile.

Data Load The transformed file has been uploaded into INTERSTAT GraphDB repository sep-test GraphDB allows direct link to the resources by a permalink, but the raw data needs a little reworking to be accessed directly.

Further data files available Same procedure can be used to import other data from Data Source Website AMBIENT AIR QUALITY: NITROGEN DIOXIDE NO2 AMBIENT AIR QUALITY: TROPOSPHERIC OZONE O3 AMBIENT AIR QUALITY: PARTICULATE PM2.5 These files have not been uploaded to GraphDB repository yet

Transformation script in R language processing_ETL_AIR.R.txt

francescadag commented 2 years ago

The Italian Air Pollution dataset concerning PM 10 is available on the FTP area of the project. The AMBIENT AIR QUALITY: NITROGEN DIOXIDE NO2 is available on the FTP area of the project. The AMBIENT AIR QUALITY: TROPOSPHERIC OZONE O3 is available on the FTP area of the project. The AMBIENT AIR QUALITY: PARTICULATE PM2.5 is available on the FTP area of the project. All files are available in tabular format.

pafrance commented 2 years ago

French Data are available from European Environmental Agency. We downloaded a sample of data (PM10) now available on the ftp area of the project. Please, we need a confirmation wether we got the right datasets or not.

The source data model is consistent with respect to both the Italian and French data except for the data conversion of geo spatial coordinates to administrative units. But the target data model used for context broker ingestion is not compatible, as it is used for smart data models. We would like to explore this issue separately (Issue 21) I would like to leave this issue for data preparation and integration, because dataset can be integrated as they are, but if we consider smart model, there's a need to extract and convert data to the target model, adding a step in the process.

francescadag commented 2 years ago

The French dataset about the PM10 mentioned in the previous post and which was uploaded to the FTP server in its initial version contains the geographic coordinates; it has been enriched with the Municipality value through a script in java using the specific service/API. The file obtained is available on the FTP area of the project. After receiving the confirmation that the extrapolated French dataset is correct, in the same way we will extrapolate also the datasets of the other pollutants and I will also add to them the field that refers to the Municipality.