FranckCo opened 2 years ago
Italian Air quality data

Data extraction

Step 1: Go to the data source website.
Step 2: Select the DATA panel; the data are organized in a set of tables.
Step 3: Scroll to the requested table, named "Tabella 1 – PM10. Stazioni di monitoraggio: dati e parametri statistici per la valutazione della qualità dell'aria (2019)" ("Table 1 – PM10. Monitoring stations: data and statistical parameters for air quality assessment (2019)").
Step 4: Use the download link available at the bottom left, at the end of the table.

These steps were repeated to extract the datasets for the other requested pollutants:
AMBIENT AIR QUALITY: NITROGEN DIOXIDE NO2
AMBIENT AIR QUALITY: TROPOSPHERIC OZONE O3
AMBIENT AIR QUALITY: PARTICULATE PM2.5
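The manual steps above could be automated once a file's URL is known. A minimal sketch (the URL is the PM10 download link found on the ISPRA site; the helper names are illustrative):

```python
import urllib.request
from pathlib import Path
from urllib.parse import unquote

# PM10 download link from the ISPRA annuario site (2019 table).
PM10_URL = (
    "https://annuario.isprambiente.it/sites/default/files/sys_ind_files/"
    "indicatori_ada/448/TABELLA%201_PM10_2019_rev.xlsx"
)

def file_name_from_url(url: str) -> str:
    """Derive a local file name from the last, percent-decoded URL segment."""
    return unquote(url.rsplit("/", 1)[-1])

def download(url: str, dest_dir: str = ".") -> Path:
    """Fetch the remote table file (requires network access)."""
    target = Path(dest_dir) / file_name_from_url(url)
    urllib.request.urlretrieve(url, target)
    return target
```

Calling `download(PM10_URL)` would save the spreadsheet under its original (decoded) name, which keeps the four pollutant files distinguishable.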
Data transformation

The downloaded file has the following data structure: "Regione", "Provincia", "Comune", "Nome della stazione", "Tipo di zona", "Tipo di stazione", "Giorni di superamento di 50 µg/m³", "Valore medio annuo [µg/m³]", "Rendimento [%]", "Rispetta copertura minima", "Sufficiente distribuzione temporale nell'anno", "Numero_dati_validi", "Tipo di dati", "Codice zona", "Nome zona". In English: region, province, municipality, station name, zone type, station type, number of days exceeding 50 µg/m³, annual mean value [µg/m³], data capture [%], meets minimum coverage, sufficient temporal distribution over the year, number of valid data points, data type, zone code, zone name.
The data transformation phase was applied only to the dataset for the PM10 pollutant. Transformation script, in R: processing_ETL_AIR.R.txt
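The R script itself is attached rather than quoted, so purely as a hedged illustration, here is the kind of numeric cleaning such a transformation usually needs on the statistical columns listed above (the comma decimal separator and the placeholder values are assumptions about the file, not facts taken from the script):

```python
from typing import Optional

# Hypothetical cleaning step: the ISPRA tables are assumed to use the
# Italian number format ('.' thousands separator, ',' decimal mark) and
# a few placeholder strings for missing cells.
def parse_it_number(value: str) -> Optional[float]:
    """Turn e.g. '23,5' into 23.5; return None for empty/placeholder cells."""
    value = value.strip()
    if value in ("", "-", "n.d."):  # assumed placeholders
        return None
    return float(value.replace(".", "").replace(",", "."))
```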
Data loading

The extracted datasets were uploaded to the FTP area of the project.
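The loading step could be scripted as well; a sketch using the standard library FTP client, where the host, credentials, and remote directory are placeholders for the project's FTP area:

```python
from ftplib import FTP
from pathlib import Path

def remote_name(local_path: str) -> str:
    """Name to store on the server: the bare file name, no local directories."""
    return Path(local_path).name

def upload(files: list, host: str, user: str, password: str,
           remote_dir: str = "/air_quality") -> list:
    """Push each file to the project FTP area; returns the stored names.

    Connection details are illustrative, not the project's actual server.
    """
    stored = []
    with FTP(host) as ftp:
        ftp.login(user=user, passwd=password)
        ftp.cwd(remote_dir)
        for f in files:
            name = remote_name(f)
            with open(f, "rb") as fh:
                ftp.storbinary(f"STOR {name}", fh)
            stored.append(name)
    return stored
```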
Possible integration with the French air pollution datasets

The French PM10 dataset, taken from the European Environment Agency and uploaded to the FTP server, contains the geographic coordinates in its initial version; it has been enriched with the municipality value through a Java script using the specific service/API. The resulting file is available in the FTP area of the project. Once we receive confirmation that the enriched French dataset is correct, we will extract the datasets for the other pollutants in the same way and add the municipality field to them as well.
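The Java enrichment script is only referenced, not shown. A minimal sketch of the same idea, with the reverse-geocoding service injected as a function so the concrete API remains swappable (all names here are illustrative):

```python
from typing import Callable, Iterable, List

# Hypothetical enrichment step: add a municipality column to rows that
# carry coordinates. The project's actual implementation is in Java; the
# geocoder is passed in here so any reverse-geocoding service can be used.
def enrich(rows: Iterable[dict],
           geocode: Callable[[float, float], str]) -> List[dict]:
    out = []
    for row in rows:
        enriched = dict(row)
        enriched["municipality"] = geocode(row["lat"], row["lon"])
        out.append(enriched)
    return out
```

With the geocoder injected, the enrichment logic can be tested offline against a stub before pointing it at a real service.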
Here is the direct link to the data for France, 2019 and PM10. Now we have to find a way to automate the "Download CSV" button.
Regarding the Nominatim API for geocoding, the problem is that it does not return the LAU (commune) code; it returns only the postal code, which is not the same thing (see example).
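For reference, the relevant part of a Nominatim `/reverse` (format=jsonv2) response is its `address` object: it carries a postcode and a settlement name, but no LAU code, so the commune would have to be resolved separately (for instance by a name-based join against the LAU table). A sketch, with the response hard-coded as an illustrative example rather than a recorded API call:

```python
from typing import Optional

# Shape of the `address` object in a Nominatim /reverse response.
# The concrete values below are illustrative, not real recorded output.
SAMPLE_ADDRESS = {
    "village": "Ville-sur-Illon",   # may instead appear as "city" or "town"
    "county": "Vosges",
    "postcode": "88270",            # postal code, NOT the LAU commune code
    "country_code": "fr",
}

def municipality_name(address: dict) -> Optional[str]:
    """Pick the most specific settlement name Nominatim provides."""
    for key in ("city", "town", "village", "municipality"):
        if key in address:
            return address[key]
    return None
```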
Also, the file on the FTP server does not seem to be UTF-8 encoded.
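Until the actual encoding is confirmed, a defensive read could try UTF-8 first and fall back to a Western-European legacy code page (cp1252/latin-1 are guesses based on the accented Italian and French values, not confirmed facts):

```python
# Try UTF-8 first; fall back to assumed legacy encodings. latin-1 accepts
# any byte sequence, so the loop always terminates with some decoding.
def decode_bytes(raw: bytes) -> str:
    for encoding in ("utf-8", "cp1252", "latin-1"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")  # unreachable safety net
```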
Several themes regarding air quality are mixed together on the ISPRA website. We actually don't know whether ISPRA has data for every wanted pollutant. Here are the download links we could find:

PM10: https://annuario.isprambiente.it/sites/default/files/sys_ind_files/indicatori_ada/448/TABELLA%201_PM10_2019_rev.xlsx
PM2.5: https://annuario.isprambiente.it/sites/default/files/sys_ind_files/indicatori_ada/452/TABELLA%201_PM25_2019_rev_0.xlsx
NO2: https://annuario.isprambiente.it/sites/default/files/sys_ind_files/indicatori_ada/450/TABELLA1_NO2_2019.xlsx
O3: https://annuario.isprambiente.it/sites/default/files/sys_ind_files/indicatori_ada/451/TABELLA%201_O3_SALUTE_2019.xlsx

The URLs can hardly be derived from rules. The only standard part is the prefix:

https://annuario.isprambiente.it/sites/default/files/sys_ind_files/indicatori_ada/{theme_number}

where theme_number is loosely related to the pollutant and not to some standard classification. The last part of the URL reads loosely as follows:

/TABELLA[ ]1_{pollutant}_{reference_year}[_revision].xlsx

I don't know whether this information can be used to construct a download link on the fly but, at least, it is useful for extracting metadata about the dataset, namely the pollutant name and the reference year, which in turn are not present as columns in the dataset.
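The metadata extraction suggested above can be sketched with a regular expression; the pattern is inferred from the four links only, so it may need widening for other years or revisions:

```python
import re
from urllib.parse import unquote

# File-name pattern inferred from the four known ISPRA links:
# TABELLA[ ]1_{pollutant}_{year}[_revision].xlsx
FILENAME_RE = re.compile(r"TABELLA ?1_(?P<pollutant>[A-Z0-9_]+)_(?P<year>\d{4})")

def ispra_metadata(url: str) -> dict:
    """Recover pollutant and reference year from an ISPRA download URL."""
    name = unquote(url.rsplit("/", 1)[-1])
    m = FILENAME_RE.search(name)
    if m is None:
        raise ValueError(f"unrecognized ISPRA file name: {name}")
    # Drop qualifiers such as SALUTE so that O3_SALUTE maps to O3.
    return {"pollutant": m.group("pollutant").split("_")[0],
            "year": int(m.group("year"))}
```

This does not build links on the fly (the {theme_number} part stays opaque), but it does yield the two fields missing from the dataset columns.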
Dataset structure also differs from file to file. Mappings will be provided shortly
NB: if we just wanted to test the pipeline process, we could switch to other, easier-to-manage data sources, such as the same EEA portal used to fetch the French data. But a different data source like ISPRA gives us a better use case, since it also exercises the data integration process.
About ISPRA data file harmonization: the dataset structure differs from file to file, so the actual mapping needs a little reworking. The mapping is explained in the attached file Air Pollution meta.xlsx.
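The per-file mappings from the attached workbook could be encoded as plain rename dictionaries, one per pollutant file. The source names below come from the PM10 structure listed earlier in this thread; the target names are hypothetical, since the authoritative mapping is in Air Pollution meta.xlsx:

```python
# Hypothetical harmonization: map each file's columns onto one common
# schema. Target names are illustrative only.
PM10_MAP = {
    "Regione": "region",
    "Provincia": "province",
    "Comune": "municipality",
    "Nome della stazione": "station_name",
    "Valore medio annuo [µg/m³]": "annual_mean_ugm3",
}

def harmonize(record: dict, column_map: dict) -> dict:
    """Rename known columns; keep unmapped ones under their original name."""
    return {column_map.get(k, k): v for k, v in record.items()}
```

One such dictionary per pollutant file would absorb the structural differences while leaving the loading code identical for all four datasets.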
There are three sections:
Design and implement data workflow.