INTERSTAT / Statistics-Contextualized

Models for the dissemination of contextualized statistical data

GF data workflow #12

Open FranckCo opened 2 years ago

FranckCo commented 2 years ago

Finish implementation of the GF data workflow

pafrance commented 2 years ago

GF (Geolocalized Facilities) Pilot Roadmap

Italian data sources, ontologies, tools

Italian data have been selected starting from the two specific user stories defined for the GF pilot:

• In the “visitor” case, we consider a user visiting a place she does not know and wondering where the nearest facilities of different types are located. She would also like to know what events are scheduled in the nearby stadiums, theatres or cultural venues. From the description of locations or events, it should be simple to navigate the web for further details (e.g., on artists or sports teams, the history of places, links to the locations’ websites, etc.).

• The “local decider” story is about a person in charge of an investment decision at a local level. It can be the manager of a bus company wondering if he should replace an old vehicle, an employee of an educational public service assessing the creation of a new class in a community school, or a young couple thinking of moving to a rural place, etc. He needs information about the level and capacity of the equipment in the neighbourhood, linked with data on the demographic evolution at a fine level. He will probably need to combine that information with other sources more relevant to his specific problem.

See the deliverable for reference.

1. Italian Cultural Facilities and Events

The MiBACT presents a new website in which data are also available in LOD and Open Data formats. MiBACT Open Data are described in RDF format compliant with the DCAT-AP-IT standard adopted by AGID, with the specific Catalogue. Data can be extracted through predefined SPARQL queries, which have been modified and also imported into Idra as SPARQL Datasets. The queries are the following:

• Cultural Facilities: data available on the MiBACT SPARQL endpoint. The query returns the following data structure: name, description, website and geographic coordinates of cultural facilities. The same query has been uploaded to the Idra web application to facilitate data analysis, transformation and visualization.

• Cultural Events: data available on the MiBACT SPARQL endpoint. The query returns the following data structure: name, description, website and geographic coordinates of cultural events. The same query has been uploaded to the Idra web application to facilitate data analysis, transformation and visualization.
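As an illustration of how such a query result can be consumed, here is a minimal Python sketch that flattens a response in the standard SPARQL 1.1 JSON results format into plain rows. The variable names (name, description, website, lat, long) and the sample values are assumptions made for this example; the real names are those used in the attached queries.

```python
import json

# Assumed variable names; the real ones come from the pilot's SPARQL queries.
COLUMNS = ["name", "description", "website", "lat", "long"]


def bindings_to_rows(result_json, columns=COLUMNS):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    doc = json.loads(result_json)
    rows = []
    for binding in doc["results"]["bindings"]:
        # An unbound variable is simply absent from the binding, so it maps to None.
        rows.append({col: binding.get(col, {}).get("value") for col in columns})
    return rows


# Made-up response, shaped like the SPARQL 1.1 JSON results format.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": COLUMNS},
    "results": {"bindings": [
        {
            "name": {"type": "literal", "value": "Example Museum"},
            "website": {"type": "uri", "value": "http://example.org/museum"},
            "lat": {"type": "literal", "value": "41.9"},
            "long": {"type": "literal", "value": "12.5"},
        },
    ]},
})

rows = bindings_to_rows(SAMPLE_RESPONSE)
```

Facilities without a description or website keep the corresponding key with a None value, so every row has the same columns.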

2. Italian Educational Facilities

Data related to the Italian educational facilities are available in RDF format in the data Catalogue published on the Open Data portal of the Ministry for Education, University and Research (MIUR). Data can be extracted through a predefined SPARQL query, which has been modified and also imported into Idra as a SPARQL Dataset. In this case the original endpoint limits the extraction of results (it is possible to obtain only 1500 triples at a time). For this reason, an additional step was necessary: the gf-test repository was added to the GraphDB of the project, and the RDF of the dataset of interest (the EDIANAGRAFESTA dataset, downloaded from the source) was loaded into it. The data considered are those of the year "201819". To overcome the limit, the query was then performed on the triple store in GraphDB and not on the original one. The query is the following:

• Italian Educational Facilities: data available on the gt-test GraphDB SPARQL endpoint. The query returns the following data structure: school year and toponym. The same query has been uploaded to the Idra web application to facilitate data analysis, transformation and visualization.

Services to be implemented and added:

o In order to geolocalize the educational establishments, a specific service/API is needed to transform the toponym into geographic coordinates.

o It is also necessary to classify the Municipalities according to LAU and NUTS3 codes.

All the SPARQL queries used are present in the "SPARQL queries for GF pilot" text file.
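For illustration, querying a GraphDB repository instead of the original endpoint could look like the Python sketch below. It only builds the HTTP request, following the SPARQL 1.1 Protocol and the RDF4J-style /repositories/<id> path that GraphDB exposes. The base URL is an assumption (the address of the project's GraphDB is not given here), and the query is a generic placeholder, not the one from the attached file.

```python
import urllib.parse
import urllib.request


def build_query_request(base_url, repository, query):
    """Build a SPARQL Protocol GET request against a GraphDB repository."""
    endpoint = f"{base_url.rstrip('/')}/repositories/{urllib.parse.quote(repository)}"
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format.
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


GRAPHDB_URL = "http://localhost:7200"  # assumption: not the project's real server
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"  # placeholder query

request = build_query_request(GRAPHDB_URL, "gf-test", QUERY)

# To actually run the query (requires a reachable GraphDB instance):
# with urllib.request.urlopen(request) as response:
#     payload = response.read().decode("utf-8")
```

Keeping the request construction separate from the network call makes it easy to check the URL and headers before pointing the script at a real server.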

SPARQL.Queries.for.GF.pilot.txt

by Pina, Adele, Paolo and Francesca

francescadag commented 2 years ago

The data about the Italian Educational Facilities extracted from the gt-test GraphDB SPARQL end-point have been enriched with geographic coordinates using the specific service/API. The file is available in the FTP area of the project. The accuracy of the address conversion can be further improved.

FranckCo commented 2 years ago

French dataflow first implementation: 5a23b7916bc2c544184f48a54744e8b52ae9f64c.

romaintailhurat commented 2 years ago

@francescadag Hi Francesca!

There is a small error in the target file for educational data [0]: the value of NumeroCivico at line 7587 is "2 where it should be just 2.


It's a small discrepancy, but it raises an error when the file is read with Python. Can you fix it, please?

Thank you.

[0] this file: https://interstat.eng.it/files/gf/input/it/MIUR%20Schools%20with%20coordinates.csv
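A quick way to locate this kind of problem before parsing is to look for lines with an odd number of double quotes. The Python sketch below assumes one record per physical line (no quoted field spans several lines); the sample rows and the "Indirizzo" column name are made up for illustration, only NumeroCivico comes from the file discussed here.

```python
def find_unbalanced_quote_lines(lines, quotechar='"'):
    """Return the 1-based numbers of lines holding an odd number of quote characters."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if line.count(quotechar) % 2 == 1
    ]


# Made-up sample: the third line has a stray opening quote, like the reported one.
SAMPLE = 'Indirizzo,NumeroCivico\nVIA ROMA,2\nVIA MILANO,"2\nVIA "TORINO",5\n'

suspect_lines = find_unbalanced_quote_lines(SAMPLE.splitlines())

# On the real file, the same check could be run as:
# with open("MIUR Schools with coordinates.csv", encoding="utf-8") as handle:
#     suspect_lines = find_unbalanced_quote_lines(handle)
```

The check is deliberately crude: it flags candidates for manual review rather than repairing anything.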

francescadag commented 2 years ago

@romaintailhurat Hi Romain, I edited the file and uploaded it to the server. I changed the line you indicated and fixed similar errors in other lines of the file. Unfortunately, the errors and wrong characters in this file do not come from my processing but from the original dataset, which often contains badly written addresses. I fixed some errors through the script, but for others there was not much to do because the starting address itself is wrongly written in the dataset; for this reason the service was sometimes unable to obtain the coordinates. Furthermore, when the "ExactLocation" field is equal to "false" but the coordinates are present, it means that the service was able to obtain coordinates from the address, but the address did not contain the street number ("NumeroCivico"). In that case the coordinates are not exactly those of the school but those of the street in which it is located.

Updated file: https://interstat.eng.it/files/gf/input/it/MIUR%20Schools%20with%20coordinates.csv
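The rule described above can be sketched in Python as a small classifier. The "ExactLocation" field and its "false" value come from this thread; the coordinate column names ("Latitude", "Longitude") and the "true" value are assumptions and may differ in the actual file.

```python
def location_quality(row, lat_key="Latitude", lon_key="Longitude"):
    """Classify a school record as 'exact', 'street' or 'missing'.

    lat_key and lon_key are assumed column names, not taken from the real file.
    """
    has_coordinates = bool((row.get(lat_key) or "").strip()) and bool(
        (row.get(lon_key) or "").strip()
    )
    if not has_coordinates:
        # The service could not resolve the address at all.
        return "missing"
    exact = (row.get("ExactLocation") or "").strip().lower() == "true"
    # Coordinates without ExactLocation point to the street, not the building.
    return "exact" if exact else "street"


# Made-up rows for illustration.
examples = [
    {"Latitude": "45.07", "Longitude": "7.68", "ExactLocation": "true"},
    {"Latitude": "45.07", "Longitude": "7.68", "ExactLocation": "false"},
    {"Latitude": "", "Longitude": "", "ExactLocation": "false"},
]
qualities = [location_quality(row) for row in examples]
```

Rows read with csv.DictReader can be passed to the function directly, which makes it easy to count how many schools fall into each category.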

romaintailhurat commented 2 years ago

Thank you Francesca, but there are still several issues in the file. I'll try to finalise the data cleaning myself and provide you with the corrected file. And yes, I understand the problem lies in the source file 😄.

francescadag commented 2 years ago

Hi Romain, I also modified the script further in order to improve the file, despite the errors in the starting addresses. I launched the script, and as soon as it finishes I will update you and upload the file to the server.

If you notice other issues and you have already finalized the data analysis, we could discuss it in more detail by e-mail or on a call, to agree on how to optimize the data file.

romaintailhurat commented 2 years ago

Hi Francesca

Thanks for the follow-up! I think I have a fully corrected file now; I'm sending it to you by mail for uploading.

francescadag commented 2 years ago

Hi Romain, I redid the whole file and I'm sending it to you by mail, so we can decide which of the two files is more appropriate to use. In my new file more coordinates are found because I made some changes. I also added the "CodiceComune" field, which is the code associated with the municipality and could be useful later.

romaintailhurat commented 2 years ago

Hi Francesca. This new file you sent me is good! 👍

Can you upload it to the FTP area? Thank you.

francescadag commented 2 years ago

Ok great, the file is here https://interstat.eng.it/files/gf/input/it/MIUR_Schools_with_coordinates.csv

romaintailhurat commented 2 years ago

It works perfectly now, thanks once again @francescadag