INTERSTAT / Statistics-Contextualized

Models for the dissemination of contextualized statistical data
6 stars 3 forks source link

Specify S4Y data workflow #14

Open FranckCo opened 2 years ago

FranckCo commented 2 years ago

Specify the data location, formats, ETL, etc.

francescadag commented 2 years ago

The Italian data to be used for the pilot belong to the MIUR (The Ministry for Education) catalogue, consider both public and private schools, are the following:

The reference dataset for the list and location of schools of all levels is available at the following link and is also considered for the GF pilot. The coordinates of the schools have been added to this dataset and the output file is present in the FTP area of the project.

In addition, we will extract aggregated data about French and Italian student population from the European Data Portal datasets: a selection of datasets available on the European Data Portal, published by Publications Office of the European Union. The Education and training statistics indicators provide information on the participation of individuals in education and training activities, education financing and teaching staff as well as on outcomes of education, view the link. In this expirence, we concentrate on: Participation in Education and Training. The description of data and metadata is available in the Deliverable 2.1, paragraph 3.1.2.

pafrance commented 2 years ago

Data structure for Italian School datasets provide the number of students given the following dimensions:

We need the similar data structure for french data, if available.

ThomasPO commented 2 years ago

Let's consider the following structure:

Field name Description Data type
school_id Unique school identifier string
name Official name of the school string
lambert_x Latitude of the school Float
lambert_y Longitude of the school Float
institution_type Type of institution (Public or Private) Code list
school_level Level of school (Nursery, Primary, Secondary) Code list
scholastic_year Scholastic year Year (from 2019 to 2021)
course_year 0 for Nursery school, 1 for first year of Primary school, 2 for second year of Primary school and so on until the last year of upper secondary shool Code list
students_number Number of students (measure) integer

How to get French data compliant with this structure?

From the school registry dataset (called "Address and geolocalization of primary and secondary educational institutions"), we can get:

To note:

Then, we can get number of students for the three last scholastic year (2019, 2020 and 2021) from 4 datasets :

course_year dimension has to be coded from each measure "number of students by grade" (its exists 1 measure “Number of students” per grade).

pafrance commented 2 years ago

Here's a sketch with data work flow we are implementing Data Flow Ontology_Integration_v3

Data workflow needs the following steps to be implemented:

  1. Define one single Ontology as conceptual data model of common domain of S4Y
  2. Data Harmonization 2.1. Prepare a common data federated archive to harmonize data formats 2.2. Data transformation: i.e. Transform Lambert coordinates to common EU standards WGF84 and then to LAU using the same services developed for the other pilots
  3. Map Ontology to data sources
  4. Use a SparQL query to integrate data
  5. Export integrated dataset in json format for context broker ingestion
francescadag commented 2 years ago

Procedure to extrapolate the French datasets listed by @ThomasPO The platform provides APIs for the datasets extrapolation, whose parameters can be suitably configured to extrapolate only the fields of interest. API used: https://data.education.gouv.fr/api/v2/catalog/datasets/{dataset_id}/exports/{format}

From the school registry dataset (called "Address and geolocalization of primary and secondary educational institutions"), have been extracted: _School id, Name, Lambert_x, Lambert_y, Institution_type, Schoollevel

There is also the Status_code field which in the dataset has the following value: 1 = open; 2 = to be closed; 3 = to open. Only open schools were considered and therefore with Status_code equal to 1. API to get the dataset in CSV. API to get the dataset in RDF.

Datasets from which the number of students for the three last scholastic year (2019, 2020 and 2021) have been extracted with such fields: _Scholastic_year, School_id, Institution_type, Number_of_classes, Number_ofstudents

1) Student enrollment by grade and number of classes per school (only primary and nursery school API to get the dataset in CSV. API to get the dataset in RDF.

2) Number of students by grade, gender, most frequent modern languages 1 and 2, by lower secondary school API to get the dataset in CSV. API to get the dataset in RDF.

3) Number of students by grade, gender, most frequent modern languages 1 and 2, by upper secondary school (school of general and technological education API to get the dataset in CSV. API to get the dataset in RDF.

4) Number of students by grade, gender, most frequent modern languages 1 and 2, by upper secondary school (vocational school API to get the dataset in CSV. API to get the dataset in RDF.

These datasets contain many other fields that may be of interest. If so, they can then be added to the API for extraction. They have been added in the FTP server in RDF format.

pafrance commented 2 years ago

We checked French data models and we need a set of other variables as follows:

  1. Student's sex.
  2. Age class. (0-5 yrs, 5-10 yrs etc)
  3. Student's course of study: This is a complex hierarchy to be defined together based on some sort of standard.
  4. Coordinates should be converted in EU standard
  5. Add NUTS3 and LAU
francescadag commented 2 years ago

I report the following requests for @ThomasPO from a Michele's mail regarding the French data model: 1 - Which is the mapping to recode school level from French classification ? Have we to decode from the french field "Code_nature". Otherwise have you a mapping to recode from french classification to ISCED ? 2 - We have founded in french school dataset also "Position" field in WGS84 international format. Are these better than Lambert coordinates ? 3 - Is french field "Code_commune" the same of LAU code or it must be recoded ? and how? 4 - From column "Dénomination principale" can we derive the type of Study area (kind of Lycee, Technological, ....)

francescadag commented 2 years ago

Upon request, I get the complete datasets concerning the Number of students for the three last scholastic year (2019, 2020 and 2021):

  1. Student enrollment by grade and number of classes per school (only primary and nursery school API to get the dataset in CSV. API to get the dataset in RDF.
  2. Number of students by grade, gender, most frequent modern languages 1 and 2, by lower secondary school API to get the dataset in CSV. API to get the dataset in RDF.
  3. Number of students by grade, gender, most frequent modern languages 1 and 2, by upper secondary school (school of general and technological education API to get the dataset in CSV. API to get the dataset in RDF.
  4. Number of students by grade, gender, most frequent modern languages 1 and 2, by upper secondary school (vocational school API to get the dataset in CSV. API to get the dataset in RDF. The datasets loaded in the previous comment contained only some columns of these starting datasets. They have been added in the FTP server here in CSV and RDF format.
ThomasPO commented 2 years ago

@francescadag when you ask for "Student's sex" and "Age class. (0-5 yrs, 5-10 yrs etc)" variables, do you mean these variables for each school? I can only find "sex" variable for each upper secondary school.

francescadag commented 2 years ago

Hi Tomas, does your request relate to this comment? I therefore think is addressed to @pafrance .

ThomasPO commented 2 years ago

Yes, exactly. It's addressed to @pafrance

pafrance commented 2 years ago

Whenever age class is not available, the "course year" variable can approximate it. Anyhow, this topic has been addressed via mail. Quoted excerpt follows

Hi Thomas

about: when you ask for "Student's sex" and "Age class. (0-5 yrs, 5-10 yrs etc)" variables, do you mean these variables for each >school? I can only find "sex" variable for each upper secondary school. When possible Age class and Sex for each school, but also Italian data have Sex only for upper secondary school.

What level of detail of the ISCED nomenclature do you want? We want only ISCED with only one digit. Otherwise mapping is too hard. If possible we can classify only Upper secondary school so: “general lycées", "technological lycées”, “professional lycées”, >"Formation d'apprentis" Is it possible from “code nature” ? LAU: Code_Commune is European code or French code ?

At least: remember that we have to compare data having the same scolastic year.

Thanks Michele

ThomasPO commented 2 years ago

I answer this comment quoted below to facilitate reading.

I report the following requests for @ThomasPO from a Michele's mail regarding the French data model: 1 - Which is the mapping to recode school level from French classification ? Have we to decode from the french field "Code_nature". Otherwise have you a mapping to recode from french classification to ISCED ? 2 - We have founded in french school dataset also "Position" field in WGS84 international format. Are these better than Lambert coordinates ? 3 - Is french field "Code_commune" the same of LAU code or it must be recoded ? and how? 4 - From column "Dénomination principale" can we derive the type of Study area (kind of Lycee, Technological, ....)

1 - Mapping between each school and ISCED is not possible (answer from French ministry of Eductation) 2 - Yes, you can take field in WGS84 international format instead of Lambert. We will see later if we need to adjust/correct this point 3 - Code_Commune is the same than LAU code. No needed to be recoded. 4 - About “Type of study area” (“general lycées", "technological lycées”, “professional lycées”, "Formation d'apprentis"), we can't use "Denomination" (637 codes). Based on "code nature", it's not possible to map with only one code because some school can be at the same time "general lycées" and "technological lycées" or other combinations.

ThomasPO commented 2 years ago

I try to propose a list of variables we can have for french datasets (see below). If you want some data not available in France, maybe can you have data for Italy and not for France?

Note that we will use the list of current school and for each opened school, we will have the last three scholastic years (of course if the school exists since 3 or more years).

Field name Description Data type Comment
school_id Unique school identifier string variable "numero_uai" in the dataset
name Official name of the school string variable "appellation_officielle
latitude Latitude of the school (WGS84) Float variable "latitude"
longitude Longitude of the school (WGS84) Float variable "longitude"
lau Lau code Code list variable "code_commune"
nuts3 NUTS3 code Code list to be coded
institution_type Type of institution (2 values Public or Private) Code list variable "secteur_public_prive_libe". Has to be coded. Code list to be defined
scholastic_year Scholastic year Year Data available from 2019 to 2021
course_year 0 for Nursery school, 1 for first year of Primary school, 2 for second year of Primary school and so on until the last year of upper secondary shool Code list to be coded: do different cartesian products depending on type of school
sex Gender of the students Code list has to be coded. Code list to be defined. Only available for upper secondary school. Cartersian product to do for these schools. Does it mean we will have a "Total" code for primary and lower secondary schools?
students_number Number of students (measure) per scholastic year, sex and course year integer to be calculated: sum of different measures (the list of measures to be summed differs for each dataset) depending on scholastic year, sex and course year dimensions

Feel free to comment the list.

pafrance commented 2 years ago

Here's an update of the pipeline based on Ontology we are implementing for the pilot. This methodology is generalized and applicable to other pilots as well as a mixed approach based on both ETLs and Ontology mappings.

image

Main steps of the pipeline to be implemented for S4Y pilot:

  1. Data Acquisition - Input Data are downloaded from the source websites into a staging area. 1.1. Input Data Acquisition - Data are downloaded from web sources either by procedure or manually 1.2. DB Upload - Data transfer to a relational database for reprocessing

  2. Data Reprocessing - Data are harmonized to a common data model through ETL, or transformed to create needed variables for data matching. Operation are performed on actual data at this stage. 2.1. Harmonization – Data are harmonized to a common data model through ETL. 2.2. Transformation - Data aggregation and standardization, unit identification, common variables creation according to target data models, creation of new variables Queries and views implement ETL Logical level Processes in the underlying host MYSQL database. Harmonization and Transformation processes can be run in parallel and the result is stored into harmonized datasets.

  3. Conceptual INTEGRATION – Once data are harmonized, they can be integrated. Data are integrated on a conceptual level by mapping onto ontology. Data are not linked physically but through a sparQL query, thus the integration is virtual. 3.1. Mapping – Virtualization process associating physical data to ontology concepts. 3.2. Quering - Virtual integration by SparQL queries. Results can be exported into the desired format. Data are federated, that is to say, they can be viewed as a single coherent set, even when actual data sources vary in format and storage technology. The components implementing the mapper and the reasoner are submodules of an Ontology Based Management System (OBDA System).

  4. Direct dissemination – End point can be used to query and disseminate data in table format or send data to specific applications. They represent the communication interface with the external world. 4.1. SparQL – Queries are provided as input to the system through the endpoint interface 4.2. SparQL result – Results are provided as output through the endpoint interface.

  5. Context Broker Ingestion – Data exported in JSON can be sent to context broker via a converter module into NGSI-LD format. 5.1. Queries are provided at design time by the designer when using this endpoint. 5.2. Results must be converted into a specified format JSON NGSI-LD through a dedicated converter module.

Computing Management and Optimization

Tasks can be managed and executed either physically or virtually, to balance the resources for data processing. While physical elaboration is quicker but static, the virtualization is more dynamic but more complex. One example is the conversion of the Geo coordinates in Administrative Units (LAU). If Virtualization is chosen, GEO SparQL queries can be integrated into the Ontological framework, and the corresponding LAU can be derived virtually, but this is rather heavy on the reasoner because the result must be calculated on the fly at each query. So, one can just statically convert the coordinates through a dedicated service and create a materialized new variable to store LAU and then virtualize it without the need to reference GEO SparQL.

pafrance commented 2 years ago

S4Y French and Italian School Registry Data Acquisition phase

French data: Schools: number of students by school level, number of classes. Italian data: MIUR (Italian Ministry of Education)

S4Y Data Processing - Harmonization phase. Data must be harmonized according to the common data model specified in the table below. immagine

Common Data Model is directly related to corresponding Ontology concepts, to make ontology mapping easier. Some variables must be created or extracted by other services:

Metadata: ISCED (International Standard Classification of Education) refers to the national (and sub-national) education programme and the related recognised educational qualification. Link to map the classifications of different countries Mapping ISCED (CITE) – French data

IT Data Preparation and File name structure for data acquisition • During the upload, data must be transformed to add the Type of educational institution (Public or Private) and the reference school year • The file name is created according to a nomenclature rule Code list of variable DescrizioneTipologiaGradoIstruzioneScuola:

Example: immagine

Educational attainment level: low education, medium education, high education ISCED attainment level low education • ISCED 0: Early childhood education (‘less than primary’ for educational attainment) • ISCED 1: Primary education • ISCED 2: Lower secondary education medium education • ISCED 3: Upper secondary education • ISCED 4: Post-secondary non-tertiary educ high education • ISCED 5: Short-cycle tertiary education • ISCED 6: Bachelor’s or equivalent level • ISCED 7: Master’s or equivalent level • ISCED 8: Doctoral or equivalent level

Istitution type { public = 1, private = 0 }

Metadata and harmonization outline are provided in the attached excel file S4Y-SchoolRegistry.meta.xlsx

ThomasPO commented 2 years ago

For the French dataset about schools, variables "latitude" and "longitude" are expressed in WGS84 international format. It's not neccessary to convert coordinates. There is only a list of current schools more precisely there are three status:

For getting scholastic year for no closed schools, different possibilities:

  1. Based on "date_ouverture" variable (meaning opening date), it's possible to create new entries. For instance, if a school is created on 2020-09-01, we will create 2 years (2020 and 2021 and not 2019). Here i'm considering the scholastic year as the starting year.
  2. Three scholastic years (2019 to 2021) are available in datasets contaings data about number of students. Merging with these datasets allows to get scholastic years
ThomasPO commented 2 years ago

Conclusion on what has been done on the French pipeline.

The French datasets used are:

  1. The school registry dataset (called "Address and geolocalization of primary and secondary educational institutions") Data extraction is performed using the API to retrieve a CSV file.

The number of students for the three last scholastic year (2019, 2020 and 2021) from 4 datasets :

  1. Student enrollment by grade and number of classes per school (only primary and nursery school)
  2. Number of students by grade, gender, most frequent modern languages 1 and 2, by lower secondary school
  3. Number of students by grade, gender, most frequent modern languages 1 and 2, by upper secondary school (school of general and technological education)
  4. Number of students by grade, gender, most frequent modern languages 1 and 2, by upper secondary school (vocational school)

Data extraction is performed based CSV files directly available online.

Variables finally extracted are: Field name Description Data type Comment
school_id Unique school identifier string variable "numero_uai" in the dataset n°1
name Official name of the school string variable "appellation_officielle in the dataset n°1
latitude Latitude of the school (WGS84) Float variable "latitude" in the dataset n°1
longitude Longitude of the school (WGS84) Float variable "longitude" in the dataset n°1
lau Lau code Code list variable "code_commune" in the dataset n°1
institution_type Type of institution Code list (Coded as "PR" for "Private" and "PU" for "Public") variable "secteur_public_prive_libe" in the dataset n°1
code_nature Nature code Code list variable "nature_uai" in the dataset n°1
ISCED_level Level of ISCED classification Code list coded based on code_nature (see mapping file here)
scholastic_year Scholastic year Year Datasets n°2 to n°5
students_number Number of students (measure) per scholastic year integer Datasets n°2 to n°5

The ETL process producing French data file is organized according to the usual steps:

pafrance commented 1 year ago

I replace "cordonne x,z" with legit latitude and longitude and will upgrade the excel I posted earlier.

About school year. I personally doubt that school year will be of any use in the dataset about schools. I think it is more useful with student data. Do anybody else shares my thought?

----- Messaggio originale ----- Da: "Thomas Dubois" @.> A: "INTERSTAT" @.> Cc: "Paolo Francescangeli" @.>, "Mention" @.> Inviato: Giovedì, 14 aprile 2022 10:57:43 Oggetto: Re: [INTERSTAT/Statistics-Contextualized] Specify S4Y data workflow (Issue #14)

For the French dataset about schools, variables "latitude" and "longitude" are expressed in WGS84 international format. It's not neccessary to convert coordinates. There is only a list of current schools more precisely there are three status:

For getting scholastic year for no closed schools, different possibilities:

  1. Based on "date_ouverture" variable (meaning opening date), it's possible to create new entries. For instance, if a school is created on 2020-09-01, we will create 2 years (2020 and 2021 and not 2019). Here i'm considering the scholastic year as the starting year.
  2. Three scholastic years (2019 to 2021) are available in datasets contaings data about number of students. Merging with these datasets allows to get scholastic years

-- Reply to this email directly or view it on GitHub: https://urlsand.esvalabs.com/?u=https%3A%2F%2Fgithub.com%2FINTERSTAT%2FStatistics-Contextualized%2Fissues%2F14%23issuecomment-1098884267&e=17c5563b&h=a429f30a&f=n&p=y You are receiving this because you were mentioned.

Message ID: @.***>