INTERSTAT / Statistics-Contextualized

Models for the dissemination of contextualized statistical data

SEP data workflow: Italian census data #9

Open FranckCo opened 2 years ago

FranckCo commented 2 years ago

Italian census data is currently produced manually. Explore possibilities of automation.

vaccaricarlo commented 2 years ago

Mail from Paolo confirms that it's not possible to automate the process for Census data

FranckCo commented 2 years ago

OK, but we still can do better than mail: have a documented manual procedure to produce the data and a fixed URL where they can be obtained.

pafrance commented 2 years ago

> OK, but we still can do better than mail: have a documented manual procedure to produce the data and a fixed URL where they can be obtained.

Here's the permanent census data source. You can browse the data manually, customize your query and export data using the provided toolbar. I also found a contact page where it's possible to request data on demand, but it's still a manual task.

I would like every data provider to offer a direct connection to its datasets, but in reality each provider has its own quirks and habits in how it publishes data. I think it would be good to build a protocol specifying how to GET data from data providers or let them POST data to our repository. Here are some examples:

1) Auto-browsing: when data are presented through an interface that is easy to browse, such as simple HTML or a similar format, extraction can be done automatically with a spider (like Google's).
2) Human interaction: the Istat census site is well made, but alas it is made for HUMAN interaction. This is another case to study: does it make sense to devise a spider to browse such interfaces automatically? I don't even know whether such tools exist at all.
3) Data provider initiative: we could devise a module in our system that makes it easier for the data sources' contact officers to post data on our site in a standardized way.

But we can never expect complete M2M compliance from data providers, at least for the foreseeable future.
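
As a rough illustration of case 1 (auto-browsing), here is a minimal Python sketch of the extraction half of a spider: it pulls the rows out of a plain HTML table using only the standard library. The class and function names (`TableExtractor`, `extract_rows`) and the sample HTML are made up for the example, and it assumes the page really is simple, static HTML, which is not the case for the Istat census interface.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None outside <tr>
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return the table rows found in an HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical page content, for illustration only.
SAMPLE_HTML = (
    "<table>"
    "<tr><th>Territory</th><th>Value</th></tr>"
    "<tr><td>Torino</td><td> 17000 </td></tr>"
    "</table>"
)
rows = extract_rows(SAMPLE_HTML)
```

Fetching the page and following links would sit on top of this; the point is only that case 1 needs nothing beyond parsing, whereas case 2 would require driving an interactive interface.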

FranckCo commented 2 years ago

I repeat, the procedure should (even if it is manual):

pafrance commented 2 years ago

Unfortunately, the interface is not machine readable. It cannot generate a URL to get the file. So no.

The only solution is to try to get machine-readable data from another source. What about the SEP from Eurostat?

----- Original message ----- From: "Franck Cotton" @.> To: "INTERSTAT/Statistics-Contextualized" @.> Cc: "Paolo Francescangeli" @.>, "Comment" @.> Sent: Friday, 12 November 2021 13:49:35 Subject: Re: [INTERSTAT/Statistics-Contextualized] SEP data workflow: Italian census data (Issue #9)

I repeat, the procedure should (even if it is manual):


pafrance commented 2 years ago

Italian census data retrieval: main steps and workflow

Census Data extraction

Step 1: Download from Istat source website
Step 2: Browse CENSUS OF POPULATION AND HOUSING from the left toolbar
Step 3: Select Population/Demographic characteristics and citizenship/Age structure – municipalities
Step 4: Select Customise/Table options from the top toolbar and a pop-up window appears
Step 5: Select from panel Dimension Member Labels/All dimensions/Use codes and then View data
Step 6: Select Customise/selection/Select time from the top toolbar and a pop-up window appears
Step 7: Select date range (2018-2018) and then View data
Step 8: Select Export/csv from the top toolbar and then select Download from the pop-up window

The downloaded file is not compliant with the required DSD.

Data transformation

The downloaded file has the following Data Structure:

ITTER107,"Territory","TIPO_DATO_CENS_POP","Datatype","SEXISTAT1","Gender","ETA1","Age class","TIME","Select time","Value","Flag Codes","Flags"

  1. Data need to be filtered in order to obtain the requested Data Structure.
  2. A NUTS3 variable has been added through a transformation from the ITTER107 variable, using data from the ISTAT LAU archive.
  3. The metadata for the NUTS3 transformation need to be downloaded from the Istat website and merged.
    Metadata are referenced in a time series; the variable for year 2018 has been used in the script.
  4. The sex codelist needs to be translated according to the standard.

Data have been transformed through an R script provided as a download together with the present documentation.
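
For readers who want to see the shape of these steps without opening the R script, here is a minimal Python sketch. It is not the project's script. The output column names (`NUTS3`, `LAU`, `SEX`, `AGE`, `TIME`, `OBS_VALUE`), the sex code mapping, and the sample row and lookup entry are all assumptions made for the example; the real target structure is whatever the DSD requires.

```python
import csv
import io

# Assumed mapping from the Istat SEXISTAT1 codes to standard-style codes.
# Illustrative only: check the actual codelists before relying on it.
SEX_MAP = {"1": "M", "2": "F", "9": "T"}


def transform(raw_csv, lau_to_nuts3):
    """Keep the code columns, add NUTS3 from a LAU lookup, and recode sex.

    raw_csv      -- text of the exported file, header included
    lau_to_nuts3 -- dict built from the LAU/NUTS3 metadata (hypothetical here)
    """
    reader = csv.DictReader(io.StringIO(raw_csv))
    out = []
    for row in reader:
        lau = row["ITTER107"]
        nuts3 = lau_to_nuts3.get(lau)
        if nuts3 is None:
            # Territory not present in the lookup: drop it (step 1, filtering).
            continue
        out.append({
            "NUTS3": nuts3,                    # step 2, added variable
            "LAU": lau,
            "SEX": SEX_MAP[row["SEXISTAT1"]],  # step 4; unknown codes raise KeyError
            "AGE": row["ETA1"],
            "TIME": row["TIME"],
            "OBS_VALUE": int(row["Value"]),
        })
    return out


# Made-up sample in the layout of the downloaded file.
SAMPLE = (
    'ITTER107,"Territory","TIPO_DATO_CENS_POP","Datatype","SEXISTAT1","Gender",'
    '"ETA1","Age class","TIME","Select time","Value","Flag Codes","Flags"\n'
    '"001272","Torino","X","x","1","males","Y_UN4","0-4 years","2018","2018",17000,,\n'
    '"IT","Italy","X","x","9","total","Y_UN4","0-4 years","2018","2018",2000000,,\n'
)
result = transform(SAMPLE, {"001272": "ITC11"})
```

Step 3 (downloading and merging the metadata) is represented here only by the `lau_to_nuts3` dictionary passed in.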

Data Load

The transformed file was uploaded into INTERSTAT GraphDB. GraphDB allows direct links to the resources through a GET permalink, but the raw data need a little reworking to be accessed directly: they can be downloaded by rewriting the POST URL using the ID in the permalink.

Transformation script in R language: Pilot A - census data processing.txt

francescadag commented 2 years ago

The source file from the Italian Census has been uploaded to the FTP area of the project. As requested, the metadata files for conversion from ISTAT territorial codes to LAU and NUTS3 have been uploaded to GitHub. In addition to the metadata, the Italian NUTS3 has been uploaded to GitHub as well.

FranckCo commented 2 years ago

Census data pipeline for Italy now fully implemented (f94cf3b8a84f22adb68152e1832ddf02feeec4dd), except conversion to NGSI-LD.

pafrance commented 2 years ago

Hi, we were reviewing the census output in the FTP repository and noticed several details that don't seem to add up. Can you check, please?

1) It seems that Age class didn't translate well for the Italian data. It always says "Y_LT_5" or "Y_UN4" for all IT rows.
2) French population seems to be a float while Italian is an integer. Is this mismatch correct?
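
Both points could be checked mechanically. Here is a minimal Python sketch, assuming the output has been loaded as a list of records with hypothetical keys `country`, `age` and `population`; the sample records and the French age codes are invented for the example and are not taken from the repository.

```python
from collections import defaultdict


def summarize(rows):
    """Return, per country, the distinct age classes and the value types seen."""
    ages = defaultdict(set)
    types = defaultdict(set)
    for row in rows:
        country = row["country"]
        ages[country].add(row["age"])
        types[country].add(type(row["population"]).__name__)
    return dict(ages), dict(types)


def suspicious_countries(ages, min_classes=2):
    """Countries whose rows carry fewer distinct age classes than expected."""
    return sorted(c for c, classes in ages.items() if len(classes) < min_classes)


# Invented records, for illustration only.
SAMPLE_ROWS = [
    {"country": "IT", "age": "Y_UN4", "population": 120},
    {"country": "IT", "age": "Y_UN4", "population": 98},
    {"country": "FR", "age": "A", "population": 130.0},
    {"country": "FR", "age": "B", "population": 141.5},
]
ages, types = summarize(SAMPLE_ROWS)
flagged = suspicious_countries(ages)
```

With the sample above, `flagged` contains only `IT`, and `types` shows `int` for IT and `float` for FR, which are the two symptoms described.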