dataverbinders / nl-open-data

A Flexible Python ETL toolkit for datawarehousing framework based on Dask, Prefect and the pydata stack
https://dkapitan.github.io/nl-open-data
MIT License
0 stars 1 forks source link

Data modelling #88

Open galamit86 opened 3 years ago

galamit86 commented 3 years ago

@dkapitan - I think this a good place for you to step in.

See below @Michelangelo367's ideas - I am guessing you have some thoughts and could perhaps assist with some guidance or focus?

What I have done in the past for UWV is as follows: Creating a data model and checking the data with the source (system). And from the application screens, we used the possibilities of building a data model from scratch. We can do more or less the same with CBS data, since CBS has already some presentation screens (with filters) on their website. That information is familiar for clients like Gemeente or Commercial organizations. I think we should start with the basic detail key views (data model) and then expanding them with more fields. And then add aggregation views in a later stage.

Michelangelo367 commented 3 years ago

I think statline data has been transformed with python in google cloud tables (dataverbinders-cbs-dl), so we are missing some original information which is present in the csv downloads of cbs statline. Or is (dataverbinders-cbs-dl) not the raw import data?

dataverbinders-cbs-dl: cbs_v4_83765NED If I compare cbs_statline_metadata_csv..png, cbs_statline_untyped_203629_csv..png, cbs_statline_typed_203639_csv..png (downloads: metadata, onbewerkte dataset - filter amsterdam, dataset voor grafische weergave - filter amsterdam) Onbewerkte dataset Onderwerpen: Alles (108), Wijken en buurten: Selectie (579 / 16667), Aantal geselecteerde cellen: 62.532 https://opendata.cbs.nl/statline/portal.html?_la=nl&_catalog=CBS&tableId=83765NED&_theme=235

with:

google_cloud_83765NED_MeasureGroups..png, google_cloud_83765NED_Observations.png, google_cloud_83765NED_WijkenenBuurtenCodes.png, google_cloud_83765NED_WijkenenBuurtenGroups.png

It would be beneficial to add the missing fields (red) of printscreen "missing metadata fields in google cloud", so we can make (sub)groups available as a selection criteria for visualization reports! missing metadata fields in google cloud_2021-04-24_21-47-12

To do: Still need to calculate the summary for instance mannen_6 in combination with wijkenenbuurten and see if the same number appears in google cloud. The total summary number for mannen_6 for typed and untyped are the same.

Action: In CSV files I noticed the soortregio_2 (Gemeente, Wijk, Buurt). I guess it's from the first two letters of Wijkenbuurten (GM, WK, BU), but perhaps there is a separate lookup reference table table available. Otherwise we need to add the Code (GM, WK, BU) with description into the table 83765NED_WijkenenBuurtenCodes.

The item Mannen_6 is trimmed into Mannen and the ID 6 is not available in the Google Cloud Platform. Mannen (6) and Vrouwen (7) are a sublist of Geslacht (8) (TopicGroup).

Action: What we need to do is add the hierarchy id (topicgroup) into measuregroups and add the missing hierarchy id's into measurecodes! Or creating a seperate hierarchy table. We need to see how reports are performing as well as a check.

I guess this issue with hierarchy structure and description is present in other tables as well!

cbs_statline_typed_203639_2021-04-24_21-20-55 cbs_statline_untyped_203629_2021-04-24_21-19-07 google_cloud_83765NED_WijkenenBuurtenGroups2021-04-24_21-10-33 google_cloud_83765NED_WijkenenBuurtenCodes_2021-04-24_21-08-26 google_cloud_83765NED_Observations_2021-04-24_21-06-03 google_cloud_83765NED_MeasureGroups_2021-04-24_21-02-34 cbs_statline_metadata_csv_2021-04-24_20-57-09 google_cloud_83765NED_MeasureCodes_2021-04-24_20-56-05 google_cloud_platform_83765NED_2021-04-24_19-27-01

Michelangelo367 commented 3 years ago

I see that there is a hierarchy in the parquet files measuregroups and measurecodes, but It's not the csv numbering. So, we can still make the relationship! No action needed!!

I see the T001038 MeasureGroupID in the 83765NED_MeasureCodes as well. T001038 is the link to ****

83765NED_MeasureGroups: { "Id": "T001038", **** "Index": "8", "Title": "Geslacht", "Description": "", "ParentId": "M000352" },

83765NED_MeasureCodes: { "Identifier": "3000", "Index": "9", "Title": "Mannen", "Description": "", "MeasureGroupId": "T001038", **** "DataType": "Long", "Unit": "aantal", "Format": null, "Decimals": "0", "PresentationType": "Absolute" }, { "Identifier": "4000", "Index": "10", "Title": "Vrouwen", "Description": "", "MeasureGroupId": "T001038", **** "DataType": "Long", "Unit": "aantal", "Format": null, "Decimals": "0", "PresentationType": "Absolute" },

83765NED_MeasureGroups parquet_2021-04-24_23-23-43 83765NED_MeasureCodes parquet_2021-04-24_23-22-42

dkapitan commented 3 years ago

@Michelangelo367 let's try to meet online this week to get to know each other and discuss this project. I have just send you a LinkedIn request.

Michelangelo367 commented 3 years ago

I noticed that there are duplicates in table cbs_v4_83765NED.83765NED_Observations See select query: Output: row 1-8

image

Michelangelo367 commented 3 years ago

@galamit86 Do you know how to get similar website search results (8) and get as well the table information with python script? Is the table information like for instance 83005NED stored as 83005NED_DataProperties ( Source format PARQUET)? Is it possible to select the Parquet data with SQL?

image

image

I already found standard python scripts of cbs v3 and v4.

dkapitan commented 3 years ago

I noticed that there are duplicates in table cbs_v4_83765NED.83765NED_Observations See select query: Output: row 1-8

image

@Michelangelo367 @galamit86 I think we may have duplicate data in dataverbinders-cbs-dl because this query does yield only 2 rows as a result:

SELECT * FROM `dataverbinders-dev.cbs_v4_83765NED.83765NED_Observations`
where Measure in ('3000', '4000') and WijkenEnBuurten = 'GM0363'
Michelangelo367 commented 3 years ago

@galamit86 @dkapitan Daniel mentioned topic mantelzorg and huisarts today;

I am thinking as well about topics that are now in discussion: 1)weten hoe de woon-, werk, leer en vervoerspatronen veranderen als gevolg van COVID 2)wat zijn structurele effecten van COVID op natuurontwikkeling (stikstof, CO2, fijnstof, etc ) 3)inzicht krijgen in de samenstelling van de agrarische sector; gericht op verduurzaming van de landbouw 4)Voor zwemwaterlocaties kunnen voorspellen hoe de kwaliteit van het zwemwater zich gaat ontwikkelen zodat ik gerichter proefmonsters kan afnemen

Should I first focus on developing datamodel for mantelzorg and huisarts?

Other topics are: -weggedrag beinvloeden prov.wegen om dodelijke ongelijken te reduceren -biodiversiteit via satalietbeelden -monitoring diverse parameters per regio (groei, krimp, faillissement) -groendaken -verhuisbewegingen als gevolg covid -correlatie wonen en arbeidsplaatsen -droogte -niet gerealiseerde woningprojecten -impact covid op ov gebruik -impact klimaatmaatregelen op co2 -veranderende woon en werkbehoefte ivm bereikbaarheid

Michelangelo367 commented 3 years ago

@galamit86 @dkapitan How can I connect to google cloud platform and retrieve table information (like select * from ...) with Python?

galamit86 commented 3 years ago

I noticed that there are duplicates in table cbs_v4_83765NED.83765NED_Observations See select query: Output: row 1-8 image

@Michelangelo367 @galamit86 I think we may have duplicate data in dataverbinders-cbs-dl because this query does yield only 2 rows as a result:

SELECT * FROM `dataverbinders-dev.cbs_v4_83765NED.83765NED_Observations`
where Measure in ('3000', '4000') and WijkenEnBuurten = 'GM0363'

That's true. There was a bug in the code relating to V4 datasets, that I fixed a while ago, but I haven't force-pushed all v4 datasets a second time. I just manually pushed 83765NED, so it's ok. I will rerun all V4 soon.

galamit86 commented 3 years ago

@Michelangelo367 - re your other question(s) - I am not 100% sure what you are looking for, and if it's one question or more. If you want to interact directly with the Parquet files stored on Google storage, you should look at the Google Storage API. If you want to interact with BQ (which I think is more likely), you should look at the Google BQ API. By using BQ as an intermediate, you can use SQL "on" the parquet files - our tables are linked to the parquet files.

If you need something else, let me know, maybe a call would help.

Michelangelo367 commented 3 years ago

@galamit86 I noticed that older year have sometimes a different output table. Do you know if cbs is changing old table structure (like 2016-2017) to the new table structure (format)(2018-2020) or not?

I had an example of Kerncijfers wijken en buurten: 2016/2017 - 83765NED, 83487NED 2018/2019/2020 - 84799NED, 84583NED, 84286NED

galamit86 commented 3 years ago

@galamit86 I noticed that older year have sometimes a different output table. Do you know if cbs is changing old table structure (like 2016-2017) to the new table structure (format)(2018-2020) or not?

I had an example of Kerncijfers wijken en buurten: 2016/2017 - 83765NED, 83487NED 2018/2019/2020 - 84799NED, 84583NED, 84286NED

I don't know, sorry..

Michelangelo367 commented 3 years ago

@Michelangelo367 - re your other question(s) - I am not 100% sure what you are looking for, and if it's one question or more. If you want to interact directly with the Parquet files stored on Google storage, you should look at the Google Storage API. If you want to interact with BQ (which I think is more likely), you should look at the Google BQ API. By using BQ as an intermediate, you can use SQL "on" the parquet files - our tables are linked to the parquet files.

If you need something else, let me know, maybe a call would help.

In BQ I used a select * from, but then I got an error because of parquet file.

I think it's better to schedule a call, so we can discuss some thing about how better to use Google Big Query. How about videocall tomorrow around 15:30?

galamit86 commented 3 years ago

I think it's better to schedule a call, so we can discuss some thing about how better to use Google Big Query. How about videocall tomorrow around 15:30?

Sure. I have another call at 16:00, so if you think a half hour is enough, that's fine - otherwise I'm available the rest of the day, so earlier or later is also fine.

Michelangelo367 commented 3 years ago

I think it's better to schedule a call, so we can discuss some thing about how better to use Google Big Query. How about videocall tomorrow around 15:30?

Sure. I have another call at 16:00, so if you think a half hour is enough, that's fine - otherwise I'm available the rest of the day, so earlier or later is also fine.

15:30 is great. Shall I schedule a meeting?