galamit86 opened 3 years ago
I think the Statline data has been transformed with Python before landing in the Google Cloud tables (dataverbinders-cbs-dl), so we are missing some original information that is present in the CSV downloads from CBS Statline. Or is dataverbinders-cbs-dl not the raw import data?
dataverbinders-cbs-dl: cbs_v4_83765NED

If I compare the Statline CSV screenshots cbs_statline_metadata_csv..png, cbs_statline_untyped_203629_csv..png and cbs_statline_typed_203639_csv..png (downloads: metadata, unprocessed dataset filtered on Amsterdam, dataset for graphical display filtered on Amsterdam; unprocessed dataset selection: Topics: All (108), Wijken en buurten: Selection (579 / 16667), number of selected cells: 62,532; https://opendata.cbs.nl/statline/portal.html?_la=nl&_catalog=CBS&tableId=83765NED&_theme=235)
with the Google Cloud screenshots:
google_cloud_83765NED_MeasureGroups..png, google_cloud_83765NED_Observations.png, google_cloud_83765NED_WijkenenBuurtenCodes.png, google_cloud_83765NED_WijkenenBuurtenGroups.png
It would be beneficial to add the missing fields (marked red in the screenshot "missing metadata fields in google cloud"), so we can make (sub)groups available as selection criteria for visualization reports!
To do: still need to calculate the summary for, for instance, Mannen_6 in combination with WijkenEnBuurten and check whether the same number appears in Google Cloud. The total summary numbers for Mannen_6 in the typed and untyped downloads are the same.
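A minimal sketch of that check against BigQuery, assuming the Observations table follows the CBS OData v4 layout with a `Value` column next to `Measure` and `WijkenEnBuurten` (the `Value` column name is an assumption), and that measure '3000' corresponds to Mannen as in the MeasureCodes snippet further down:

```python
from google.cloud import bigquery

# Assumes application-default credentials with access to the dataverbinders-dev project.
client = bigquery.Client(project="dataverbinders-dev")

# Sum the Mannen measure (identifier '3000' in MeasureCodes, see further down) per region
# code, so the totals can be compared with the Statline CSV download.
# The `Value` column name is assumed from the CBS OData v4 Observations layout.
query = """
    SELECT WijkenEnBuurten, SUM(Value) AS total_mannen
    FROM `dataverbinders-dev.cbs_v4_83765NED.83765NED_Observations`
    WHERE Measure = '3000'
    GROUP BY WijkenEnBuurten
    ORDER BY WijkenEnBuurten
"""

for row in client.query(query).result():
    print(row.WijkenEnBuurten, row.total_mannen)
```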
Action: in the CSV files I noticed the column soortregio_2 (Gemeente, Wijk, Buurt). I guess it is derived from the first two letters of WijkenEnBuurten (GM, WK, BU), but perhaps there is a separate lookup/reference table available. Otherwise we need to add the code (GM, WK, BU) with a description to the table 83765NED_WijkenenBuurtenCodes.
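If there is no separate lookup table, a tiny sketch of deriving the region type from the code prefix (the GM/WK/BU mapping below is my assumption, based on the CSV values Gemeente/Wijk/Buurt):

```python
# Hypothetical helper: derive "soortregio" from a WijkenEnBuurten code prefix.
# Assumes codes look like 'GM0363', 'WK036300', 'BU03630000'.
REGIO_TYPES = {
    "GM": "Gemeente",
    "WK": "Wijk",
    "BU": "Buurt",
}

def soort_regio(wijken_en_buurten_code: str) -> str:
    """Return the region type for a WijkenEnBuurten code, e.g. 'GM0363' -> 'Gemeente'."""
    return REGIO_TYPES.get(wijken_en_buurten_code[:2].upper(), "Onbekend")

print(soort_regio("GM0363"))    # Gemeente
print(soort_regio("WK036300"))  # Wijk
```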
The item Mannen_6 is trimmed to Mannen, and the ID 6 is not available in Google Cloud Platform. Mannen (6) and Vrouwen (7) are a sublist of Geslacht (8) (TopicGroup).
Action: what we need to do is add the hierarchy id (TopicGroup) to MeasureGroups and add the missing hierarchy ids to MeasureCodes, or create a separate hierarchy table. We also need to check how the reports perform.
I guess this issue with hierarchy structure and description is present in other tables as well!
I see that there is a hierarchy in the parquet files MeasureGroups and MeasureCodes, but it's not the CSV numbering. So we can still make the relationship! No action needed!!
I see the T001038 MeasureGroupID in the 83765NED_MeasureCodes as well. T001038 is the link to ****
83765NED_MeasureGroups:

```json
{
  "Id": "T001038",
  "Index": "8",
  "Title": "Geslacht",
  "Description": "",
  "ParentId": "M000352"
}
```

83765NED_MeasureCodes:

```json
[
  {
    "Identifier": "3000",
    "Index": "9",
    "Title": "Mannen",
    "Description": "",
    "MeasureGroupId": "T001038",
    "DataType": "Long",
    "Unit": "aantal",
    "Format": null,
    "Decimals": "0",
    "PresentationType": "Absolute"
  },
  {
    "Identifier": "4000",
    "Index": "10",
    "Title": "Vrouwen",
    "Description": "",
    "MeasureGroupId": "T001038",
    "DataType": "Long",
    "Unit": "aantal",
    "Format": null,
    "Decimals": "0",
    "PresentationType": "Absolute"
  }
]
```
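A sketch of how the relationship could be made in BigQuery via `MeasureGroupId`, assuming the MeasureCodes and MeasureGroups parquet files are linked as tables in the same `dataverbinders-dev.cbs_v4_83765NED` dataset as the Observations table:

```python
from google.cloud import bigquery

# Assumes the MeasureCodes/MeasureGroups parquet files are linked as BigQuery tables
# in the same dataset as the Observations table (table names as in the screenshots above).
client = bigquery.Client(project="dataverbinders-dev")

query = """
    SELECT
      c.Identifier,
      c.Title    AS measure_title,
      g.Id       AS group_id,
      g.Title    AS group_title,
      g.ParentId AS parent_group_id
    FROM `dataverbinders-dev.cbs_v4_83765NED.83765NED_MeasureCodes`  AS c
    JOIN `dataverbinders-dev.cbs_v4_83765NED.83765NED_MeasureGroups` AS g
      ON c.MeasureGroupId = g.Id
    WHERE g.Id = 'T001038'  -- Geslacht: should return Mannen (3000) and Vrouwen (4000)
"""

for row in client.query(query).result():
    print(row.Identifier, row.measure_title, row.group_title, row.parent_group_id)
```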
@Michelangelo367 let's try to meet online this week to get to know each other and discuss this project. I have just sent you a LinkedIn request.
I noticed that there are duplicates in table cbs_v4_83765NED.83765NED_Observations. See the select query (output: rows 1-8).
@galamit86 Do you know how to get similar website search results (8) and also get the table information with a Python script? Is the table information for, for instance, 83005NED stored as 83005NED_DataProperties (source format PARQUET)? Is it possible to select the parquet data with SQL?
I already found the standard Python scripts for CBS v3 and v4.
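For reference, a minimal sketch of fetching the table catalogue and the DataProperties of one table from the public CBS OData v3 feed with `requests` (the search term and table id are only examples, and the endpoints are the standard public ones as far as I know):

```python
import requests

# CBS OData v3 catalogue: list all Statline tables, then filter locally on the title.
# Assumption: the standard public CBS endpoints below, no authentication needed.
CATALOG_URL = "https://opendata.cbs.nl/ODataCatalog/Tables?$format=json"

tables = requests.get(CATALOG_URL, timeout=30).json()["value"]
hits = [t for t in tables if "wijken en buurten" in t["Title"].lower()]
for t in hits:
    print(t["Identifier"], t["Title"])

# Table metadata ("DataProperties") for one table, e.g. 83005NED.
props_url = "https://opendata.cbs.nl/ODataFeed/odata/83005NED/DataProperties?$format=json"
properties = requests.get(props_url, timeout=30).json()["value"]
for p in properties[:10]:
    print(p["Type"], p.get("Key"), p.get("Title"))
```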
@Michelangelo367 @galamit86
I think we may have duplicate data in dataverbinders-cbs-dl, because this query should yield only 2 rows as a result:
```sql
SELECT *
FROM `dataverbinders-dev.cbs_v4_83765NED.83765NED_Observations`
WHERE Measure IN ('3000', '4000') AND WijkenEnBuurten = 'GM0363'
```
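A sketch of a query that counts the duplicates directly, assuming that in this table a `Measure`/`WijkenEnBuurten` pair should be unique (i.e. there are no further dimension columns such as Perioden):

```python
from google.cloud import bigquery

# Count how many (Measure, WijkenEnBuurten) combinations occur more than once.
# Assumption: a Measure/WijkenEnBuurten pair should be unique in this table.
client = bigquery.Client(project="dataverbinders-dev")

query = """
    SELECT Measure, WijkenEnBuurten, COUNT(*) AS n
    FROM `dataverbinders-dev.cbs_v4_83765NED.83765NED_Observations`
    GROUP BY Measure, WijkenEnBuurten
    HAVING COUNT(*) > 1
    ORDER BY n DESC
    LIMIT 20
"""

for row in client.query(query).result():
    print(row.Measure, row.WijkenEnBuurten, row.n)
```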
@galamit86 @dkapitan Daniel mentioned the topics mantelzorg (informal care) and huisarts (general practitioner) today;
I am also thinking about topics that are currently under discussion:
1) understanding how living, working, learning and travel patterns change as a result of COVID
2) what the structural effects of COVID are on nature development (nitrogen, CO2, particulate matter, etc.)
3) gaining insight into the composition of the agricultural sector, aimed at making agriculture more sustainable
4) for swimming water locations, predicting how the water quality will develop, so that samples can be taken in a more targeted way
Should I first focus on developing a data model for mantelzorg and huisarts?
Other topics are:
- influencing driving behaviour on provincial roads to reduce fatal accidents
- biodiversity via satellite images
- monitoring various parameters per region (growth, shrinkage, bankruptcies)
- green roofs
- relocation movements as a result of COVID
- correlation between housing and jobs
- drought
- housing projects that were not realised
- impact of COVID on public transport use
- impact of climate measures on CO2
- changing housing and working needs in relation to accessibility
@galamit86 @dkapitan How can I connect to Google Cloud Platform and retrieve table information (like `SELECT * FROM ...`) with Python?
> I think we may have duplicate data in dataverbinders-cbs-dl, because this query should yield only 2 rows as a result: SELECT * FROM `dataverbinders-dev.cbs_v4_83765NED.83765NED_Observations` WHERE Measure IN ('3000', '4000') AND WijkenEnBuurten = 'GM0363'
That's true. There was a bug in the code relating to V4 datasets that I fixed a while ago, but I haven't force-pushed all V4 datasets a second time. I just manually pushed 83765NED, so it's OK now. I will rerun all V4 soon.
@Michelangelo367 - re your other question(s) - I am not 100% sure what you are looking for, or whether it's one question or more. If you want to interact directly with the parquet files stored on Google Storage, you should look at the Google Storage API. If you want to interact with BQ (which I think is more likely), you should look at the Google BQ API. By using BQ as an intermediary, you can use SQL "on" the parquet files - our tables are linked to the parquet files.
If you need something else, let me know, maybe a call would help.
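For example, a minimal sketch with the BigQuery Python client (assuming `google-cloud-bigquery` is installed and application-default credentials are set up for the `dataverbinders-dev` project):

```python
from google.cloud import bigquery

# Assumes `pip install google-cloud-bigquery` and application-default credentials,
# e.g. via `gcloud auth application-default login`.
client = bigquery.Client(project="dataverbinders-dev")

# List the tables in a dataset (the "table information").
for table in client.list_tables("dataverbinders-dev.cbs_v4_83765NED"):
    print(table.table_id)

# Run SQL directly; the BigQuery tables are linked to the parquet files.
query = """
    SELECT *
    FROM `dataverbinders-dev.cbs_v4_83765NED.83765NED_Observations`
    LIMIT 10
"""
for row in client.query(query).result():
    print(dict(row))
```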
@galamit86 I noticed that older years sometimes have a different output table. Do you know whether CBS is converting the old table structure (like 2016-2017) to the new table format (2018-2020) or not?
I had an example with Kerncijfers wijken en buurten: 2016/2017 - 83765NED, 83487NED; 2018/2019/2020 - 84799NED, 84583NED, 84286NED.
I don't know, sorry.
In BQ I used a SELECT * FROM, but then I got an error because of the parquet file.
I think it's better to schedule a call, so we can discuss how to make better use of Google BigQuery. How about a video call tomorrow around 15:30?
Sure. I have another call at 16:00, so if you think a half hour is enough, that's fine - otherwise I'm available the rest of the day, so earlier or later is also fine.
15:30 is great. Shall I schedule a meeting?
@dkapitan - I think this is a good place for you to step in.
See @Michelangelo367's ideas above - I am guessing you have some thoughts and could perhaps assist with some guidance or focus?