galamit86 opened 2 years ago
Update:
@eddyVintus - great summary, thanks!
I propose our next step could be writing a simple, general parser that translates IMBAGLV_Objecten-2.1.0.xsd into something like your bag_schemas.py. Any thoughts? Did you have something else in mind going forward?
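A minimal sketch of what such a parser could start from, using only the standard library's `ElementTree` (the type mapping and the sample schema below are my assumptions, not the contents of IMBAGLV_Objecten-2.1.0.xsd):

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Assumed mapping from common XSD simple types to Spark SQL type names.
XSD_TO_SPARK = {
    "xs:string": "string",
    "xs:integer": "long",
    "xs:int": "integer",
    "xs:boolean": "boolean",
    "xs:date": "date",
    "xs:dateTime": "timestamp",
}

def parse_xsd(xsd_text: str) -> dict:
    """Translate <xs:element> declarations into a {name: spark_type} dict."""
    root = ET.fromstring(xsd_text)
    fields = {}
    for el in root.iter(f"{XS}element"):
        name, xsd_type = el.get("name"), el.get("type")
        if name and xsd_type:
            # Fall back to string for types we don't map yet.
            fields[name] = XSD_TO_SPARK.get(xsd_type, "string")
    return fields

if __name__ == "__main__":
    sample = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="identificatie" type="xs:string"/>
      <xs:element name="oppervlakte" type="xs:integer"/>
    </xs:schema>"""
    print(parse_xsd(sample))  # {'identificatie': 'string', 'oppervlakte': 'long'}
```

A real version would also need to walk complex types and nested sequences, but this shape would slot into something like bag_schemas.py.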
@JeremyVintus @dkapitan - Any input?
--EDIT--
Considering the option to natively load XML using Spark - is this still relevant? Is the schema output ready for BigQuery, or is a translation needed?
@galamit86 @eddyVintus I would give the native spark.xml parser a try and load the data as-is into gbq with the least amount of code.
I was surprised to see it work. If we can get that up and running, then we can ingest as-is and focus on downstream integration/data enrichment etc.
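A sketch of what that native load could look like with the Databricks spark-xml package (the row tag, paths, table name, and package version are assumptions, not settings from this repo):

```python
def xml_read_options(row_tag: str) -> dict:
    """Options for the spark-xml reader; rowTag selects the repeated element."""
    return {"rowTag": row_tag}

def load_bag_xml(spark, path: str, row_tag: str):
    """Read BAG XML into a DataFrame with spark-xml.

    Requires launching Spark with the package on the classpath, e.g.
    --packages com.databricks:spark-xml_2.12:0.14.0
    """
    reader = spark.read.format("com.databricks.spark.xml")
    for key, value in xml_read_options(row_tag).items():
        reader = reader.option(key, value)
    return reader.load(path)

if __name__ == "__main__":
    from pyspark.sql import SparkSession  # imported here: pyspark is optional

    spark = SparkSession.builder.appName("bag-xml").getOrCreate()
    # Hypothetical GCS path and row tag for the Pand objects.
    df = load_bag_xml(spark, "gs://bag-bucket/*/9999PND*.xml", "Objecten:Pand")
    # Writing straight to gbq needs the spark-bigquery connector as well.
    df.write.format("bigquery").option("table", "bag.pand").save()
```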
Update:
Using the bag2gcs.py file, we can now register a flow that stores the BAG objects as .parquet files on GCS.
One issue remains: for some reason the flow fails to run to completion on Prefect version 1.0.0, but it does work on version 0.15.13.
Created Register and Run for both flows
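Given that version sensitivity, pinning Prefect below 1.0.0 in the project requirements (file name is an assumption on my part) keeps both flows runnable until the 1.0.0 issue is tracked down:

```
# requirements.txt - pin until the flow also works on Prefect >= 1.0.0
prefect==0.15.13
```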
Update:
Added a new prefect flow (and files it depends on).
We can now create a Dataproc cluster from within the flow, and submit a PySpark job. Therefore, most work is now done on Dataproc, while still being managed by Prefect.
I have not had the chance to do a full load yet, but partial loads were successful, so it seems like it should work. (knock on wood)
Running the flow requires the two key-value pairs in the KV store on Prefect Cloud, and a secret storing the GCP credentials. (I will add examples to the README file later)
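For the cluster-creation step, here is a minimal sketch with the google-cloud-dataproc client (the cluster name, machine types, region, and project/bucket names are all assumptions, not the flow's real settings):

```python
def dataproc_cluster_config(project: str, bucket: str) -> dict:
    """Minimal spec for a short-lived BAG processing cluster.

    Machine types and worker count are placeholder assumptions.
    """
    return {
        "project_id": project,
        "cluster_name": "bag-ephemeral",
        "config": {
            "config_bucket": bucket,
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        },
    }

if __name__ == "__main__":
    # Requires google-cloud-dataproc and GCP credentials
    # (e.g. loaded from the Prefect secret mentioned above).
    from google.cloud import dataproc_v1

    region = "europe-west4"  # assumption
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    op = client.create_cluster(
        request={
            "project_id": "my-project",
            "region": region,
            "cluster": dataproc_cluster_config("my-project", "my-staging-bucket"),
        }
    )
    op.result()  # block until the cluster is up, then submit the PySpark job
```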
This single extract of the BAG registry should be uploaded to BigQuery once; it will serve as the basis for recurring processing of mutations.
Eddy's Code here could possibly be reused.
Todo (rough)
~- [ ] Load a single row~
~- [ ] Load a single file~
~- [ ] load a folder~
~- [ ] load a different folder~
~- [ ] load all folders~