dataverbinders / toepol

Making Dutch open data more easily accessible

BAG Full Extract V2 is loaded once into BigQuery #3

galamit86 opened 2 years ago

galamit86 commented 2 years ago

This full extract of the BAG registry should be uploaded a single time into BigQuery; it will serve as the basis for recurring processing of mutations.

Eddy's code here could possibly be reused.

Todo (rough)

~- [ ] Load a single row~
~- [ ] Load a single file~
~- [ ] Load a folder~
~- [ ] Load a different folder~
~- [ ] Load all folders~

ghost commented 2 years ago

Update:

galamit86 commented 2 years ago

@eddyVintus - great summary, thanks!

I propose that our next step be writing a simple, general parser that translates IMBAGLV_Objecten-2.1.0.xsd into something like your bag_schemas.py. Any thoughts? Did you have something else in mind going forward?

@JeremyVintus @dkapitan - Any input?

--EDIT--

Considering the option to natively load XML using Spark - is this still relevant? Is the schema output ready for BigQuery, or is a translation needed?
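A minimal sketch of what such an XSD-to-schema parser could look like, assuming bag_schemas.py maps field names to BigQuery column types (the actual file may be shaped differently, and the type table below is deliberately rough):

```python
# Sketch only: walk the XSD's xs:element declarations and map their
# declared types to BigQuery types. Assumes bag_schemas.py holds a
# {field_name: bigquery_type} mapping, which may not match the real file.
from lxml import etree

XS = "{http://www.w3.org/2001/XMLSchema}"

# Rough XSD -> BigQuery type table; the "xs:" prefix must match the one
# used in the schema document itself.
TYPE_MAP = {
    "xs:string": "STRING",
    "xs:date": "DATE",
    "xs:dateTime": "TIMESTAMP",
    "xs:integer": "INTEGER",
    "xs:boolean": "BOOLEAN",
}


def xsd_to_bq_fields(xsd_path: str) -> dict:
    """Collect named xs:element declarations and translate their types."""
    root = etree.parse(xsd_path).getroot()
    fields = {}
    for el in root.iter(f"{XS}element"):
        name, xsd_type = el.get("name"), el.get("type")
        if name and xsd_type:
            # Complex/unknown types fall back to STRING for now.
            fields[name] = TYPE_MAP.get(xsd_type, "STRING")
    return fields


if __name__ == "__main__":
    print(xsd_to_bq_fields("IMBAGLV_Objecten-2.1.0.xsd"))
```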

dkapitan commented 2 years ago

> @eddyVintus - great summary, thanks!
>
> I propose that our next step be writing a simple, general parser that translates IMBAGLV_Objecten-2.1.0.xsd into something like your bag_schemas.py. Any thoughts? Did you have something else in mind going forward?
>
> @JeremyVintus @dkapitan - Any input?
>
> --EDIT--
>
> Considering the option to natively load XML using Spark - is this still relevant? Is the schema output ready for BigQuery, or is a translation needed?

@galamit86 @eddyVintus I would give the native spark.xml parser a try and load the data as-is into gbq with the least amount of code.

I was surprised to see it work. If we can get that up and running, then we can ingest as-is and focus on downstream integration, data enrichment, etc.
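To make that concrete, a minimal sketch of the spark-xml route (not the project's actual code): the jar coordinates are published artifacts, but the rowTag, bucket, and table names are placeholders.

```python
# Sketch of "load XML as-is into BigQuery" with spark-xml and the
# spark-bigquery connector; all names below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("bag-xml-to-bq")
    # spark-xml and the BigQuery connector need to be on the classpath.
    .config(
        "spark.jars.packages",
        "com.databricks:spark-xml_2.12:0.14.0,"
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.24.2",
    )
    .getOrCreate()
)

# rowTag depends on the BAG object type; "Pand" is just an example.
df = (
    spark.read.format("xml")
    .option("rowTag", "Pand")
    .load("gs://some-bucket/bag/*.xml")
)

# Write with the schema exactly as spark-xml inferred it; BigQuery
# accepts the nested STRUCT/ARRAY fields directly.
(
    df.write.format("bigquery")
    .option("temporaryGcsBucket", "some-staging-bucket")
    .mode("overwrite")
    .save("some-project.bag.pand")
)
```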

ghost commented 2 years ago

Update:

MarcZoon commented 2 years ago

Using the bag2gcs.py file, we can now register a flow that stores the BAG objects as .parquet files on GCS.

One issue remains: for some reason the flow is unable to run to completion on Prefect version 1.0.0, but it does work on version 0.15.13.
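A rough sketch of what a flow in that style could look like on Prefect 0.15.x; the task split, file names, and bucket are illustrative, not the actual contents of bag2gcs.py.

```python
# Illustrative Prefect 0.15.x flow: parse BAG XML, write parquet to GCS.
import pandas as pd
from prefect import Flow, task


@task
def parse_bag_objects(xml_path: str) -> pd.DataFrame:
    # Stand-in for the real BAG XML parsing; returns a dummy frame here.
    return pd.DataFrame({"identificatie": ["0000100000000001"]})


@task
def write_parquet_to_gcs(df: pd.DataFrame, gcs_uri: str) -> None:
    # pandas writes straight to GCS when gcsfs and pyarrow are installed.
    df.to_parquet(gcs_uri, index=False)


with Flow("bag2gcs") as flow:
    objects = parse_bag_objects("9999PND08012021.xml")
    write_parquet_to_gcs(objects, "gs://some-bucket/bag/pand.parquet")

flow.register(project_name="toepol")
```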

ghost commented 2 years ago

Created Register and Run for both flows

MarcZoon commented 2 years ago

Update:

Added a new Prefect flow (and the files it depends on).

We can now create a Dataproc cluster from within the flow and submit a PySpark job to it. As a result, most of the work is now done on Dataproc, while still being orchestrated by Prefect.

I have not had the chance to do a full load yet, but partial loads were successful, so it seems like it should work. (knock on wood)
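For reference, a minimal sketch of that pattern using the google-cloud-dataproc client from Prefect tasks; the project, region, cluster config, and job URI are all assumptions.

```python
# Sketch: create an ephemeral Dataproc cluster and submit a PySpark job
# from within a Prefect flow. Names and config values are placeholders.
from google.cloud import dataproc_v1
from prefect import Flow, task

PROJECT, REGION = "some-project", "europe-west4"
ENDPOINT = {"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}


@task
def create_cluster(name: str) -> str:
    client = dataproc_v1.ClusterControllerClient(client_options=ENDPOINT)
    cluster = {
        "project_id": PROJECT,
        "cluster_name": name,
        "config": {"worker_config": {"num_instances": 2}},
    }
    client.create_cluster(
        request={"project_id": PROJECT, "region": REGION, "cluster": cluster}
    ).result()  # block until the cluster is up
    return name


@task
def submit_pyspark_job(cluster_name: str) -> None:
    client = dataproc_v1.JobControllerClient(client_options=ENDPOINT)
    job = {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {"main_python_file_uri": "gs://some-bucket/jobs/bag_job.py"},
    }
    client.submit_job_as_operation(
        request={"project_id": PROJECT, "region": REGION, "job": job}
    ).result()  # block until the job finishes


with Flow("bag-dataproc") as flow:
    submit_pyspark_job(create_cluster("bag-ephemeral"))
```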

Running the flow requires two key-value pairs in the KV store on Prefect Cloud, and a Secret to store the GCP credentials. (I will add examples to the README later.)
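A sketch of how the flow might read that configuration (the key and secret names below are examples, not the actual ones):

```python
# Example of pulling config from Prefect Cloud; names are assumed.
from prefect.backend import get_key_value
from prefect.client import Secret

# The two key-value pairs from the Prefect Cloud KV store (names assumed).
gcp_project = get_key_value("gcp_project")
gcs_bucket = get_key_value("gcs_temp_bucket")

# GCP service-account credentials stored as a Prefect Secret (name assumed).
gcp_credentials = Secret("GCP_CREDENTIALS").get()
```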