galamit86 opened 2 years ago
Update:
@eddyVintus - great summary, thanks!
I propose our next step could be writing a simple, general parser that translates IMBAGLV_Objecten-2.1.0.xsd into something like your bag_schemas.py. Any thoughts? Did you have something else in mind going forward?
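A minimal sketch of what such a parser could start from, using only the standard library's `ElementTree` (the type mapping and the sample schema below are my assumptions, not the contents of IMBAGLV_Objecten-2.1.0.xsd):

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Assumed mapping from common XSD simple types to Spark SQL type names.
XSD_TO_SPARK = {
    "xs:string": "string",
    "xs:integer": "long",
    "xs:int": "integer",
    "xs:boolean": "boolean",
    "xs:date": "date",
    "xs:dateTime": "timestamp",
}

def parse_xsd(xsd_text: str) -> dict:
    """Translate <xs:element> declarations into a {name: spark_type} dict."""
    root = ET.fromstring(xsd_text)
    fields = {}
    for el in root.iter(f"{XS}element"):
        name, xsd_type = el.get("name"), el.get("type")
        if name and xsd_type:
            # Fall back to string for types we don't map yet.
            fields[name] = XSD_TO_SPARK.get(xsd_type, "string")
    return fields

if __name__ == "__main__":
    sample = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="identificatie" type="xs:string"/>
      <xs:element name="oppervlakte" type="xs:integer"/>
    </xs:schema>"""
    print(parse_xsd(sample))  # {'identificatie': 'string', 'oppervlakte': 'long'}
```

A real version would also need to walk complex types and nested sequences, but this shape would slot into something like bag_schemas.py.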
@JeremyVintus @dkapitan - Any input?
--EDIT--
Considering the option to natively load XML using Spark - is this still relevant? Is the schema output ready for BigQuery, or is a translation needed?
@galamit86 @eddyVintus I would give the native spark.xml parser a try and load the data as-is into gbq with the least amount of code.
I was surprised to see it work. If we can get that up and running, then we can ingest as-is and focus on downstream integration/data enrichment etc.
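A sketch of what that native load could look like with the Databricks spark-xml package (the row tag, paths, table name, and package version are assumptions, not settings from this repo):

```python
def xml_read_options(row_tag: str) -> dict:
    """Options for the spark-xml reader; rowTag selects the repeated element."""
    return {"rowTag": row_tag}

def load_bag_xml(spark, path: str, row_tag: str):
    """Read BAG XML into a DataFrame with spark-xml.

    Requires launching Spark with the package on the classpath, e.g.
    --packages com.databricks:spark-xml_2.12:0.14.0
    """
    reader = spark.read.format("com.databricks.spark.xml")
    for key, value in xml_read_options(row_tag).items():
        reader = reader.option(key, value)
    return reader.load(path)

if __name__ == "__main__":
    from pyspark.sql import SparkSession  # imported here: pyspark is optional

    spark = SparkSession.builder.appName("bag-xml").getOrCreate()
    # Hypothetical GCS path and row tag for the Pand objects.
    df = load_bag_xml(spark, "gs://bag-bucket/*/9999PND*.xml", "Objecten:Pand")
    # Writing straight to gbq needs the spark-bigquery connector as well.
    df.write.format("bigquery").option("table", "bag.pand").save()
```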
Update:
Using the bag2gcs.py file, we can now register a flow that stores the BAG objects as .parquet files on GCS.
One issue remains: for some reason the flow fails to run to completion on Prefect version 1.0.0, but it does work on version 0.15.13.
Created Register and Run for both flows
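Given that version sensitivity, pinning Prefect below 1.0.0 in the project requirements (file name is an assumption on my part) keeps both flows runnable until the 1.0.0 issue is tracked down:

```
# requirements.txt - pin until the flow also works on Prefect >= 1.0.0
prefect==0.15.13
```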
Update:
Added a new prefect flow (and files it depends on).
We can now create a Dataproc cluster from within the flow, and submit a PySpark job. Therefore, most work is now done on Dataproc, while still being managed by Prefect.
I have not had the chance to do a full load yet, but partial loads were successful, so it seems like it should work. (knock on wood)
Running the flow requires the two key-value pairs in the KV store on Prefect Cloud, and a secret storing the GCP credentials. (I will add examples to the README file later)
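For the cluster-creation step, here is a minimal sketch with the google-cloud-dataproc client (the cluster name, machine types, region, and project/bucket names are all assumptions, not the flow's real settings):

```python
def dataproc_cluster_config(project: str, bucket: str) -> dict:
    """Minimal spec for a short-lived BAG processing cluster.

    Machine types and worker count are placeholder assumptions.
    """
    return {
        "project_id": project,
        "cluster_name": "bag-ephemeral",
        "config": {
            "config_bucket": bucket,
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        },
    }

if __name__ == "__main__":
    # Requires google-cloud-dataproc and GCP credentials
    # (e.g. loaded from the Prefect secret mentioned above).
    from google.cloud import dataproc_v1

    region = "europe-west4"  # assumption
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    op = client.create_cluster(
        request={
            "project_id": "my-project",
            "region": region,
            "cluster": dataproc_cluster_config("my-project", "my-staging-bucket"),
        }
    )
    op.result()  # block until the cluster is up, then submit the PySpark job
```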
This single extract of the BAG registry should be uploaded to BigQuery once; it will serve as the basis for recurring processing of mutations.
Eddy's Code here could possibly be reused.
Todo (rough)
~- [ ] Load a single row~
~- [ ] Load a single file~
~- [ ] load a folder~
~- [ ] load a different folder~
~- [ ] load all folders~