Overall: we need to think about how to set the proper data types in the Parquet files (they are currently inferred automatically).
When converting from JSON to a pyarrow Table, ParseOptions can be used to provide a specific schema (the explicit_schema argument of pyarrow.json.ParseOptions). That is where we could provide it, and I assume it will be kept when the table is written to Parquet. How to determine the schema should still be investigated. Perhaps the CBS data provides it somewhere.
Overall: we need to do some proper scaffolding with a configuration file (.toml) where we set:
the temp directory on the host machine
the GCP project parameters
Furthermore, since we are incorporating GCS, we can skip loading the data as-is into BigQuery and use external tables instead.
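One possible way to define such an external table is a BigQuery DDL statement that points at the Parquet files on GCS. The helper below only builds the statement as a string. The function name and the project, dataset, table, and bucket values are placeholders, and actually running the statement (for example through the BigQuery client or console) is left out.

```python
def external_table_ddl(project: str, dataset: str, table: str, uris: list[str]) -> str:
    """Build a CREATE EXTERNAL TABLE statement for Parquet files on GCS."""
    uri_list = ", ".join(f"'{uri}'" for uri in uris)
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{project}.{dataset}.{table}`\n"
        f"OPTIONS (format = 'PARQUET', uris = [{uri_list}])"
    )


# Placeholder names; none of these refer to an existing project or bucket.
ddl = external_table_ddl(
    project="my-gcp-project",
    dataset="cbs_v3",
    table="example_table",
    uris=["gs://my-cbs-bucket/v3/example/*.parquet"],
)
print(ddl)
```

With Parquet as the source format, BigQuery reads the column types from the files themselves, so no column list is needed in the statement.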
So all in all, with all these major upgrades and changes, I am thinking that it is perhaps even better to define a new, separate project, cbs-bq, with the following outline:
Prerequisites: a GCP account with at least one GCS bucket and one project with BigQuery enabled. These settings go in a config file.
Input: a list of v3 and/or v4 datasets, which can go in a .toml configuration file.
Options:
keep the older version of a dataset on GCS when downloading a new one (default=True)
Output after running the command-line app:
GCS is filled with Parquet files, organized by API-version/dataset-id/date
All Parquet files are queryable as external tables in BigQuery, with separate BigQuery datasets for v3 and v4 (for clarity)
Table descriptions are added (a feature that Eddy has already written, although he struggled with it: because the data was loaded asynchronously, it was hard to know when the descriptions could be added. We don't have that problem now.)
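To make the storage layout and the keep-older-version option in the outline above a bit more concrete, here is a small sketch. All function names, the ISO date format, the `cbs_v3`/`cbs_v4` dataset names, and the example dataset id are assumptions for illustration only.

```python
from datetime import date


def parquet_prefix(api_version: str, dataset_id: str, download_date: date) -> str:
    """Object prefix following the API-version/dataset-id/date layout."""
    if api_version not in ("v3", "v4"):
        raise ValueError(f"unknown API version: {api_version}")
    return f"{api_version}/{dataset_id}/{download_date.isoformat()}"


def bq_dataset_name(api_version: str) -> str:
    """Separate BigQuery dataset per API version (naming is an assumption)."""
    return f"cbs_{api_version}"


def prefixes_to_delete(existing: list[str], new_prefix: str, keep_old: bool = True) -> list[str]:
    """Older prefixes of the same dataset to remove; empty when keep_old is True."""
    if keep_old:
        return []
    dataset_root = new_prefix.rsplit("/", 1)[0] + "/"
    return [p for p in existing if p.startswith(dataset_root) and p != new_prefix]


# Illustrative dataset id and dates.
existing = ["v3/83583NED/2020-12-01", "v3/other/2020-12-01"]
new_prefix = parquet_prefix("v3", "83583NED", date(2021, 1, 15))
print(new_prefix, bq_dataset_name("v3"))
print(prefixes_to_delete(existing, new_prefix, keep_old=False))
```

Keeping the date as the last path component means an external table can point at a single download, while older downloads stay available next to it when the option is left at its default.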