GoogleCloudPlatform / dataproc-templates

Dataproc templates and pipelines for solving simple in-cloud data tasks

[Spike] [BQ Partitioning] Explore how xyz to BQ template(s) can allow Partitioning and Clustering #289

Open shashank-google opened 2 years ago

shashank-google commented 2 years ago

Option 1 - Test and verify: the user would manually create an empty table in BQ with partitioning and clustering, and the xyz to BQ template would then move data into it. What happens if the sequence of columns in the source (Avro / JDBC / Hive, etc.) does not match the existing table in BigQuery?

Option 2 - Explore: if the BQ table does not exist (or an overwrite flag is supplied), how can the template automatically determine clustering and partitioning? Look at the corresponding Dataflow templates for ideas.
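For Option 2, one possible direction is to forward partitioning and clustering settings to the spark-bigquery connector's write options, so the destination table is created with them when it does not already exist. The sketch below is illustrative only, not the templates' current code; the bucket, table, and column names are placeholders:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionedBQWriteSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("bq-partitioning-spike").getOrCreate();

    // Placeholder source; in the templates this would be the GCS / JDBC / Hive input.
    Dataset<Row> df = spark.read().parquet("gs://my-bucket/source/");

    df.write()
        .format("bigquery")
        .option("table", "my_project.my_dataset.my_table")  // placeholder target
        .option("temporaryGcsBucket", "my-temp-bucket")      // placeholder staging bucket
        // Connector options controlling partitioning and clustering of the
        // destination table when the connector creates it.
        .option("partitionField", "event_date")
        .option("partitionType", "DAY")
        .option("clusteredFields", "customer_id,country")
        .mode("append")
        .save();
  }
}
```

The remaining question for the spike would then be how the template derives partitionField, partitionType, and clusteredFields: either from user-supplied template arguments or, where feasible, inferred from the source schema.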

PoulamiR1994 commented 1 year ago

Tested the following templates for Option 1; the results are as follows (a sanity-check sketch follows this list):

  1. GCStoBQ template: when this template is run with the destination table and the source data (Parquet files) having a different column order, data insertion faces no issues. Other data formats are still being tested.
  2. JDBCtoBQ template: when this template is run with the destination table and the source table having a different column order, data insertion faces no issues.
  3. HIVEtoBQ template: testing in progress.
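One way to sanity-check these runs is sketched below, assuming the BigQuery Java client library; the helper itself and the project, dataset, and table arguments are hypothetical and not part of the templates. It compares the source DataFrame's column names against the pre-created table's schema as a set, ignoring column order:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.TableId;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ColumnNameCheck {
  // Returns true when the DataFrame columns and the BigQuery table columns
  // are the same set of names, regardless of their order.
  public static boolean sameColumnsIgnoringOrder(String[] dfColumns, String project,
                                                 String dataset, String table) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    Schema schema = bigquery.getTable(TableId.of(project, dataset, table))
        .getDefinition()
        .getSchema();

    Set<String> tableColumns = new HashSet<>();
    for (Field field : schema.getFields()) {
      tableColumns.add(field.getName());
    }
    return tableColumns.equals(new HashSet<>(Arrays.asList(dfColumns)));
  }
}
```

The observed behaviour is consistent with the connector resolving columns by name rather than by position, so a name-wise match, not the column sequence, appears to be what matters.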

ritika-neema commented 1 year ago

Testing is in progress for the HIVEtoBQ template. For part 2 of the description, we can utilise the partitioning and clustering attributes of the spark-bigquery connector. An additional check can be made on the partitioning field: we should only continue with partitioning when it is a date/datetime/similar field, given BigQuery's constraints.
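A minimal sketch of that check is below; the helper is hypothetical and not part of the templates, and it only inspects the Spark schema before the partition option would be forwarded to the connector:

```java
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class PartitionFieldCheck {
  // Returns true only when the candidate partition column has a type that
  // BigQuery time partitioning accepts (DATE or TIMESTAMP here; extend as needed).
  public static boolean canPartitionOn(StructType schema, String fieldName) {
    DataType type = schema.apply(fieldName).dataType();
    return type.equals(DataTypes.DateType) || type.equals(DataTypes.TimestampType);
  }
}
```

If the check fails, the template could either skip partitioning or fail fast with a clear error, rather than letting the BigQuery load reject the table definition.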

ritika-neema commented 1 year ago

Dependent on child issues #631 #632