GoogleCloudDataproc / spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
Apache License 2.0
374 stars, 196 forks

Support OpenLineage in spark-3.x-bigquery connectors #1212

Closed (codelixir closed 5 months ago)

codelixir commented 5 months ago

  1. Add OpenLineage properties to the Spark31BigQueryTable class.
  2. Add BigQueryRelationProvider as an abstract class in the v2 module, to be extended by BaseBigQuerySource (the parent class of all the Spark BigQuery table provider classes).

vishalkarve15 commented 5 months ago

/gcbrun

vishalkarve15 commented 5 months ago

/gcbrun

codelixir commented 5 months ago

I have moved the logic to the common module, as discussed, so that both dsv1 and dsv2 connectors call the same method internally.

vishalkarve15 commented 5 months ago

/gcbrun

vishalkarve15 commented 5 months ago

/gcbrun

ddebowczyk92 commented 5 months ago

Hey @codelixir, thank you for your contribution! We appreciate your effort. Have you thought about leveraging the spark-interfaces-scala package for generating metadata for OpenLineage events? This package is designed to facilitate the transition of lineage extraction ownership to the Spark extension owners. You can find more information about it here. Thanks once again for your contribution!

davidrabinowitz commented 5 months ago

Hi @ddebowczyk92, thanks for the input! We try to keep the DataSource v2 connectors Scala-agnostic to simplify usage for customers, given the incompatibility between Scala 2.12 and 2.13. Once this PR is done, we can think about how to incorporate the interface into the connector.

vishalkarve15 commented 5 months ago

/gcbrun

davidrabinowitz commented 5 months ago

/gcbrun

davidrabinowitz commented 5 months ago

/gcbrun