apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Problem while reading from BQ tables that are synced from a Hudi table #9355

Open ranjanankur opened 11 months ago

ranjanankur commented 11 months ago

Problem while reading from BQ tables that are synced from a Hudi table

We have a streaming application that receives data from Kafka. We apply transformations using Spark Structured Streaming and then load the data into Hudi tables (in a GCS bucket). After creating the Hudi tables, we sync them with GCP BigQuery using the BigQuerySync tool.

Now we are facing a very weird problem, and if our understanding of this bug is correct, we might have to remove Hudi from our streaming use case.

Let me explain the problem.

The BigQuery sync tool internally creates two BQ tables and one view:

  1. Manifest table
  2. Version table
  3. BQ view - this contains the latest upserted data

To maintain the latest upserts, Hudi creates a latest-snapshot.csv file at gs:////.hoodie/manifest/latest-snapshot.csv.

After every micro-batch, Hudi replaces this file: it deletes it and recreates it. The batch duration of our streaming pipeline is 5 minutes.

So after 5 minutes, a new file is created with the same name.

Now the problem: suppose I query the BQ view with a join or some other operation involving other tables, and it takes more than 5 minutes to run. Our BQ job then throws an error, because latest-snapshot.csv is changed while the query runs.

Error - Not found .hoodie/manifest/latest-snapshot.csv (Version 1690978310289362)

To Reproduce

Steps to reproduce the behavior:

  1. Create a streaming pipeline
  2. Load data into a Hudi table
  3. Sync the Hudi table with BigQuery using the BigQuerySync tool
  4. Fire a query that uses the BQ view created by BigQuerySync; the query should take longer than the batch duration of the streaming pipeline (a sketch of steps 1-3 follows below)
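
For context, a minimal sketch of steps 1-3 (the broker, topic, field, and path names here are illustrative stand-ins, not the actual job):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("hudi-bq-repro").getOrCreate()

// Steps 1-2: Kafka source -> transform -> Hudi table on GCS, 5 minute batches.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // illustrative
  .option("subscribe", "events")                    // illustrative
  .load()
  .selectExpr(
    "CAST(key AS STRING) AS id",
    "CAST(value AS STRING) AS payload",
    "timestamp AS ts",
    "date_format(timestamp, 'yyyy-MM-dd') AS dt")
  .writeStream
  .format("hudi")
  .option("hoodie.table.name", "events_hudi")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("checkpointLocation", "gs://my-bucket/checkpoints/events_hudi") // illustrative
  .trigger(Trigger.ProcessingTime("5 minutes")) // the 5 minute batch duration
  .start("gs://my-bucket/tables/events_hudi")   // illustrative

// Step 3: between batches, sync to BigQuery with the BigQuerySyncTool (see the
// getBigQueryProps snippet later in this thread). Step 4 is then any BQ query
// on the synced view that runs longer than one batch.
stream.awaitTermination()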

Expected behavior

When I run a BigQuery statement, it should run without error.

Environment Description

ad1happy2go commented 11 months ago

@ranjanankur Can you please create a GCP ticket asking them what the preferred way to update the manifest file should be?

ranjanankur commented 10 months ago

Hi @codope Any update on this? We are blocked by this and are finding it very hard to use Hudi in our streaming pipeline because of it. Please let us know of any solution.

the-other-tim-brown commented 10 months ago

@ranjanankur I'm taking a look at this and tracking it with a JIRA ticket as well: https://issues.apache.org/jira/browse/HUDI-6672

I've reached out to Google Cloud to confirm that this is an issue with updating the manifest while a query is running. The solution I'm working on will version these manifests so we do not modify the file while a query is in flight.
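
To illustrate the idea with a hypothetical sketch (local-filesystem code for brevity, and not the actual HUDI-6672 design):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardOpenOption}

// Writing each sync's manifest under a fresh, unique name means an in-flight
// query keeps reading the version it resolved at start, instead of failing
// with "Not found" when latest-snapshot.csv is deleted and recreated under it.
def writeVersionedManifest(manifestDir: String, fileNames: Seq[String]): String = {
  val target = Paths.get(manifestDir, s"manifest-${System.currentTimeMillis()}.csv")
  Files.write(target, fileNames.mkString("\n").getBytes(StandardCharsets.UTF_8),
    StandardOpenOption.CREATE_NEW) // CREATE_NEW fails rather than clobbering
  target.toString
}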

emkornfield commented 10 months ago

I've reached out to Google Cloud to confirm that this is an issue with updating the manifest while a query is running.

This sounds like the likely cause (for each table we make sure we read the exact same file that was specified in the URI). The solution that uses a view for compatibility between Hudi and BigQuery is inherently flawed. Using the newly contributed manifest file approach is going to be more robust along several dimensions.

ankur334 commented 10 months ago

Hi @emkornfield which new manifest file are you mentioning? Which part of the BigQuerySync code do I need to change?

the-other-tim-brown commented 10 months ago

@ankur334 There is a new path in the BigQuerySyncTool which uses this new BigQuery feature. You can also look at this post from Google Cloud: https://cloud.google.com/blog/products/data-analytics/bigquery-manifest-file-support-for-open-table-format-queries/

ranjanankur commented 10 months ago

Hi @the-other-tim-brown , @emkornfield The above article mentions using the use-bq-manifest-file flag when running the BigQuerySyncTool to sync a Hudi table with BigQuery tables.

It also mentions using the hudi-gcp-bundle-0.14.0-SNAPSHOT.jar. We have upgraded to this JAR, but we are finding it difficult to add use-bq-manifest-file to the code.

Instead of using spark-submit to run this, we call the BigQuerySyncTool directly from our own code, and we are confused about where and how to pass this flag.

Please let us know where to pass this flag.

import java.util.Properties

import org.apache.hudi.gcp.bigquery.BigQuerySyncConfig._
import org.apache.hudi.gcp.bigquery.BigQuerySyncTool

// projectId, datasetName, datasetLocation, tableName, tablePath,
// firstLevelPartition and partitionKey are defined elsewhere in our job.
def getBigQueryProps: Properties = {
  val props = new Properties()
  props.setProperty(BIGQUERY_SYNC_PROJECT_ID.key, projectId)
  props.setProperty(BIGQUERY_SYNC_DATASET_NAME.key, datasetName)
  props.setProperty(BIGQUERY_SYNC_DATASET_LOCATION.key, datasetLocation)
  props.setProperty(BIGQUERY_SYNC_TABLE_NAME.key, tableName)
  props.setProperty(BIGQUERY_SYNC_SOURCE_URI.key, s"$tablePath/$firstLevelPartition=*")
  props.setProperty(BIGQUERY_SYNC_SOURCE_URI_PREFIX.key, s"$tablePath/")
  props.setProperty(BIGQUERY_SYNC_SYNC_BASE_PATH.key, tablePath)
  props.setProperty(BIGQUERY_SYNC_PARTITION_FIELDS.key, partitionKey)
  props.setProperty(META_SYNC_BASE_PATH.key(), tablePath)
  props.setProperty(BIGQUERY_SYNC_USE_FILE_LISTING_FROM_METADATA.key, "true")
  props.setProperty(BIGQUERY_SYNC_ASSUME_DATE_PARTITIONING.key, "false")
  props
}

new BigQuerySyncTool(getBigQueryProps).syncHoodieTable()

Also, please let us know if we are passing some wrong config, or if we need to remove some existing config, in order to use use-bq-manifest-file.

the-other-tim-brown commented 10 months ago

@ranjanankur You can pass that in as props.setProperty(BIGQUERY_SYNC_USE_BQ_MANIFEST_FILE.key, "true") in your code above. (source for property)
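
In the getBigQueryProps helper from your snippet, that is one extra line (note the constant only exists from 0.14.0 onwards):

// Switches sync to the new manifest-file integration instead of the view.
props.setProperty(BIGQUERY_SYNC_USE_BQ_MANIFEST_FILE.key, "true")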

Let me know if this works for you

ranjanankur commented 10 months ago

We can see this code in the source, but @the-other-tim-brown, will you please confirm the version of hudi-gcp-bundle that we should use? And if we have to use 0.14.0-SNAPSHOT, is it stable enough to use?

ranjanankur commented 10 months ago

Also, we are not able to find BIGQUERY_SYNC_USE_BQ_MANIFEST_FILE in the 0.14.0-SNAPSHOT version, even though we can see this part of the code on GitHub. Was the code bundled correctly? @the-other-tim-brown

the-other-tim-brown commented 10 months ago

@ranjanankur you have to build the snapshot from source since the final 0.14.0 release is not published yet. I would definitely start by testing in a staging environment first. I've only done basic testing with this code myself, but my initial findings are that it cuts down on the data scanned compared to the view-based approach.

I'm also adding support for mapping the Hudi schema to the BigQuery schema here, if you're interested: https://github.com/apache/hudi/pull/9482

ranjanankur commented 10 months ago

Hi @the-other-tim-brown We will not be able to use a snapshot version in our production use case. This is blocking all our streaming use cases, because in most of them we create a BigQuery table on top of a Hudi table. The fix for the above problem is to introduce BIGQUERY_SYNC_USE_BQ_MANIFEST_FILE in the BigQuerySync tool code, but as we are on version 0.13.1, we are not able to use this option.

Is there an expected release date for the 0.14.0 version? Please help with this. We might have to remove Hudi from all our streaming use cases where BigQuery is expected as the sink.

ranjanankur commented 10 months ago

Or is it possible to make a 0.13.x release where this can be fixed? I think this is a major blocker for everyone who uses GCP as their cloud and BigQuery. Please help with this.

the-other-tim-brown commented 10 months ago

The 0.14.0 rc1 was rejected due to some bugs that were found. A second release candidate should be cut soon.

If you want more flexibility on your side, you could also create a fork of the Hudi repo that you run in your production environments. Or, if it's just these meta sync tools that aren't being updated fast enough in the official releases, you can create a small repo with copies of just these tools and include them on your path. Feel free to DM me on the Hudi Slack about this approach, since I'm currently doing both of these at my current job.