Open ranjanankur opened 11 months ago
@ranjanankur Can you please create a GCP ticket asking them what should be the preferred way for updating the manifest file?
Hi @codope, any update on this? We are blocked and finding it very hard to use Hudi in our streaming pipeline because of this issue. Please let us know of any solution.
@ranjanankur I'm taking a look at this and tracking with the JIRA ticket here as well https://issues.apache.org/jira/browse/HUDI-6672
I've reached out to Google Cloud to confirm that this is an issue with updating the manifest while a query is running. The solution I'm working on will version these manifests so we do not modify a file while a query is in flight.
I've reached out to Google Cloud to confirm that this is an issue with updating the manifest while a query is running.
This sounds like the likely cause (for each table, we make sure we read the exact same file that was specified in the URI). The solution that uses a view for compatibility between Hudi and BigQuery is inherently flawed. Using the newly contributed manifest file approach is going to be more robust along several dimensions.
Hi @emkornfield, which new manifest file are you referring to? Which part of the BigQuerySync code do I need to change?
@ankur334 There is a new path in the BigQuerySyncTool which uses this new BigQuery feature. You can also look at this post from Google Cloud: https://cloud.google.com/blog/products/data-analytics/bigquery-manifest-file-support-for-open-table-format-queries/
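For reference, the spark-submit invocation from that post looks roughly like the following. Treat this as a sketch: the project, dataset, bucket, and partition names are placeholders, and the exact jar name depends on the version you build.

```shell
spark-submit --master yarn \
  --packages com.google.cloud:google-cloud-bigquery:2.10.4 \
  --class org.apache.hudi.gcp.bigquery.BigQuerySyncTool \
  hudi-gcp-bundle-0.14.0-SNAPSHOT.jar \
  --project-id my-project \
  --dataset-name my_dataset \
  --dataset-location us-west1 \
  --table my_table \
  --source-uri gs://my-bucket/my_table/date=* \
  --source-uri-prefix gs://my-bucket/my_table/ \
  --base-path gs://my-bucket/my_table \
  --partitioned-by date \
  --use-bq-manifest-file
```

The key difference from the older invocation is the trailing --use-bq-manifest-file flag, which switches the sync from the view-based approach to the manifest-file approach.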
Hi @the-other-tim-brown , @emkornfield
The article above mentions using the use-bq-manifest-file flag when running the BigQuerySyncTool to sync Hudi tables with BigQuery tables, and using the hudi-gcp-bundle-0.14.0-SNAPSHOT.jar. We have upgraded to this JAR, but we are having difficulty adding use-bq-manifest-file in our code.
Instead of using spark-submit to run this, we call the BigQuerySyncTool directly from our own code, and we are confused about where and how to pass this flag. Please let us know where to pass it.
import java.util.Properties

import org.apache.hudi.gcp.bigquery.BigQuerySyncConfig._
import org.apache.hudi.gcp.bigquery.BigQuerySyncTool

def getBigQueryProps: Properties = {
  val props = new Properties()
  props.setProperty(BIGQUERY_SYNC_PROJECT_ID.key, projectId)
  props.setProperty(BIGQUERY_SYNC_DATASET_NAME.key, datasetName)
  props.setProperty(BIGQUERY_SYNC_DATASET_LOCATION.key, datasetLocation)
  props.setProperty(BIGQUERY_SYNC_TABLE_NAME.key, tableName)
  props.setProperty(BIGQUERY_SYNC_SOURCE_URI.key, s"$tablePath/$firstLevelPartition=*")
  props.setProperty(BIGQUERY_SYNC_SOURCE_URI_PREFIX.key, s"$tablePath/")
  props.setProperty(BIGQUERY_SYNC_SYNC_BASE_PATH.key, tablePath)
  props.setProperty(BIGQUERY_SYNC_PARTITION_FIELDS.key, partitionKey)
  props.setProperty(META_SYNC_BASE_PATH.key(), tablePath)
  props.setProperty(BIGQUERY_SYNC_USE_FILE_LISTING_FROM_METADATA.key, "true")
  props.setProperty(BIGQUERY_SYNC_ASSUME_DATE_PARTITIONING.key, "false")
  props
}

new BigQuerySyncTool(getBigQueryProps).syncHoodieTable()
Please also let us know if any of our configs are wrong, or if we need to remove an existing config in order to use use-bq-manifest-file.
@ranjanankur You can pass that in as props.setProperty(BIGQUERY_SYNC_USE_BQ_MANIFEST_FILE.key, "true") in your code above. (source for property)
Let me know if this works for you
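As a sketch, that would slot into the getBigQueryProps method shared above like this (assuming a 0.14.0+ hudi-gcp bundle on the classpath, since BIGQUERY_SYNC_USE_BQ_MANIFEST_FILE does not exist in 0.13.x):

```scala
// Add alongside the other properties inside getBigQueryProps;
// requires the 0.14.0+ BigQuerySyncConfig, which defines this key
props.setProperty(BIGQUERY_SYNC_USE_BQ_MANIFEST_FILE.key, "true")
```

It may also be worth checking the 0.14.0 docs on whether any of the view-oriented configs become unnecessary once the manifest-file path is enabled.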
We can see this code in the source, but @the-other-tim-brown, can you please confirm which version of hudi-gcp-bundle we should use? If we have to use 0.14.0-SNAPSHOT, is it stable enough to use? Also, we are not able to find BIGQUERY_SYNC_USE_BQ_MANIFEST_FILE in the 0.14.0-SNAPSHOT version, even though we can see this part of the code on GitHub. Was the code bundled correctly?
@the-other-tim-brown
@ranjanankur you have to build the snapshot from source since the final 0.14.0 release is not published yet. I would definitely start testing in a staging environment first. I've only done basic testing with this code myself but my initial findings are that it cuts down on data scanned compared to the view based approach.
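In case it helps, building the GCP bundle from source looks roughly like the following. This is a sketch: the exact Maven profile flags depend on your Spark and Scala versions, so check the repo README before relying on it.

```shell
git clone https://github.com/apache/hudi.git
cd hudi
# Build only the GCP bundle and the modules it depends on
mvn clean package -DskipTests -pl packaging/hudi-gcp-bundle -am
# The jar is produced under packaging/hudi-gcp-bundle/target/
```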
I'm also adding in support for mapping the Hudi schema to BigQuery schema here if you're interested https://github.com/apache/hudi/pull/9482
Hi @the-other-tim-brown
We will not be able to use a snapshot version in our production use case. This is blocking all our streaming use cases, because in most of them we create a BigQuery table on top of a Hudi table. The fix for the above problem is to introduce BIGQUERY_SYNC_USE_BQ_MANIFEST_FILE in the BigQuerySync tool code, but since we are using version 0.13.1, we are not able to use this option.
Is there an expected release time for 0.14.0? Please help with this; we might have to remove Hudi from all our streaming use cases where BigQuery is the sink.
Or is it possible to make a 0.13.x release where this can be fixed? I think this is a major blocker for everyone who is using GCP as their cloud and BigQuery as their warehouse. Please help on this.
The 0.14.0 rc1 was rejected due to some bugs that were found. Second release candidate should be cut soon.
If you want more flexibility on your side, you could also create a fork of the Hudi repo that you run in your production environments. Or, if it's just these meta sync tools that aren't being updated fast enough in the official releases, you can create a small repo with copies of just these tools and include them on your classpath. Feel free to DM me on the Hudi Slack about this approach, since I'm currently doing both of these at my current job.
Problem while reading from BQ tables that are synced to Hudi tables
We have one streaming application where we are getting data from Kafka. We apply transformations using Spark Structured Streaming and then load the data into Hudi tables (GCS bucket). After creating the Hudi tables, we sync them with GCP BigQuery using the BigQuerySync tool.
Now we are facing a very weird problem, and if our understanding of this bug is correct, we might have to remove Hudi from our streaming use cases.
Let me explain the problem.
The BigQuery sync tool internally creates two BQ tables and one view.
To maintain the latest upserts, Hudi creates a latest-snapshot.csv at gs:////.hoodie/manifest/latest-snapshot.csv
After every micro-batch, Hudi replaces this file: it deletes and recreates it. Our streaming pipeline's batch duration is 5 minutes, so every 5 minutes a new file is created with the same name.
Now the problem: suppose I query the BQ view and apply a join or some other operation with other tables that takes more than 5 minutes to run. Our BQ job then throws an error because latest-snapshot.csv changed in the meantime. Error - Not found .hoodie/manifest/latest-snapshot.csv (Version 1690978310289362)
To Reproduce
Steps to reproduce the behavior:
Expected behavior
When I run a BigQuery statement, it should complete without error.
Environment Description
Hudi version : 0.13.1
Spark version : 3.1.3
Hive version : 3.1.3
Storage (HDFS/S3/GCS..) : GCP
Running on Docker? (yes/no) : no