apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[Inquiry] Can HoodieIndexer Do RLI Indexing in an Async Fashion? #10815

Closed soumilshah1995 closed 6 months ago

soumilshah1995 commented 6 months ago

Greetings,

I hope this message finds you well. There was a recent discussion in the group regarding the possibility of changing the index type from Bloom to RLI on an older table. My understanding was that RLI needed to be created on a fresh table. However, there have been discussions on Slack suggesting that we can utilize the Hudi indexer for this purpose.

Before proceeding, I would like to kindly request clarification and verification on whether such a transition is feasible or not. Your insights on this matter would be greatly appreciated.

spark-submit \
    --class org.apache.hudi.utilities.HoodieIndexer \
    --packages 'org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0' \
    --master 'local[*]' \
    --executor-memory 1g \
    /Users/soumilshah/IdeaProjects/SparkProject/DeltaStreamer/jar/hudi-utilities-slim-bundle_2.12-0.14.0.jar \
    --mode scheduleAndExecute \
    --base-path file:///Users/soumilshah/IdeaProjects/SparkProject/DeltaStreamer/hudi/bronze_orders \
    --table-name bronze_orders \
    --index-types RECORD_INDEX \
    --hoodie-conf "hoodie.metadata.enable=true" \
    --hoodie-conf "hoodie.metadata.index.async=true" \
    --hoodie-conf "hoodie.write.concurrency.mode=optimistic_concurrency_control" \
    --hoodie-conf "hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider" \
    --parallelism 2 \
    --spark-memory 2g

Error


24/03/04 18:05:57 ERROR UtilHelpers: Indexer failed
org.apache.hudi.exception.HoodieMetadataException: Failed to index partition [record_index]
    at org.apache.hudi.table.action.index.RunIndexActionExecutor.execute(RunIndexActionExecutor.java:181)
    at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.index(HoodieSparkCopyOnWriteTable.java:308)
    at org.apache.hudi.client.BaseHoodieWriteClient.index(BaseHoodieWriteClient.java:1009)
    at org.apache.hudi.utilities.HoodieIndexer.scheduleAndRunIndexing(HoodieIndexer.java:294)
    at org.apache.hudi.utilities.HoodieIndexer.lambda$start$1(HoodieIndexer.java:199)
    at org.apache.hudi.utilities.UtilHelpers.retry(UtilHelpers.java:602)
    at org.apache.hudi.utilities.HoodieIndexer.start(HoodieIndexer.java:186)
    at org.apache.hudi.utilities.HoodieIndexer.main(HoodieIndexer.java:155)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1020)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1111)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
24/03/04 18:05:57 ERROR HoodieIndexer: Indexing with basePath: file:///Users/soumilshah/IdeaProjects/SparkProject/DeltaStreamer/hudi/bronze_orders, tableName: bronze_orders, runningMode: scheduleAndExecute failed
24/03/04 18:05:57 INFO SparkContext: SparkContext is stopping with exitCode 0.
24/03/04 18:05:57 INFO SparkUI: Stopped Spark web UI at http://soumils-mbp:8090
24/03/04 18:05:57 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
24/03/04 18:05:57 INFO MemoryStore: MemoryStore cleared
24/03/04 18:05:57 INFO BlockManager: BlockManager stopped
24/03/04 18:05:57 INFO BlockManagerMaster: BlockManagerMaster stopped
24/03/04 18:05:57 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
24/03/04 18:05:57 INFO SparkContext: Successfully stopped SparkContext
24/03/04 18:06:01 INFO ShutdownHookManager: Shutdown hook called
24/03/04 18:06:01 INFO ShutdownHookManager: Deleting directory /private/var/folders/qq/s_1bjv516pn_mck29cwdwxnm0000gp/T/spark-b008e82d-b028-4eb0-b703-830d213bb5b4
24/03/04 18:06:01 INFO ShutdownHookManager: Deleting directory /private/var/folders/qq/s_1bjv516pn_mck29cwdwxnm0000gp/T/spark-197d2ead-2361-4fdc-988e-bd785a89a1fb
soumilshah@Soumils-MBP DeltaStreamer % 

Thank you for your attention to this inquiry.

soumilshah1995 commented 6 months ago

Slack Thread https://apache-hudi.slack.com/archives/C4D716NPQ/p1709331774171119

ad1happy2go commented 6 months ago

@soumilshah1995 Looks like you missed enabling the record index.

I was able to run the command below successfully by just adding --hoodie-conf "hoodie.metadata.record.index.enable=true":

~/plain_spark/spark-3.2.4-bin-hadoop3.2/bin/spark-submit \
    --class org.apache.hudi.utilities.HoodieIndexer \
    --packages 'org.apache.hudi:hudi-spark3.2-bundle_2.12:0.14.1' \
    --master 'local[*]' \
    --executor-memory 1g \
    ~/jars/0.14.1/spark32/hudi-utilities-slim-bundle_2.12-0.14.1.jar \
    --mode scheduleAndExecute \
    --base-path file:///tmp/hudi_cow_read \
    --table-name hudi_table \
    --index-types RECORD_INDEX \
    --hoodie-conf "hoodie.metadata.enable=true" \
    --hoodie-conf "hoodie.write.concurrency.mode=optimistic_concurrency_control" \
    --hoodie-conf "hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider" \
    --hoodie-conf "hoodie.metadata.record.index.enable=true" \
    --hoodie-conf "hoodie.metadata.index.async=true" \
    --parallelism 2 \
    --spark-memory 2g 
soumilshah1995 commented 6 months ago

let me try this :D

soumilshah1995 commented 6 months ago

That approach does indeed work.

I have a basic question: if you wish to enable the record-level index (RLI) in offline mode but disable it in regular jobs, how would we go about it? Is there a flag to deactivate RLI during writing, allowing us to build it asynchronously?

ad1happy2go commented 6 months ago

@soumilshah1995 If we want to do that, we can disable hoodie.metadata.record.index.enable in the ingestion, so it doesn't use RLI during regular ingestion.

soumilshah1995 commented 6 months ago

Interesting, thanks!

nsivabalan commented 5 months ago

hey @ad1happy2go @codope: looks like there is some misunderstanding of how to use the async indexer. When enabling the async indexer to build, say, RLI, the ingestion job should also have async indexing enabled for RLI. We can't completely disable it from the regular ingestion job. Can you folks follow up on any doc enhancements? CC @soumilshah1995

codope commented 4 months ago

Yes, that is correct. It's mentioned as a limitation in https://hudi.apache.org/docs/metadata_indexing#caveats
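For anyone landing here later: given that caveat, the regular writer cannot fully drop the RLI-related settings while HoodieIndexer builds the index offline. Below is a minimal sketch of the ingestion-side write configs, as an illustration only; the lock provider choice is an assumption, since InProcessLockProvider from the commands above only guards writers within a single JVM, and truly separate indexer and writer jobs would need a cross-process provider (e.g. filesystem- or ZooKeeper-based).

```
# Ingestion (writer) job: keep the metadata table and async indexing enabled
hoodie.metadata.enable=true
hoodie.metadata.index.async=true

# Multi-writer guard: the indexer and the writer commit concurrently
hoodie.write.concurrency.mode=optimistic_concurrency_control
# Assumption: a lock provider that spans processes; swap in whichever
# provider matches your deployment
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider
```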

soumilshah1995 commented 4 months ago

Thanks for the heads up, guys.