databricks-demos / dbdemos

Demos to implement your Databricks Lakehouse

RAG Demo (01-PDF-Advanced-Data-Preparation) fails on Azure #111

Closed bkuan closed 5 months ago

bkuan commented 8 months ago

Customers on Azure are hitting an error in the RAG demo notebook 01-PDF-Advanced-Data-Preparation when running:

(spark.readStream.table('pdf_raw')
      .withColumn("content", F.explode(read_as_chunk("content")))
      .withColumn("embedding", get_embedding("content"))
      .selectExpr('path as url', 'content', 'embedding')
  .writeStream
    .trigger(availableNow=True)
    .option("checkpointLocation", f'dbfs:{volume_folder}/checkpoints/pdf_chunk')
    .table('databricks_pdf_documentation').awaitTermination())

Getting the error:

StreamingQueryException: [STREAM_FAILED] Query [id = e14863ab-3fbc-47f5-9076-541c3fb5ba20, runId = b67a8ba4-4a98-4d92-9321-b1171d7ad856] terminated with exception: Job aborted due to stage failure: Task 3 in stage 59.0 failed 4 times, most recent failure: Lost task 3.3 in stage 59.0 (TID 153) (10.139.64.111 executor 0): org.apache.spark.SparkRuntimeException: [UDF_ERROR.PAYLOAD] Execution of function read_as_chunk(content#3152) failed  - failed to set payload.
== Error ==
INVALID_ARGUMENT: cannot import name 'Iterator' from 'typing_extensions' (/databricks/python3/lib/python3.10/site-packages/typing_extensions.py)

We have tried reinstalling "typing_extensions" with pip install typing_extensions followed by import typing_extensions, but that has not fixed the issue.

Another customer has been running into a similar issue on AWS with a shared cluster:

(spark.readStream.table('pdf_raw')
      .withColumn("content", sf.explode(read_as_chunk("content")))
      .withColumn("embedding", get_embedding("content"))
      .selectExpr('path as url', 'content', 'embedding')
  .writeStream
    .trigger(availableNow=True)
    .option("checkpointLocation", f'dbfs:{uc_volume_folder}/checkpoints/pdf_chunk')
    .table('cert_pdf_data').awaitTermination())

https://databricks.lightning.force.com/lightning/r/Case/500Vp000003jj9GIAQ/view

bkuan commented 8 months ago

This issue is reproducible in our Azure field environment.

QuentinAmbard commented 8 months ago

Did you try adding from typing import Iterator? I don't know why the error mentions typing_extensions.
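For anyone debugging this, here is a minimal diagnostic sketch for the suggestion above. It imports Iterator from the standard-library typing module and reports whether the installed typing_extensions exposes an Iterator name at all. The helper name describe_typing_extensions is hypothetical and not part of the demo. It also assumes that the check runs in the same Python environment as the failing UDF, which may not hold on a cluster where the driver and executors differ.

```python
import importlib
from importlib import metadata
from typing import Iterator  # standard-library home of Iterator


def describe_typing_extensions() -> dict:
    """Hypothetical helper: report what the local typing_extensions provides."""
    info = {"installed": False, "version": None, "path": None, "has_iterator": False}
    try:
        module = importlib.import_module("typing_extensions")
    except ImportError:
        # Package is not importable in this environment.
        return info
    info["installed"] = True
    info["path"] = getattr(module, "__file__", None)
    # True only if `from typing_extensions import Iterator` would succeed here.
    info["has_iterator"] = hasattr(module, "Iterator")
    try:
        info["version"] = metadata.version("typing_extensions")
    except metadata.PackageNotFoundError:
        # Module is importable but has no distribution metadata.
        pass
    return info


if __name__ == "__main__":
    for key, value in describe_typing_extensions().items():
        print(f"{key}: {value}")
```

If has_iterator comes back False, any code that runs from typing_extensions import Iterator would fail in that environment with the same ImportError, regardless of what the notebook itself imports.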

QuentinAmbard commented 5 months ago

Hey, we changed the code so that it no longer requires installing the OCR. This should be fixed now. Feel free to reopen if needed!