databricks-demos / dbdemos

Demos to implement your Databricks Lakehouse

llm-rag-chatbot - 01-Data-Preparation-and-Index #90

Closed mferrari0 closed 5 months ago

mferrari0 commented 6 months ago

Running the following cell in the notebook mentioned above:

(spark.table("raw_documentation")
      .withColumn('content', F.explode(parse_and_split('text')))
      .withColumn('embedding', get_embedding('content'))
      .drop("text")
      .write.mode('overwrite').saveAsTable("databricks_documentation"))

display(spark.table("databricks_documentation"))

results in a timeout error (screenshot attached: Screenshot 2023-12-08 163213).

Has anybody experienced this?

QuentinAmbard commented 6 months ago

Hey, I can't reproduce this. Can you try adding .filter('text is not null'):

(spark.table("raw_documentation")
      .filter('text is not null')
      .withColumn('content', F.explode(parse_and_split('text')))
      .withColumn('embedding', get_embedding('content'))
      .drop("text")
      .write.mode('overwrite').saveAsTable("databricks_documentation"))

display(spark.table("databricks_documentation"))
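To illustrate why the filter can matter: if any row of raw_documentation has a NULL text column, a Python UDF like parse_and_split receives None and may raise before the explode ever runs. The snippet below is a simplified, hypothetical stand-in for the demo's parse_and_split (the real UDF is more involved); it only shows the failure mode the suggested .filter('text is not null') guards against.

```python
def parse_and_split(text):
    # Simplified stand-in for the demo's parse_and_split UDF:
    # naive splitting raises AttributeError on NULL (None) input.
    return [chunk.strip() for chunk in text.split("\n\n")]

rows = ["First chunk.\n\nSecond chunk.", None]

# Without filtering, parse_and_split(None) would raise.
# Filtering out null rows first (the Spark-level equivalent of
# .filter('text is not null')) lets the pipeline proceed:
chunks = [c for t in rows if t is not None for c in parse_and_split(t)]
print(chunks)  # ['First chunk.', 'Second chunk.']
```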

mferrari0 commented 6 months ago

Thanks @QuentinAmbard. However, that didn't solve the bug, because the root cause was different: model serving is not available in my region, so I had to create my own serving endpoint. The one I created was CPU-backed with "Small" selected for the "Compute Scaleout" setting. That turned out to be insufficient and caused the timeout error shown above. After switching the endpoint from CPU to GPU, the notebook ran without problems.
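The actual fix here was provisioning a GPU-backed endpoint. That said, when an endpoint is merely slow rather than undersized, a generic retry wrapper with exponential backoff can make the embedding call tolerant of transient timeouts. This is a hypothetical sketch, not part of the demo: with_retries and flaky are illustrative names, and in the notebook the body of the get_embedding UDF would be the function being wrapped.

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0):
    """Return a wrapper that calls fn, retrying on exceptions
    with exponential backoff (base_delay, 2*base_delay, ...)."""
    def wrapped(*args, **kwargs):
        for attempt in range(max_attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # out of attempts: surface the error
                time.sleep(base_delay * (2 ** attempt))
    return wrapped

# Demo: a stand-in endpoint call that times out twice, then succeeds.
calls = {"n": 0}
def flaky(x):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("endpoint timed out")
    return x * 2

safe = with_retries(flaky, base_delay=0.01)
print(safe(21))  # prints 42 after two retries
```

This only papers over transient slowness; if the endpoint is consistently under-provisioned for the workload, resizing it (as above) is the real fix.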