danswer-ai / danswer

Gen-AI Chat for Teams - Think ChatGPT if it had access to your team's unique knowledge.
https://danswer.ai

A few connectors fail to index with "Empty or missing text for embedding" #2000

Closed ahmadassaf closed 1 hour ago

ahmadassaf commented 3 months ago

I have a few connectors (Notion, Productboard, Confluence), and some of them are failing with the following error:

Traceback (most recent call last):
  File "/app/danswer/background/indexing/run_indexing.py", line 199, in _run_indexing
    new_docs, total_batch_chunks = indexing_pipeline(
                                   ^^^^^^^^^^^^^^^^^^
  File "/app/danswer/utils/timing.py", line 31, in wrapped_func
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/indexing/indexing_pipeline.py", line 175, in index_doc_batch
    chunks_with_embeddings = embedder.embed_chunks(
                             ^^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/indexing/embedder.py", line 99, in embed_chunks
    embeddings = self.embedding_model.encode(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/natural_language_processing/search_nlp_models.py", line 79, in encode
    raise ValueError(f"Empty or missing text for embedding: {texts}")
ValueError: Empty or missing text for embedding: []

karlnewell commented 3 months ago

I am running into the same issue with the Confluence and Web connectors. I am running Danswer on Kubernetes from the provided Helm charts, built from the latest commits on main. I will try to find more information and logs.

Traceback (most recent call last):
  File "/app/danswer/background/indexing/run_indexing.py", line 199, in _run_indexing
    new_docs, total_batch_chunks = indexing_pipeline(
                                   ^^^^^^^^^^^^^^^^^^
  File "/app/danswer/utils/timing.py", line 31, in wrapped_func
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/indexing/indexing_pipeline.py", line 177, in index_doc_batch
    chunks_with_embeddings = embedder.embed_chunks(
                             ^^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/indexing/embedder.py", line 99, in embed_chunks
    embeddings = self.embedding_model.encode(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/natural_language_processing/search_nlp_models.py", line 79, in encode
    raise ValueError(f"Empty or missing text for embedding: {texts}")
ValueError: Empty or missing text for embedding: []
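
Both tracebacks end at the same point: `encode` receives an empty list of texts (`[]`) and raises. As a minimal sketch of the kind of guard that avoids this, and not the project's actual code, the helper below skips the model call when nothing is left to embed. The names `embed_non_empty` and `encode` are hypothetical and do not come from the Danswer codebase.

```python
from typing import Callable, List, Sequence


def embed_non_empty(
    texts: Sequence[str],
    encode: Callable[[List[str]], List[List[float]]],
) -> List[List[float]]:
    """Embed only texts that have content.

    `encode` stands in for whatever embedding call is in use (an assumption).
    Blank or whitespace-only texts are dropped, so the result can be shorter
    than the input.
    """
    cleaned = [text for text in texts if text and text.strip()]
    if not cleaned:
        # Nothing to embed: return early instead of calling the model.
        return []
    return encode(cleaned)
```

A caller that needs embeddings aligned one-to-one with its chunks would have to filter the chunks the same way before calling this.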

karlnewell commented 3 months ago

Looks like this was fixed in https://github.com/danswer-ai/danswer/commit/348a2176f01e864b88040f0f1c8cd8643695200b

However, the Helm charts set the container imagePullPolicy to IfNotPresent, which means a Helm upgrade or pod restart won't pull a newer container image, because the tag does not change (it's set to latest).

I'll submit a PR to update the imagePullPolicy via Helm values.

ahmadassaf commented 3 months ago

@karlnewell things seem to have moved on a bit, but now it's running into a new error:

Traceback (most recent call last):
  File "/app/danswer/background/indexing/run_indexing.py", line 177, in _run_indexing
    for doc_batch in doc_batch_generator:
  File "/app/danswer/connectors/productboard/connector.py", line 232, in poll_source
    for document in chain(
  File "/app/danswer/connectors/productboard/connector.py", line 177, in _get_objectives
    yield Document(
          ^^^^^^^^^
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for Document
metadata -> state
  none is not an allowed value (type=type_error.none.not_allowed)
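
The validation error says that the `state` key of the document metadata was `None`, which the `Document` model rejects. One possible workaround, sketched here under the assumption that dropping the key is acceptable, is to filter out `None` values before the metadata is passed to the model. `clean_metadata` is a hypothetical helper, not part of the Productboard connector.

```python
from typing import Any, Dict, Optional


def clean_metadata(raw: Dict[str, Optional[Any]]) -> Dict[str, Any]:
    """Return a copy of `raw` without the keys whose value is None."""
    return {key: value for key, value in raw.items() if value is not None}
```

For example, `clean_metadata({"state": None, "owner": "a@example.com"})` returns `{"owner": "a@example.com"}`. Substituting a placeholder string instead of dropping the key would work the same way, if downstream code expects `state` to be present.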

github-actions[bot] commented 1 week ago

This issue is stale because it has been open 75 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] commented 1 hour ago

This issue was closed because it has been stalled for 90 days with no activity.