danswer-ai / danswer

Gen-AI Chat for Teams - Think ChatGPT if it had access to your team's unique knowledge.
https://docs.danswer.dev/

Confluence connector gives error for movie files #1648

Closed tomska-pfsw closed 3 months ago

tomska-pfsw commented 3 months ago

I'm trying out Danswer and the Confluence (cloud) connector. It seems to connect fine, but the workspaces we want to index contain some pages that have uploaded movie (AVI/MP4) files. The connector then gets an error message and stops. See errors below. It would be a lot better if it just skipped unknown files like that.


```
  File "/app/danswer/background/indexing/run_indexing.py", line 177, in _run_indexing
    for doc_batch in doc_batch_generator:
  File "/app/danswer/connectors/confluence/connector.py", line 478, in poll_source
    doc_batch, num_pages = self._get_doc_batch(
                           ^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/connectors/confluence/connector.py", line 429, in _get_doc_batch
    attachment_text = self._fetch_attachments(
                      ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/connectors/confluence/connector.py", line 375, in _fetch_attachments
    raise e
  File "/app/danswer/connectors/confluence/connector.py", line 368, in _fetch_attachments
    extract = extract_file_text(
              ^^^^^^^^^^^^^^^^^^
  File "/app/danswer/file_processing/extract_file_text.py", line 268, in extract_file_text
    raise RuntimeError(f"Unprocessable file type: {file_name}")
RuntimeError: Unprocessable file type: 2018-01-10 TRAM and CJM.mp4
```

The same happens for some other file types, like "tar" files.

tomska-pfsw commented 3 months ago

Sorry, I have now seen that there is a Docker Compose environment variable that can be set to avoid this.

logan-hcg commented 3 months ago

The setting is `CONTINUE_ON_CONNECTOR_FAILURE=true`.
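
For anyone wondering what a flag like this typically does, here is a minimal sketch of the general "skip on failure" pattern. It is not Danswer's actual code: `continue_on_failure_enabled`, `extract_all`, and `toy_extract` are hypothetical names made up for illustration. Treating only the string `true` as enabled is also an assumption.

```python
import os
from typing import Callable, Iterable


def continue_on_failure_enabled() -> bool:
    # Hypothetical helper: reads the flag from the environment.
    # Treating only "true" (any letter case) as enabled is an assumption.
    return os.environ.get("CONTINUE_ON_CONNECTOR_FAILURE", "").lower() == "true"


def extract_all(
    file_names: Iterable[str],
    extract: Callable[[str], str],
    continue_on_failure: bool,
) -> tuple[list[str], list[str]]:
    """Return (extracted_texts, skipped_file_names)."""
    texts: list[str] = []
    skipped: list[str] = []
    for name in file_names:
        try:
            texts.append(extract(name))
        except RuntimeError:
            if not continue_on_failure:
                # Flag off: the first bad file aborts the whole run.
                raise
            # Flag on: remember the file and move on to the next one.
            skipped.append(name)
    return texts, skipped


def toy_extract(file_name: str) -> str:
    # Hypothetical stand-in for a real extractor: only .txt is "processable".
    if not file_name.endswith(".txt"):
        raise RuntimeError(f"Unprocessable file type: {file_name}")
    return f"text of {file_name}"
```

Calling `extract_all(["a.txt", "b.mp4", "c.tar"], toy_extract, continue_on_failure_enabled())` with the variable set to `true` would return the text for `a.txt` and list the other two files as skipped. With the variable unset, the `RuntimeError` for `b.mp4` propagates and stops the run, which matches the behaviour described in the original report.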