Handle file text extraction errors

Hey

When we try to extract text from a file, the extraction can raise an exception. A common one is:

Traceback (most recent call last):
  File "/app/danswer/background/indexing/run_indexing.py", line 168, in _run_indexing
    for doc_batch in doc_batch_generator:
  File "/app/danswer/connectors/sharepoint/connector.py", line 159, in _fetch_from_sharepoint
    doc_batch.append(_convert_driveitem_to_document(driveitem))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/connectors/sharepoint/connector.py", line 43, in _convert_driveitem_to_document
    file_text = extract_file_text(
                ^^^^^^^^^^^^^^^^^^
  File "/app/danswer/file_processing/extract_file_text.py", line 286, in extract_file_text
    return xlsx_to_text(file)
           ^^^^^^^^^^^^^^^^^^
  File "/app/danswer/file_processing/extract_file_text.py", line 221, in xlsx_to_text
    workbook = openpyxl.load_workbook(file)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/openpyxl/reader/excel.py", line 344, in load_workbook
    reader = ExcelReader(filename, read_only, keep_vba,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/openpyxl/reader/excel.py", line 123, in __init__
    self.archive = _validate_archive(fn)
                   ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/openpyxl/reader/excel.py", line 95, in _validate_archive
    archive = ZipFile(filename, 'r')
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/zipfile.py", line 1302, in __init__
    self._RealGetContents()
  File "/usr/local/lib/python3.11/zipfile.py", line 1369, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

It is raised by openpyxl.load_workbook(file) when the file is not a valid .xlsx (zip) archive.

But this exception makes the whole indexing process fail.

Screenshot from 2024-06-25 15-07-30

So I suggest handling these errors by logging the names of the bad files, so that the indexing process can continue:

Screenshot from 2024-06-25 15-45-21
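
For readers who cannot see the screenshot, here is a minimal sketch of the idea, not the actual diff of this PR. The names `safe_extract_file_text` and `xlsx_to_text_stub` are hypothetical. The stub only opens the file as a zip archive, which is the step that fails in the traceback above, and it stands in for the real extraction code:

```python
import io
import logging
import zipfile
from typing import IO, Optional

logger = logging.getLogger(__name__)


def xlsx_to_text_stub(file: IO[bytes]) -> str:
    """Hypothetical stand-in for the real extractor.

    An .xlsx file is a zip archive, so opening a corrupt one raises
    zipfile.BadZipFile, the same exception as in the traceback.
    """
    with zipfile.ZipFile(file, "r") as archive:
        return " ".join(archive.namelist())


def safe_extract_file_text(file_name: str, file: IO[bytes]) -> Optional[str]:
    """Return the extracted text, or None if the file cannot be parsed.

    The failure is logged with the file name instead of being re-raised,
    so the caller can skip this file and keep indexing the others.
    """
    try:
        return xlsx_to_text_stub(file)
    except Exception:
        # logger.exception also records the traceback of the failure
        logger.exception("Failed to extract text from file '%s'", file_name)
        return None


# A corrupt file is logged and skipped instead of stopping the run
result = safe_extract_file_text("bad.xlsx", io.BytesIO(b"this is not a zip archive" * 4))
```

With this shape, the caller has to check for `None` and skip the document. That is one possible design; returning an empty string would be another.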

To go further, I think grouping all indexing errors, for example to inform users that X files are badly formatted, could be a good idea. But I'm not that good at 🐍
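
As a starting point for that follow-up, a rough sketch of grouping the failures could look like this. Everything here is an assumption for illustration: `extract_batch` and `open_as_xlsx_stub` are hypothetical names, and the stub only checks that the payload is a readable zip archive rather than doing real text extraction.

```python
import io
import logging
import zipfile
from typing import Dict, Tuple

logger = logging.getLogger(__name__)


def open_as_xlsx_stub(payload: bytes) -> str:
    """Hypothetical extractor: raises zipfile.BadZipFile on corrupt input."""
    with zipfile.ZipFile(io.BytesIO(payload), "r") as archive:
        return " ".join(archive.namelist())


def extract_batch(files: Dict[str, bytes]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Extract text from every file and collect failures instead of raising.

    Returns (texts, failures), where failures maps a file name to a short
    description of the error, so it can be reported to the user in one place.
    """
    texts: Dict[str, str] = {}
    failures: Dict[str, str] = {}
    for name, payload in files.items():
        try:
            texts[name] = open_as_xlsx_stub(payload)
        except Exception as exc:
            failures[name] = f"{type(exc).__name__}: {exc}"
    if failures:
        # One summary line for the whole batch: "X files are badly formatted"
        logger.warning(
            "%d file(s) could not be parsed: %s",
            len(failures),
            ", ".join(sorted(failures)),
        )
    return texts, failures


# Build one valid archive and one corrupt payload to show both paths
good = io.BytesIO()
with zipfile.ZipFile(good, "w") as zf:
    zf.writestr("sheet1.xml", "<x/>")

texts, failures = extract_batch(
    {"good.xlsx": good.getvalue(), "bad.xlsx": b"this is not a zip archive" * 4}
)
```

The `failures` mapping could then be stored with the indexing attempt and shown in the UI, but how that fits into the existing indexing status model is left open here.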

danswer-ai / danswer

Handle file text extraction errors #1701