Extracting text from a file can raise an exception. A common one is:
Traceback (most recent call last):
  File "/app/danswer/background/indexing/run_indexing.py", line 168, in _run_indexing
    for doc_batch in doc_batch_generator:
  File "/app/danswer/connectors/sharepoint/connector.py", line 159, in _fetch_from_sharepoint
    doc_batch.append(_convert_driveitem_to_document(driveitem))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/connectors/sharepoint/connector.py", line 43, in _convert_driveitem_to_document
    file_text = extract_file_text(
                ^^^^^^^^^^^^^^^^^^
  File "/app/danswer/file_processing/extract_file_text.py", line 286, in extract_file_text
    return xlsx_to_text(file)
           ^^^^^^^^^^^^^^^^^^
  File "/app/danswer/file_processing/extract_file_text.py", line 221, in xlsx_to_text
    workbook = openpyxl.load_workbook(file)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/openpyxl/reader/excel.py", line 344, in load_workbook
    reader = ExcelReader(filename, read_only, keep_vba,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/openpyxl/reader/excel.py", line 123, in __init__
    self.archive = _validate_archive(fn)
                   ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/openpyxl/reader/excel.py", line 95, in _validate_archive
    archive = ZipFile(filename, 'r')
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/zipfile.py", line 1302, in __init__
    self._RealGetContents()
  File "/usr/local/lib/python3.11/zipfile.py", line 1369, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
It is raised by the call to openpyxl.load_workbook(file).
But this exception makes the whole indexing process fail.
So I suggest handling these errors by logging the names of the bad files, so that the indexing process can continue.
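To illustrate the idea, here is a minimal sketch of "log the file name and keep going". It is not the actual change in this PR: `safe_extract_text` and `zip_member_names` are hypothetical names I made up for the example, and `zip_member_names` is only a stand-in that fails with the same `zipfile.BadZipFile` as the traceback above, without needing openpyxl.

```python
import io
import logging
import zipfile
from typing import IO, Callable, Optional

logger = logging.getLogger(__name__)


def safe_extract_text(
    file_name: str,
    file: IO[bytes],
    extract: Callable[[IO[bytes]], str],
) -> Optional[str]:
    """Return the extracted text, or None if the file could not be parsed.

    Hypothetical wrapper: the caller decides what to do with None
    (for example, skip the document instead of aborting the batch).
    """
    try:
        return extract(file)
    except Exception:
        # Log the offending file name together with the traceback,
        # then let the caller continue with the next file.
        logger.exception("Failed to extract text from %s, skipping it", file_name)
        return None


def zip_member_names(file: IO[bytes]) -> str:
    """Stand-in extractor (assumption, not Danswer code).

    zipfile.ZipFile raises zipfile.BadZipFile on non-zip input, which is
    the same error openpyxl hits in the traceback.
    """
    with zipfile.ZipFile(file, "r") as archive:
        return "\n".join(archive.namelist())


if __name__ == "__main__":
    broken = io.BytesIO(b"this is not a zip archive")
    # The bad file is logged and skipped instead of raising.
    print(safe_extract_text("broken.xlsx", broken, zip_member_names))
```

Catching `Exception` is deliberately broad here because the goal is that one bad file never stops the run; a narrower list of exception types would also work if we prefer to let unexpected errors surface.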
To go further, I think that grouping all indexing errors, for example to tell users that X files are badly formatted, could be a good idea. But I'm not that good at 🐍
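For what it's worth, the grouping idea could look roughly like the sketch below. Everything in it (`IndexingReport`, `extract_all`, the summary wording) is hypothetical and only meant to show the shape: collect failures per file while indexing, then report them once at the end.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class IndexingReport:
    """Hypothetical container for the outcome of one indexing run."""

    texts: Dict[str, str] = field(default_factory=dict)
    # Each failure is (file name, short error description).
    failures: List[Tuple[str, str]] = field(default_factory=list)

    def summary(self) -> str:
        """One line that could be shown to the user after the run."""
        if not self.failures:
            return f"{len(self.texts)} files indexed, no failures"
        names = ", ".join(name for name, _ in self.failures)
        return (
            f"{len(self.texts)} files indexed, "
            f"{len(self.failures)} skipped as unreadable: {names}"
        )


def extract_all(
    files: Iterable[Tuple[str, bytes]],
    extract: Callable[[bytes], str],
) -> IndexingReport:
    """Run `extract` on every file, recording failures instead of raising."""
    report = IndexingReport()
    for name, content in files:
        try:
            report.texts[name] = extract(content)
        except Exception as exc:
            # Keep the file name and the error type so they can be grouped later.
            report.failures.append((name, f"{type(exc).__name__}: {exc}"))
    return report


if __name__ == "__main__":
    files = [("a.txt", b"hello"), ("b.bin", b"\xff\xfe")]
    report = extract_all(files, lambda data: data.decode("utf-8"))
    print(report.summary())
```

The usage at the bottom uses UTF-8 decoding as a toy extractor, so the second file fails and ends up in the summary rather than stopping the loop.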