Open Snowman-s opened 2 months ago
@Snowman-s I don't believe this is directly related to the filename. You can test that theory by making a copy of the file and changing its name to something like document.doc. I believe you'll see the same results.
You can see where this error occurs here; the code is capturing the soffice command output for logging purposes:
https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/common/common.py#L309
Are you running on Windows? There are some possible problems with the encoding of the terminal output not being utf-8.
@scanny Thanks for the reply!
I don't believe this is directly related to the filename. You can test that theory by making a copy of the file and changing its name to something like document.doc. I believe you'll see the same results.
I re-tested with an English filename and reconfirmed that it does not produce any errors. During that verification, I noticed that it fails not only when the file name contains multi-byte characters, but also when the path to the file contains them.
Are you running on Windows?
Yes. I'm using Windows 11.
Ahh, interesting. That leads me to believe that the input filename is echoed on stdout somewhere and that's where it's failing (and why it's not failing until we try to read stdout). In any case, it appears the encoding used on stdout on your machine is not utf-8.
Some useful detail on the underlying problem here: https://github.com/python/cpython/issues/105312
@Snowman-s what happens if you set PYTHONENCODING=utf-8 before running your code? https://stackoverflow.com/a/7865013/1902513
Engineering note: one plausible solution to this is to avoid attempts to decode the captured stdout bytes and simply use str(output.stdout) instead. Rationale:
- ... try/except UnicodeDecodeError block.
- ... utf-8 for the entire running Python process and subprocesses.
@scanny
@Snowman-s what happens if you set PYTHONENCODING=utf-8 before running your code? https://stackoverflow.com/a/7865013/1902513
I ran $env:PYTHONIOENCODING="utf-8:surrogateescape"; python <code>.py, and it still throws the same exception. Same for sys.stdout.reconfigure(encoding='utf-8').
I forgot to mention it until now, but here's the version of soffice (the Windows 64-bit version):
> soffice --version
LibreOffice 24.8.1.2 87fa9aec1a63e70835390b81c40bb8993f1d4ff6
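A possible explanation for why neither setting changes anything, assuming the failing call is the bare output.stdout.decode() shown in the traceback: bytes.decode() called without arguments always uses UTF-8, while PYTHONIOENCODING and sys.stdout.reconfigure() only affect the interpreter's own standard streams. A minimal sketch follows; the byte values are made up for illustration (0x8a is the byte named in the error message) and are not actual soffice output:

```python
import os

# Hypothetical captured bytes: a stand-in for non-UTF-8 subprocess output.
raw = bytes([0x8A, 0xD6])

# Setting the variable (here, or before launching Python) does not change
# what bytes.decode() does; its default encoding argument is always "utf-8".
os.environ["PYTHONIOENCODING"] = "utf-8:surrogateescape"

try:
    raw.decode()
    outcome = "decoded"
except UnicodeDecodeError as exc:
    outcome = f"failed: {exc.reason}"

print(outcome)
```

If this reading is right, the fix has to be made where the captured bytes are decoded, not in the environment.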
@Snowman-s that's good to know. That narrows down the possible solutions.
Engineering note: I think this rules out us being able to affect how LibreOffice encodes messages it writes to stdout. The options I can think of are these:
1. A try/except block as I mentioned above, falling back to str(bytes_from_stdout) on UnicodeDecodeError, which would be mostly readable.
2. A try/except, using chardet as a backup to auto-detect the encoding.
My vote is option 1 since this is for logging, not for UI.
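Option 1 could look roughly like the sketch below. The helper name stdout_message is hypothetical and not part of unstructured; it only illustrates the try/except fallback applied to bytes captured from a subprocess:

```python
def stdout_message(raw: bytes) -> str:
    """Return captured stdout as text suitable for a log line.

    Hypothetical helper: decodes as UTF-8 when possible, otherwise
    falls back to the bytes repr, which stays mostly readable.
    """
    try:
        return raw.decode().strip()
    except UnicodeDecodeError:
        # str() on bytes gives the b'...' repr with \x escapes for
        # undecodable bytes, so logging never raises.
        return str(raw)


print(stdout_message(b"convert ok\n"))
print(stdout_message(b"convert \x8a\xd6.doc"))
```

The fallback keeps the ASCII parts of the message legible and shows the undecodable bytes as escapes, which seems adequate for a log message.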
Describe the bug
When calling unstructured.partition.doc.partition_doc with a doc file with a multi-byte name (I checked: 文章.doc / 风格.doc), it fails with an error.
To Reproduce
Create an empty *.doc file using Word, and name it 文章.doc or 风格.doc. I attached those files below: empty_docs.zip
Run the code below:
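from unstructured.partition.doc import partition_doc
doc_path = "文章.doc"
# doc_path = "风格.doc"  # the other file name that fails the same way
elements = partition_doc(filename=doc_path)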
Traceback (most recent call last):
  File "\utf8error.py", line 4, in <module>
    elements = partition_doc(filename=doc_path)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\unstructured\documents\elements.py", line 605, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\unstructured\file_utils\filetype.py", line 731, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\unstructured\file_utils\filetype.py", line 687, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\unstructured\chunking\dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\unstructured\partition\doc.py", line 89, in partition_doc
    convert_office_doc(
  File "\site-packages\unstructured\partition\common.py", line 429, in convert_office_doc
    message = output.stdout.decode().strip()
              ^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8a in position 32: invalid start byte