Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/Cannot partition doc files with multi-byte names #3652

Open Snowman-s opened 2 months ago

Snowman-s commented 2 months ago

**Describe the bug**
Calling unstructured.partition.doc.partition_doc on a .doc file with a multi-byte name (I checked 文章.doc and 风格.doc) fails with an error.

**To Reproduce**

  1. Create an empty *.doc file using Word and name it 文章.doc or 风格.doc. I have attached both files: empty_docs.zip

  2. Run the code below:

    
    from unstructured.partition.doc import partition_doc

    doc_path = "文章.doc"  # or "风格.doc"
    elements = partition_doc(filename=doc_path)


  3. It will throw:

    Traceback (most recent call last):
      File "<home>\utf8error.py", line 4, in <module>
        elements = partition_doc(filename=doc_path)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "<python>\site-packages\unstructured\documents\elements.py", line 605, in wrapper
        elements = func(*args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^
      File "<python>\site-packages\unstructured\file_utils\filetype.py", line 731, in wrapper
        elements = func(*args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^
      File "<python>\site-packages\unstructured\file_utils\filetype.py", line 687, in wrapper
        elements = func(*args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^
      File "<python>\site-packages\unstructured\chunking\dispatch.py", line 74, in wrapper
        elements = func(*args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^
      File "<python>\site-packages\unstructured\partition\doc.py", line 89, in partition_doc
        convert_office_doc(
      File "<python>\site-packages\unstructured\partition\common.py", line 429, in convert_office_doc
        message = output.stdout.decode().strip()
                  ^^^^^^^^^^^^^^^^^^^^^^
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8a in position 32: invalid start byte



(I replaced my private directory with \<home> and the directory where Python is installed with \<python>.)

**Expected behavior**
The code should not throw an exception. 

**Screenshots**
**Environment Info**
- Windows 11 Home
- Python 3.11.9
- unstructured: 0.15.13

**Additional context**
An English (alphabet-only) filename does not cause an exception.

scanny commented 2 months ago

@Snowman-s I don't believe this is directly related to the filename. You can test that theory by making a copy of the file and changing its name to something like document.doc. I believe you'll see the same results.

The error occurs where the code captures the soffice command output for logging purposes: https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/common/common.py#L309

Are you running on Windows? There are some possible problems with the encoding of the terminal output not being utf-8.

Snowman-s commented 2 months ago

@scanny Thanks for the reply!

> I don't believe this is directly related to the filename. You can test that theory by making a copy of the file and changing its name to something like document.doc. I believe you'll see the same results.

I have re-tested the English filename and reconfirmed that it does not produce any errors. During that verification, I noticed that it fails not only if the file name contains multi-byte characters, but also if the path to the file contains multi-byte characters.

> Are you running on Windows?

Yes. I'm using Windows 11.

scanny commented 2 months ago

Ahh, interesting. That leads me to believe that the input filename is echoed on stdout somewhere and that's where it's failing (and why it's not failing until we try to read stdout). In any case, it appears the encoding used on stdout on your machine is not utf-8.
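To make that hypothesis concrete, here is a minimal sketch that does not call soffice at all. A child Python process stands in for it and echoes a file name to stdout in cp932, which is only an assumed example of a legacy Windows code page (the reporter's actual code page is not stated in this thread). The helper names are hypothetical.

```python
import subprocess
import sys

# Stand-in for soffice (assumption for illustration): a child process that
# echoes a file name to stdout encoded as cp932 rather than UTF-8.
# \u6587\u7ae0 is the name used in the report; it is written with escapes so
# the command line itself stays ASCII.
CHILD_CODE = r'''
import sys
sys.stdout.buffer.write("convert \u6587\u7ae0.doc".encode("cp932"))
'''


def capture_child_stdout() -> bytes:
    """Run the stand-in child and return its raw stdout bytes."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE], capture_output=True, check=True
    )
    return result.stdout


def try_default_decode(raw: bytes) -> str:
    """Decode the way the failing line does: bytes.decode() with no arguments."""
    try:
        return raw.decode()  # defaults to UTF-8
    except UnicodeDecodeError as exc:
        return f"UnicodeDecodeError: {exc.reason} at position {exc.start}"


if __name__ == "__main__":
    raw = capture_child_stdout()
    print(raw)
    print(try_default_decode(raw))
```

With an ASCII-only name the same bytes are valid UTF-8, which would match the observation that English file names and paths do not trigger the error.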

Some useful detail on the underlying problem here: https://github.com/python/cpython/issues/105312

@Snowman-s what happens if you set PYTHONIOENCODING=utf-8 before running your code? https://stackoverflow.com/a/7865013/1902513

scanny commented 2 months ago

Engineering note: one plausible solution to this is to avoid attempts to decode the captured stdout bytes and simply use str(output.stdout) instead. Rationale:

Snowman-s commented 2 months ago

@scanny

> @Snowman-s what happens if you set PYTHONIOENCODING=utf-8 before running your code? https://stackoverflow.com/a/7865013/1902513

I ran $env:PYTHONIOENCODING="utf-8:surrogateescape"; python <code>.py, and it still throws the same exception. Same for sys.stdout.reconfigure(encoding='utf-8').


I forgot to mention it until now, but here is the soffice version (the Windows 64-bit build):

    > soffice --version
    LibreOffice 24.8.1.2 87fa9aec1a63e70835390b81c40bb8993f1d4ff6

scanny commented 2 months ago

@Snowman-s that's good to know. That narrows down the possible solutions.
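A likely reason the environment variable made no difference: PYTHONIOENCODING configures the interpreter's own stdin/stdout/stderr, while bytes.decode() called with no arguments always defaults to UTF-8. The sketch below checks this in a child interpreter. The cp932 setting and the two sample bytes are assumptions chosen for illustration, and run_probe is a hypothetical helper.

```python
import os
import subprocess
import sys

# Probe run in a child interpreter: report the stdout encoding, then try a
# bare decode() on two bytes that are valid cp932 but not valid UTF-8.
PROBE = r'''
import sys
print(sys.stdout.encoding)
try:
    b"\x95\xb6".decode()
    print("decoded")
except UnicodeDecodeError:
    print("UnicodeDecodeError")
'''


def run_probe(io_encoding: str) -> list[str]:
    """Run PROBE with PYTHONIOENCODING set and return its output tokens."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", PROBE], capture_output=True, env=env, check=True
    )
    # The probe prints ASCII only, so decoding as ASCII is safe here.
    return result.stdout.decode("ascii").split()


if __name__ == "__main__":
    # The last token shows whether the bare decode() succeeded.
    print(run_probe("cp932"))
```

Even when the child is told its standard streams use cp932, the bare decode() still fails, so the fix has to happen at the point where the captured bytes are decoded.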

Engineering note: I think this rules out us being able to affect how LibreOffice encodes messages it writes to stdout. The options I can think of are these:

  1. Use a try/except block as I mentioned above, falling back to str(bytes_from_stdout) on UnicodeDecodeError which would be mostly readable.
  2. Detect Windows and use the locale encoding to decode stdout bytes in that case.
  3. Use a try/except and use chardet as a backup to auto-detect encoding.

My vote is option 1 since this is for logging, not for UI.
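Options 1 and 2 could look roughly like the sketch below. This is not the library's code: both function names are hypothetical, repr(raw) is used in place of str(raw) because it yields the same text without a BytesWarning under python -b, and errors="replace" in the second function is an added assumption so that logging never raises. Option 3 is left out because chardet is a third-party dependency.

```python
import locale
import sys


def decode_for_logging(raw: bytes) -> str:
    """Option 1: try UTF-8 and fall back to the bytes repr on failure."""
    try:
        return raw.decode().strip()
    except UnicodeDecodeError:
        # Mostly readable, e.g. b'convert \x95\xb6.doc'; fine for a log line.
        return repr(raw)


def decode_with_locale(raw: bytes) -> str:
    """Option 2: on Windows, decode with the locale's preferred encoding."""
    if sys.platform == "win32":
        encoding = locale.getpreferredencoding(False)
    else:
        encoding = "utf-8"
    # errors="replace" (an assumption) substitutes undecodable bytes
    # instead of raising.
    return raw.decode(encoding, errors="replace").strip()


if __name__ == "__main__":
    print(decode_for_logging(b"convert document.doc\n"))
    print(decode_for_logging(b"convert \x95\xb6.doc\n"))
    print(decode_with_locale(b"convert document.doc\n"))
```

Option 1 keeps the change local to the logging path and needs no platform detection, at the cost of a less readable message in the failure case.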