Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

### feat(unstructured/partition/docx.py): Fix Compatibility Issue with Chinese Text in Document Parsing #3096

Closed: JIAQIA closed this 1 month ago

JIAQIA commented 2 months ago

Initially, I underestimated this issue. After several rounds of adjusting the code, I found a problem in how Unstructured currently handles language detection: the language metadata is only added after the partition has run, through the apply_lang_meta function.

This causes a critical conflict:

During the partition process, language information is often needed for functions such as is_possible_narrative_text. If we only add language information after partitioning, it results in:

  1. Inaccuracy in partitioning in a multilingual environment.
  2. Potential repeated calls to detect_languages.

Issue 1 is the problem I encountered and am trying to solve. However, under the current architecture, solving issue 1 may exacerbate issue 2.

To avoid repeated calls to the detect_languages function when invoking partition_xxx methods, the current design uses languages=[""] as a workaround. This is hard for others to understand and maintain. It is a reasonable compromise under the current circumstances, but it is not ideal.
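To make the workaround concrete, here is a minimal, self-contained sketch of the pattern as I understand it. It does not import Unstructured; every name prefixed with toy_ is hypothetical, and the detector is a crude stand-in that only checks for CJK ideographs.

```python
from typing import Dict, List, Optional

detect_calls = 0  # counts runs of the stand-in detector


def toy_detect_languages(text: str) -> List[str]:
    """Stand-in detector (assumption): CJK ideographs mean Chinese, else English."""
    global detect_calls
    detect_calls += 1
    has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in text)
    return ["zho"] if has_cjk else ["eng"]


def toy_apply_lang_metadata(
    elements: List[Dict], languages: Optional[List[str]]
) -> List[Dict]:
    """Attach languages after partitioning; [""] is the 'skip detection' sentinel."""
    if languages == [""]:
        return elements  # the caller promised to handle detection itself
    for element in elements:
        element["languages"] = languages or toy_detect_languages(element["text"])
    return elements


def toy_partition_inner(text: str, languages: Optional[List[str]] = None) -> List[Dict]:
    """Inner partitioner: one element per non-empty line, then language metadata."""
    elements = [{"text": line} for line in text.splitlines() if line.strip()]
    return toy_apply_lang_metadata(elements, languages)


def toy_partition_outer(text: str) -> List[Dict]:
    """Outer partitioner that delegates to the inner one."""
    # Pass the sentinel so the inner call skips detection,
    # then run detection once here.
    elements = toy_partition_inner(text, languages=[""])
    return toy_apply_lang_metadata(elements, None)


elements = toy_partition_outer("Hello world\n你好，世界")
```

With the sentinel, the stand-in detector runs once per element. Without it, the inner and outer calls would each run detection. The cost is that the meaning of [""] is only visible to someone who has read both functions.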

Given that LLMs are fundamentally language models and Unstructured is an international library, language detection and recognition should be a priority in future refactoring.

Here's my proposed approach:

  1. Language information should be included as metadata in Document-Page-Element from the outset, allowing subsequent processes to use this metadata directly, avoiding recomputation and the use of the [""] workaround.
  2. Abstract a set of utility functions dedicated to language detection, separating language-related code from Document-Page-Element code.
  3. For tokenization and other language-specific processes, we can encapsulate a unified plugin standard, enabling engineers from different countries to contribute their own document partition plugins. For instance, for Chinese, we could introduce jieba for better structural recognition.
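The three steps above could look roughly like the sketch below. This is only an illustration of the idea, not Unstructured's actual API: Element, make_element, register_tokenizer, and looks_like_narrative are all hypothetical names, the detector is a crude stand-in, and the Chinese tokenizer simply splits by character where a real plugin might call jieba.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Tokenizer = Callable[[str], List[str]]
TOKENIZERS: Dict[str, Tokenizer] = {}  # step 3: per-language plugin registry


def register_tokenizer(lang: str) -> Callable[[Tokenizer], Tokenizer]:
    """Decorator that registers a tokenizer plugin for one language code."""
    def decorator(func: Tokenizer) -> Tokenizer:
        TOKENIZERS[lang] = func
        return func
    return decorator


@register_tokenizer("eng")
def tokenize_whitespace(text: str) -> List[str]:
    return text.split()


@register_tokenizer("zho")
def tokenize_cjk_chars(text: str) -> List[str]:
    # Stand-in for a real segmenter such as jieba: one token per ideograph.
    return [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]


def detect_language(text: str) -> str:
    """Step 2: language detection kept apart from element code (toy heuristic)."""
    return "zho" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "eng"


@dataclass
class Element:
    text: str
    languages: List[str] = field(default_factory=list)


def make_element(text: str) -> Element:
    # Step 1: detect once, when the element is created, and store the result.
    return Element(text=text, languages=[detect_language(text)])


def looks_like_narrative(element: Element, min_tokens: int = 5) -> bool:
    """Classifier that reads the stored language instead of re-detecting it."""
    tokenizer = TOKENIZERS.get(element.languages[0], tokenize_whitespace)
    return len(tokenizer(element.text)) >= min_tokens


zh = make_element("这是一个用中文写的完整句子。")
en = make_element("Short title")
```

Here the Chinese sentence passes the token threshold because the classifier picks the tokenizer from the element's stored language. With whitespace splitting alone, the same sentence would count as a single token.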

These steps can be implemented gradually. Implementing step 1 alone would have a significant positive impact. Step 2 could resolve 80% of the current issues.

This analysis is based on my reading of the source code after encountering these issues. Experts within Unstructured will undoubtedly have deeper insights, but as a non-native English developer, I hope my perspective can contribute positively to Unstructured's development. Apologies for any inaccuracies.

To solve the multilingual issue, my current modifications might call detect_languages more often, despite my efforts to minimize this. Due to time constraints, I couldn't conduct a detailed performance evaluation of my changes. This task might need to be addressed by Unstructured's developers.

For testing, I used the project's make test. Due to network or other reasons, I couldn't pass all tests even on the main branch. Given my limited time (I have my own platform library to maintain), I only verified that the test results on my modified version match those of the main branch on my Mac.

If there are any issues, please let me know, and I'll try to fix them. I hope Unstructured continues to improve, and I appreciate the excellent work Unstructured has done to benefit developers worldwide.

MthwRobinson commented 2 months ago

Thanks @JIAQIA ! We'll get this reviewed early next week.

MthwRobinson commented 2 months ago

Addressed #3084

JIAQIA commented 1 month ago

@MthwRobinson I've made updates to the CHANGELOG.md.

MthwRobinson commented 1 month ago

Thanks @JIAQIA ! Getting this merged into a feature branch so CI can run, and will get it merged into main from there.

MthwRobinson commented 1 month ago

See #3126

MthwRobinson commented 1 month ago

@JIAQIA - Looks like there were a few unit test failures. See this CI job. Once you fix those, you can reopen a PR from your fork into main. You'll be able to run those locally with make test.