Closed JIAQIA closed 1 month ago
Thanks @JIAQIA ! We'll get this reviewed early next week.
Addressed #3084
@MthwRobinson I've made updates to the CHANGELOG.md.
Thanks @JIAQIA ! Getting this merged into a feature branch so CI can run, and will get it merged into main from there.
See #3126
@JIAQIA - Looks like there were a few unit test failures. See this CI job. Once you fix those, you can reopen a PR from your fork into main. You'll be able to run those locally with make test.
Initially, I underestimated this issue. After repeatedly adjusting the code, I found a problem with how Unstructured currently handles language detection: the language metadata is added only after partitioning has run, through the apply_lang_meta function.
This causes a critical conflict:
During the partition process, language information is often needed for functions such as is_possible_narrative_text. If we only add language information after partitioning, it results in:
Issue 1 is the problem I encountered and am trying to solve. However, under the current architecture, solving issue 1 may exacerbate issue 2.
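The ordering conflict described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: toy_detect_language, toy_is_narrative, and ToyElement are hypothetical stand-ins invented for illustration, not Unstructured's actual detect_languages, is_possible_narrative_text, or element classes, and the classification rules are deliberately crude.

```python
# Toy model only: every name here is a hypothetical stand-in, not Unstructured's API.
from dataclasses import dataclass, field

ENGLISH_STOPWORDS = {"the", "is", "a", "of", "and"}


def toy_detect_language(text: str) -> str:
    """Crude stand-in detector: any CJK ideograph means 'zho', otherwise 'eng'."""
    return "zho" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "eng"


def toy_is_narrative(text: str, language: str) -> bool:
    """Stand-in classifier whose rule depends on the language it is told about."""
    if language == "eng":
        words = text.lower().split()
        return len(words) >= 4 and any(w in ENGLISH_STOPWORDS for w in words)
    # Text without whitespace-separated words: fall back to a length check.
    return len(text) >= 8


@dataclass
class ToyElement:
    text: str
    category: str
    languages: list = field(default_factory=list)


def partition_then_detect(text: str) -> ToyElement:
    """Classify first (implicitly assuming English), attach language afterwards."""
    category = "NarrativeText" if toy_is_narrative(text, "eng") else "Text"
    return ToyElement(text, category, [toy_detect_language(text)])


def detect_then_partition(text: str) -> ToyElement:
    """Detect the language first so the classifier can use it."""
    language = toy_detect_language(text)
    category = "NarrativeText" if toy_is_narrative(text, language) else "Text"
    return ToyElement(text, category, [language])


sample = "这是一段用于测试的中文叙述文本。"
late = partition_then_detect(sample)
early = detect_then_partition(sample)
print(late.category, late.languages)
print(early.category, early.languages)
```

In this toy, both paths end up with the same language metadata, but only the path that detects the language before classifying lets the classifier benefit from it.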
To avoid repeated calls to the detect_languages function when invoking partition_xxx methods, the current design uses languages=[""] as a workaround. This is hard for others to understand and maintain. It is a workable compromise under the current circumstances, but it is not ideal.
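To make the workaround concrete, here is a minimal sketch of how a languages=[""] sentinel can suppress repeated detection. The function names and the exact behavior are assumptions for illustration (toy_detect_languages, toy_partition_chunk, and toy_partition_document are hypothetical), not Unstructured's real signatures.

```python
# Hypothetical sketch of a sentinel-based workaround; not Unstructured's real code.
detect_calls = 0


def toy_detect_languages(text):
    """Stand-in detector that also counts how often it is called."""
    global detect_calls
    detect_calls += 1
    return ["zho"] if any("\u4e00" <= ch <= "\u9fff" for ch in text) else ["eng"]


def toy_partition_chunk(text, languages=None):
    """Inner partitioner: [""] is read as 'skip detection, the caller handles it'."""
    if languages == [""]:
        return {"text": text, "languages": None}
    return {"text": text, "languages": languages or toy_detect_languages(text)}


def toy_partition_document(chunks):
    """Outer partitioner: suppress per-chunk detection, then detect once overall."""
    elements = [toy_partition_chunk(chunk, languages=[""]) for chunk in chunks]
    doc_languages = toy_detect_languages(" ".join(chunks))
    for element in elements:
        element["languages"] = doc_languages
    return elements


elements = toy_partition_document(["Hello world.", "This is a test."])
print(detect_calls, elements[0]["languages"])
```

The sketch shows why the sentinel is hard to read: nothing at the call site says that [""] means "do not detect", so a reader has to know the convention in advance.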
Given that LLMs are fundamentally language models and Unstructured is an international library, language detection and recognition should be a priority in future refactoring.
Here's my proposed approach:
These steps can be implemented gradually. Implementing step 1 alone would have a significant positive impact. Step 2 could resolve 80% of the current issues.
This analysis is based on my reading of the source code after encountering these issues. Experts within Unstructured will undoubtedly have deeper insights, but as a non-native English developer, I hope my perspective can contribute positively to Unstructured's development. Apologies for any inaccuracies.
My current modifications might lead to more frequent calls to detect_languages to solve the multilingual issue, despite my efforts to minimize such occurrences. Due to time constraints, I couldn't conduct a detailed performance evaluation of my changes. This task might need to be addressed by Unstructured's developers.
For testing, I used the current project's make test. Due to network or other reasons, I couldn't pass all tests even on the main branch. Given my limited time (as I have my own platform library to maintain), I ensured that the test results on my modified version are consistent with the main branch on my Mac.
If there are any issues, please let me know, and I'll try to fix them. I hope Unstructured continues to improve, and I appreciate the excellent work Unstructured has done to benefit developers worldwide.