Open JIAQIA opened 2 months ago
Hi @JIAQIA! If you have a draft fix, you can open a PR from your fork into main
and we can review.
@MthwRobinson Thanks for your response.
Here is my PR screenshot, FYI
I need to clarify that several test cases currently fail. I have marked them with TODO because I am not sure how to modify them. As far as I can tell, the text in these test cases is not English. Determining the format for such text might require a new strategy, so I did not change the test cases rashly and left TODOs instead.
Additionally, some test cases might be failing due to network issues in China. I have not thoroughly investigated the cause, as I still have unfinished work. Therefore, if you find any problems during the review, I can assist with the modifications.
Describe the bug
While parsing document formats like DOCX and MD (these are the primary formats I use, so this issue might occur with other formats as well), I encountered a strange phenomenon where the content of the document is incorrectly identified as the Title by Unstructured. The screenshots below illustrate this issue:
Original text:
But after parsing with Unstructured, the result is as follows:
Through debugging the source code, I found that the issue lies with functions like is_possible_narrative_text. Although these functions provide recognition capabilities for different languages, the language setting is not correctly passed through the entire partition process. This causes the text to be judged as English regardless of its actual language, leading to the above-mentioned issue.
To Reproduce
The test case to reproduce this issue has been provided in the code fix. To reproduce with DOCX format, the code is as follows:
You can reproduce the issue by copying the main code to test_unstructured/partition/test_docx.
Expected behavior
I expect the normal text to be recognized correctly and the issue of Chinese content being identified as Title to be fixed.
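To make the failure mode and the expected behavior concrete, here is a minimal, self-contained sketch. It is not Unstructured's actual code: looks_like_narrative, classify_buggy, and classify_fixed are hypothetical stand-ins, the heuristics are deliberately simplistic, and "zho" is used only as an illustrative language code. The point it tries to show is how a languages argument that is accepted at the top level but never forwarded makes the check fall back to its English default.

```python
from typing import List, Optional


def looks_like_narrative(text: str, languages: Optional[List[str]] = None) -> bool:
    """Hypothetical stand-in for a check such as is_possible_narrative_text."""
    languages = languages or ["eng"]  # assumed default: English
    if "eng" in languages:
        # English-only heuristic: require several space-separated words.
        return len(text.split()) >= 5
    # Language-agnostic fallback: reasonably long text ending in sentence punctuation.
    stripped = text.rstrip()
    return len(stripped) >= 10 and stripped[-1] in ".!?。！？"


def classify_buggy(text: str, languages: Optional[List[str]] = None) -> str:
    # Bug pattern: `languages` is accepted here but never forwarded,
    # so the check below always runs with its English default.
    return "NarrativeText" if looks_like_narrative(text) else "Title"


def classify_fixed(text: str, languages: Optional[List[str]] = None) -> str:
    # Fix pattern: forward `languages` to the language-aware check.
    return "NarrativeText" if looks_like_narrative(text, languages=languages) else "Title"


chinese_paragraph = "这是一段用于测试的中文正文内容，它不应该被识别为标题。"
english_paragraph = "This is a plain English sentence with enough words."

print(classify_buggy(chinese_paragraph, languages=["zho"]))  # Title
print(classify_fixed(chinese_paragraph, languages=["zho"]))  # NarrativeText
print(classify_fixed(english_paragraph))                     # NarrativeText
```

In this toy model the Chinese paragraph has no spaces, so the English word-count heuristic sees a single "word" and the text falls through to Title. Forwarding the language list is enough to change the outcome without touching the English path.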
Screenshots
Screenshots as above.
Environment Info ...
Additional context
I have attempted to fix this issue, and the original test cases test_docx and test_md all pass locally. I have also provided corresponding test code for the fix (currently only DOCX and MD formats have been fixed; time permitting, I can continue to provide fixes as needed).
However, I am unsure which branch the PR should target. I did not find any Contributor guidelines in the Readme, so I am raising this issue in the hope that a Committer can provide some advice.