Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/<Compatibility Issue with Chinese Text in Document Parsing> #3084

Open JIAQIA opened 2 months ago

JIAQIA commented 2 months ago

Describe the bug

While parsing DOCX and MD documents (these are the formats I use most, so other formats may be affected as well), I found that ordinary body text is incorrectly identified as Title by Unstructured. The screenshots below illustrate this issue:

Original text:

image

But after parsing with Unstructured, the result is as follows:

image

By debugging the source code, I found that the issue lies in functions such as is_possible_narrative_text. Although these functions support different languages, the language setting is not passed through the partition process. As a result, the text is always evaluated as English regardless of its actual language, which leads to the misclassification shown above.
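To make the failure mode concrete, here is a minimal sketch of how a language-aware check can still misclassify text when the caller does not forward the language. The functions `toy_is_narrative` and `toy_classify`, the `min_units` threshold, and the `"zh"` code are all hypothetical and are not the Unstructured implementation; they only illustrate the pattern described above.

```python
import re


# Hypothetical toy heuristic, NOT the unstructured implementation.
def toy_is_narrative(text: str, language: str = "en", min_units: int = 5) -> bool:
    """Treat text as narrative when it has at least `min_units` word-like units."""
    if language == "zh":
        # Chinese has no spaces between words, so count CJK characters instead.
        units = re.findall(r"[\u4e00-\u9fff]", text)
    else:
        # English-style tokenization: whitespace-separated words.
        units = text.split()
    return len(units) >= min_units


def toy_classify(text: str, language: str = "en") -> str:
    """Label text as NarrativeText or Title using the toy heuristic."""
    return "NarrativeText" if toy_is_narrative(text, language) else "Title"


sentence = "春节放假从大年三十开始共计放假一个月"

# The caller does not forward the language, so the English default is used
# and the whole sentence counts as a single "word".
default_result = toy_classify(sentence)

# The caller forwards the language explicitly.
forwarded_result = toy_classify(sentence, language="zh")

print(default_result, forwarded_result)
```

In this sketch the check itself handles Chinese correctly; the wrong label comes only from the caller falling back to the default language, which is the kind of gap this issue reports.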

To Reproduce

A test case that reproduces this issue is included with the code fix. For the DOCX format, the code is as follows:

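import tempfile

from docx import Document as DocxDocument
from unstructured.partition.docx import partition_docx


def create_test_docx(file_path):
    """Write a small Chinese DOCX with headings, paragraphs, and a bullet list."""
    doc = DocxDocument()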

    # Add title and text content
    doc.add_heading('春节放假通知', level = 1)
    doc.add_paragraph('\n')
    doc.add_paragraph('春节放假从大年 30 开始\n共计放假一个月\n比法定假期长三周\n')

    doc.add_heading('标题 2', level = 2)
    doc.add_heading('标题 3', level = 3)
    doc.add_heading('又一个标题 2', level = 2)

    doc.add_paragraph('正文普通\n')

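    # Add list ('List Bullet' is the style name; 'ListBullet' is the style id,
    # and python-docx deprecates lookup by id)
    doc.add_paragraph('一组\n', style = 'List Bullet')
    doc.add_paragraph('二组\n', style = 'List Bullet')
    doc.add_paragraph('三组\n', style = 'List Bullet')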

    doc.add_paragraph('继续正文\n')

    # Save document
    doc.save(file_path)

def test_partition_zh_docs() -> None:
    """
    Fix the issue of erroneously recognizing NarrativeText as Title when splitting Chinese DOCX documents
    """
    with tempfile.NamedTemporaryFile(suffix = ".docx", delete = False) as tmp:
        create_test_docx(tmp.name)
        elements = partition_docx(tmp.name)

        # Print or check partition results
        for element in elements:
            print(element)

        # Assertions
        assert any('春节放假通知' in element.text for element in elements)
        assert any('春节放假从大年 30 开始' in element.text for element in elements)
        assert any('标题 2' in element.text for element in elements)
        assert any('标题 3' in element.text for element in elements)
        assert any('又一个标题 2' in element.text for element in elements)
        assert any('正文普通' in element.text for element in elements)
        assert any('一组' in element.text for element in elements)
        assert any('二组' in element.text for element in elements)
        assert any('三组' in element.text for element in elements)
        assert any('继续正文' in element.text for element in elements)
        assert list(filter(lambda x: '正文普通' in x.text, elements))[0].category == 'NarrativeText'
        assert list(filter(lambda x: '一组' in x.text, elements))[0].category == 'ListItem'
        assert list(filter(lambda x: '继续正文' in x.text, elements))[0].category == 'NarrativeText'

You can reproduce the issue by copying the code above into test_unstructured/partition/test_docx.
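The category assertions at the end of the test could be factored into a small helper that fails with a readable message when no element matches. This is only a sketch: `category_of` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins that mimic the `text` and `category` attributes used in the test, not real partition output.

```python
from types import SimpleNamespace


def category_of(elements, needle):
    """Return the category of the first element whose text contains `needle`."""
    for element in elements:
        if needle in element.text:
            return element.category
    raise AssertionError(f"no element contains {needle!r}")


# Stand-in objects for illustration only; real elements come from partition_docx.
elements = [
    SimpleNamespace(text="春节放假通知", category="Title"),
    SimpleNamespace(text="正文普通", category="NarrativeText"),
    SimpleNamespace(text="一组", category="ListItem"),
]

assert category_of(elements, "正文普通") == "NarrativeText"
assert category_of(elements, "一组") == "ListItem"
```

With the real test, the same helper would be called on the list returned by partition_docx.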

Expected behavior

I expect ordinary Chinese body text to be recognized as NarrativeText rather than being identified as Title.

Screenshots

See the screenshots above.

Environment Info ...

Additional context

I have attempted a fix, and the existing test_docx and test_md test cases all pass locally. I have also written test code for the fix. So far only the DOCX and MD formats are fixed; time permitting, I can extend the fix to other formats as needed.

However, I am unsure which branch the PR should target. I did not find any contributor guidelines in the README, so I am raising this issue in the hope that a maintainer can advise.

JIAQIA commented 2 months ago

image

MthwRobinson commented 2 months ago

Hi @JIAQIA! If you have a draft fix, you can open a PR from your fork into main and we can review it.

JIAQIA commented 2 months ago

@MthwRobinson Thanks for your response.

Here is a screenshot of my PR, FYI:

image

JIAQIA commented 2 months ago

I should clarify that several test cases currently fail. I have marked them with TODO because I am not sure how they should be changed. As far as I can tell, the text in those test cases is not English, and deciding how such text should be classified may require a new strategy. For that reason I left TODOs rather than modifying the test cases hastily.

Additionally, some test cases may be failing because of network issues in China. I have not investigated the cause thoroughly because I still have other work to finish. If you find any problems during the review, I can help with the changes.