Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

Problems when parsing Chinese PDF documents #2999

Open WangJiaxin-x opened 4 months ago

WangJiaxin-x commented 4 months ago

Hi, when I use partition_type(file=io.BytesIO(file.file.read()), languages=["chi_sim"]) to parse Chinese PDF documents, the paragraph text is split so that each line of text becomes its own element. Another problem is that the element type isn't accurate: it should be UncategorizedText, but it is actually Title.

MthwRobinson commented 4 months ago

Hi @WangJiaxin-x - do you have an example document available that we could use to replicate this? Thanks!

idiotTest commented 4 months ago

OK, let me give an example. I am attaching two documents: one is the raw PDF file, and the other is the JSON I got by running the code below. The bug is that some elements should be Title, but they are UncategorizedText in the result. The result also shows that some paragraphs are not recognized as paragraphs: as you can see in the JSON, each line within a paragraph is recognized as a separate element, so one paragraph is split into many elements. I don't think this is a good result. Hoping for your reply, thanks!

import json
from typing import Iterable, Optional

from unstructured.documents.elements import Element
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import _fix_metadata_field_precision, elements_to_dicts

elements = partition_pdf(filename=r"C:\Users\A\Desktop\test.pdf",
                         languages=["chi_sim"])  # bytes -> BinaryIO

def elements_to_json_chi(
        elements: Iterable[Element],
        filename: Optional[str] = None,
        indent: int = 4,
        encoding: str = "utf-8",
) -> Optional[str]:
    """Saves a list of elements to a JSON file if filename is specified.

    Otherwise, return the list of elements as a string.
    """
    # -- serialize `elements` as a JSON array (str) --
    precision_adjusted_elements = _fix_metadata_field_precision(elements)
    element_dicts = elements_to_dicts(precision_adjusted_elements)
    json_str = json.dumps(element_dicts, ensure_ascii=False, indent=indent, sort_keys=True)

    if filename is not None:
        with open(filename, "w", encoding=encoding) as f:
            f.write(json_str)
        return None

    return json_str

elements_to_json_chi(elements, filename="./test_json.json")

Also, this shows that the function elements_to_json in the package unstructured.staging.base has an encoding problem with Chinese text. I think the ensure_ascii parameter passed to json.dump should be False.

Attachments: test.pdf, test_json.json
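For reference, here is a minimal sketch of what ensure_ascii changes, using only the standard json module. The sample dict is hypothetical (it is not taken from test_json.json) and only loosely resembles a serialized element. With the default ensure_ascii=True the output is still valid JSON, but every Chinese character is written as a \uXXXX escape, which makes the file hard to read. With ensure_ascii=False the characters are written as-is.

```python
import json

# Hypothetical element dict, for illustration only.
sample = {"type": "Title", "text": "中文标题"}

# Default behaviour: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(sample)

# ensure_ascii=False keeps the original characters in the output.
readable = json.dumps(sample, ensure_ascii=False)

# Both strings decode back to exactly the same data.
assert json.loads(escaped) == json.loads(readable) == sample
```

When writing the readable form to disk, the file should be opened with an explicit encoding such as utf-8, as the elements_to_json_chi function above does.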

MthwRobinson commented 4 months ago

Thank you for the example! We're tracking this and will investigate as soon as we can.

idiotTest commented 4 months ago

OK, thanks for your help!