Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/text-as-html-missing-content #3358

Open mpolomdeepsense opened 1 month ago

mpolomdeepsense commented 1 month ago

Describe the bug

Sometimes, when chunking is used, the text_as_html of Table elements is missing some of the content that is present in the text property. Reasoning:

To Reproduce

from io import StringIO

import pandas as pd
import unstructured_client
from unstructured_client.models import operations, shared
from unstructured_client.models.errors import SDKError
from unstructured.staging.base import elements_from_dicts

client = unstructured_client.UnstructuredClient(
    api_key_auth="...",
    server_url=" ...",
)

filename_a = r"doc.pdf"

with open(filename_a, "rb") as f:
    data = f.read()

req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=shared.Files(
            content=data,
            file_name=filename_a,
        ),
        strategy="hi_res",
        coordinates=True,
        hi_res_model_name="yolox",
        chunking_strategy="by_page",
        split_pdf_page=False,
        include_page_breaks=True,
        output_format="application/json",
        languages=["eng"],
    ),
)

resp = client.general.partition(req)

elements = elements_from_dicts(resp.elements)
tables = [e for e in elements if e.category == "Table"]
for table in tables:
    # pd.read_html returns a list of DataFrames, one per <table> in the HTML.
    dataframes = pd.read_html(StringIO(table.metadata.text_as_html))
    print(dataframes)

Expected behavior

The text and text_as_html of chunked elements contain the same content (with text_as_html holding that content parsed into an HTML table).

mpolomdeepsense commented 1 month ago

@christinestraub

christinestraub commented 1 month ago

@mpolomdeepsense Can you please share the PDF document that you're testing with?

alastairmarchant commented 1 month ago

I have been encountering the same issue with a test PDF I created. The first row of the table is present in elements[0].text but not in elements[0].metadata.text_as_html. I was using this PDF, test_pdf_table.pdf, and the following code.

>>> from unstructured.partition.pdf import partition_pdf
>>> elements = partition_pdf(
...     filename="test_pdf_table.pdf",
...     url=None,
...     infer_table_structure=True,
...     strategy="hi_res",
... )
>>> elements[0].text
'Header 1 Text 1.1 Text 1.2 Header 2 Text 2.1 Text 2.2 Header 3 Text 3.1 Text 3.2'
>>> elements[0].metadata.text_as_html
'<table><tbody><tr><td>Text 1.1</td><td>Text 2.1</td><td>Text 3.1</td></tr><tr><td>Text 1.2</td><td>Text 2.2</td><td>Text 3.2</td></tr></tbody></table>'

Output of collect_env.py

OS version:  Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python version:  3.11.4
unstructured version:  0.14.10
unstructured-inference version:  0.7.36
pytesseract version:  0.3.10
Torch version:  2.2.0
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice version:  LibreOffice 7.3.7.2 30(Build:2)

As far as I can tell, after digging into the code a bit, it seems the issue comes from the cropping of the image in unstructured.partition.pdf_image.ocr.supplement_element_with_table_extraction which is causing the top border of the table to be cut off. This means the tables_agent is not able to detect the top row as a row, only identifying the 2nd row onwards. Changing it to crop one pixel higher seems to fix the issue.
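The mismatch shown above can also be checked mechanically. The following is a minimal sketch using only the Python standard library; `CellCollector` and `text_missing_from_html` are hypothetical helper names and not part of unstructured. It removes the text of every HTML cell from the element text and reports what is left over.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell in an HTML table."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def text_missing_from_html(text, html):
    """Return the part of `text` that no HTML cell accounts for."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    remainder = text
    # Remove longer cells first so a short cell cannot eat part of a longer one.
    for cell in sorted(parser.cells, key=len, reverse=True):
        cell = cell.strip()
        if cell:
            remainder = remainder.replace(cell, "", 1)
    # Collapse the whitespace left behind by the removed cells.
    return " ".join(remainder.split())


# Values copied from the session above.
text = "Header 1 Text 1.1 Text 1.2 Header 2 Text 2.1 Text 2.2 Header 3 Text 3.1 Text 3.2"
html = (
    "<table><tbody><tr><td>Text 1.1</td><td>Text 2.1</td><td>Text 3.1</td></tr>"
    "<tr><td>Text 1.2</td><td>Text 2.2</td><td>Text 3.2</td></tr></tbody></table>"
)
print(text_missing_from_html(text, html))
```

An empty result means every piece of the text was found in some cell. The check is only a rough heuristic: it compares exact substrings, so differences in OCR spacing or punctuation would also show up as "missing" text.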

huangpan2507 commented 4 weeks ago

Hi @alastairmarchant, can you tell me how you solved this problem? What does "Changing it to crop one pixel higher seems to fix the issue" mean, and how do I do it? Thanks!

christinestraub commented 3 weeks ago

Hi @huangpan2507, it means adjusting the environment variable TABLE_IMAGE_CROP_PAD, e.g.

import os

os.environ["TABLE_IMAGE_CROP_PAD"] = "1"

If you need more accurate table processing results, consider using our API. The document parsing model available through the API is more accurate, and incremental improvements to the model will be deployed there. This model is not supported in open source. CC: @alastairmarchant
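To illustrate what a crop pad of one pixel does geometrically, here is a minimal sketch. `padded_crop_box` is a hypothetical helper written for this explanation, not the function unstructured uses internally, and the assumption that the pad is applied equally on all four sides is mine. It grows the table's bounding box by `pad` pixels and clamps the result to the page image, so a border that sits exactly on the edge of the detected box is no longer cut off.

```python
def padded_crop_box(bbox, pad, image_size):
    """Expand (x1, y1, x2, y2) by `pad` pixels per side, clamped to the image.

    Hypothetical helper for illustration only.
    """
    x1, y1, x2, y2 = bbox
    width, height = image_size
    return (
        max(0, x1 - pad),
        max(0, y1 - pad),
        min(width, x2 + pad),
        min(height, y2 + pad),
    )


# A table detected at y1=20 is cropped starting at y=19 once pad=1,
# which keeps the top border line inside the cropped image.
print(padded_crop_box((10, 20, 110, 220), pad=1, image_size=(200, 300)))
```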

huangpan2507 commented 3 weeks ago

Hi @christinestraub, thanks for your kind help. I have another issue, this time with the results for a PDF that contains both English and Chinese text. I run the same code, but the results for the Chinese characters differ between runs: sometimes they are very good and sometimes very bad, especially when the Chinese characters are on the first line of a page or at the edge of a page, or when complex Chinese characters are present. I'm not sure whether the environment was the same for both runs. Which module could cause this effect, and which version of that module is better at recognizing Chinese and English characters? Can you help me?

christinestraub commented 3 weeks ago

@huangpan2507 Can you please provide a pdf document that we could use to reproduce?

huangpan2507 commented 3 weeks ago

Finance-policy.pdf

Hi @christinestraub, here's the document. I have desensitized the data, so it is a little different from the one I ran before, but some of the issues still occur, especially in the results for page 1 and page 5.

Some results for page 1 (English characters) look like this:

page_content='OVerVvieW .ee 2 1 费用 分 类 Payment Categories .4 2 2 请 款 对 象 Persons that Request the Payments .es 3 3 付款 对 象 信息 维护 Recipient information maintenance .pp 3 4 所 需 文件 ** Required Documents .4 3 5 付款 方式 Payment Method .4 4 6 付款 期 限 Payment Terms .4 4 7 报销 期 限 Reimbursement 攻 me limit 4 票 Invoice (FaPiag) 5 1 有 效 发 票 Official Invoice: 5 2 发 票 遗 失 Invoice LOSt** 6

The corresponding original text in the PDF is:

Overview ........ 2
1 费用分类 Payment Categories ........ 2
2 请款对象 Persons that Request the Payments ........ 3
3 付款对象信息维护 Recipient information maintenance ........ 3
4 所需文件 Required Documents ........ 3
5 付款方式 Payment Method ........ 4
6 付款期限 Payment Terms ........ 4
7 报销期限 Reimbursement time limit ........ 4
发票 Invoice (FaPiao) ........ 5
1 有效发票 Official Invoice: ........ 5
2 发票遗失 Invoice Lost ........ 6

Some results for page 5 (Chinese characters) look like this:

page_content='Company Name 公司 名 称 : BESCD IRA (kM) APRASI'
page_content='Company Name AS) ZAR: BE CD FA (ACM) AMATAMNDAS'
page_content='Company Name 公司 名 称 : 登 CD 技术 (北京 ) 有 限 公 司 天 津 分 公司 Taxpayer ID 44#t AiR SIS: 1234567889D'

The corresponding original text in the PDF is:

Beijing: Company Name 公司名称: 叠登 CD 技术(北京)有限公司 Taxpayer ID 纳税人识别号:1234567889
Wuhan: Company Name 公司名称: 叠登 CD 技术(北京)有限公司武汉分公司 Taxpayer ID 纳税人识别号:1234567889
Tianjing: Company Name 公司名称: 叠登 CD 技术(北京)有限公司天津分公司 Taxpayer ID 纳税人识别号:1234567889D
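To put a number on how much two runs differ, one option is to score each OCR output against the known original text. The sketch below uses only the standard library; `normalize` and `ocr_similarity` are hypothetical helper names. Pairing the sample strings with the Tianjin line, and trimming the third output to its company-name part, are assumptions made for illustration.

```python
import difflib
import unicodedata


def normalize(s):
    """Fold full-width forms to ASCII (NFKC) and drop all whitespace."""
    return "".join(unicodedata.normalize("NFKC", s).split())


def ocr_similarity(ocr_text, reference):
    """Return a similarity ratio in [0, 1] between OCR output and reference."""
    matcher = difflib.SequenceMatcher(None, normalize(ocr_text), normalize(reference))
    return matcher.ratio()


# Reference: company-name part of the Tianjin line from the original PDF text.
reference = "Company Name 公司名称: 叠登 CD 技术(北京)有限公司天津分公司"

# Two OCR outputs from the results above. Assumption: both correspond to the
# same line, and the second has been trimmed to its company-name part.
runs = {
    "bad run": "Company Name AS) ZAR: BE CD FA (ACM) AMATAMNDAS",
    "good run": "Company Name 公司 名 称 : 登 CD 技术 (北京 ) 有 限 公 司 天 津 分 公司",
}

for label, ocr_text in runs.items():
    print(f"{label}: {ocr_similarity(ocr_text, reference):.3f}")
```

Whitespace is removed before comparing because the OCR output inserts spaces between Chinese characters. Without that step, even a run that recognizes every character correctly would score poorly.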