Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

`partition_html` fails when it finds partially correct HTML code #3426

Open Tibiritabara opened 1 month ago

Tibiritabara commented 1 month ago

I am currently partitioning a DOCX file with unstructured, using the following input parameters:

{
  "filename": "document.docx",
  "response_type": "application/json",
  "coordinates": false,
  "encoding": "utf-8",
  "hi_res_model_name": null,
  "include_page_breaks": false,
  "ocr_languages": null,
  "pdf_infer_table_structure": true,
  "skip_infer_table_types": ["pdf"],
  "strategy": "auto",
  "xml_keep_tags": false,
  "languages": null,
  "extract_image_block_types": null,
  "unique_element_ids": false,
  "chunking_strategy": "by_title",
  "combine_under_n_chars": null,
  "max_characters": 500,
  "multipage_sections": true,
  "new_after_n_chars": null,
  "overlap": 0,
  "overlap_all": false,
  "starting_page_number": null
}

This returns a set of documents, which I process further to extract the HTML tables they contain into dataframes using the partition_html function:

from unstructured.partition.html import partition_html

elements = partition_html(
    text=html_table,
    **self.partitioning_parameters,  # type: ignore
)

On some occasions, unstructured returns invalid HTML for the table, e.g.:

text = '                                </td></tr>\n</tbody>\n</table>'

When this happens, partition_html fails because the following call inside it returns None:

document_tree = etree.fromstring(html_text.encode("utf-8"), html_parser)

There should be exception handling for the case where chunking produces invalid HTML and this call fails.
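
Until such handling exists, one possible caller-side workaround is to check each fragment before re-partitioning it. The sketch below is only an illustration, not part of unstructured: the helper name is_usable_table_html and the class _TagBalanceChecker are hypothetical, and it uses the standard-library html.parser rather than lxml. It treats a fragment as usable only if it opens a table and contains no end tag without a matching opener.

```python
from html.parser import HTMLParser


class _TagBalanceChecker(HTMLParser):
    """Hypothetical helper: tracks opened tags and flags stray end tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []         # stack of currently open tag names
        self.saw_table = False      # True once a <table> start tag is seen
        self.stray_end_tag = False  # True if an end tag has no opener

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.saw_table = True
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to the matching opener; this tolerates omitted
            # optional end tags such as </td> or </tr>.
            while self.open_tags.pop() != tag:
                pass
        else:
            self.stray_end_tag = True


def is_usable_table_html(fragment):
    """Return True if the fragment opens a table and has no stray end tags."""
    checker = _TagBalanceChecker()
    checker.feed(fragment)
    checker.close()
    return checker.saw_table and not checker.stray_end_tag
```

With this check, the fragment shown above (only closing tags) would be skipped, while a complete table such as '<table><tr><td>a</td></tr></table>' would be passed on to partition_html.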

scanny commented 1 month ago

@Tibiritabara What version of unstructured are you using? There were some very recent changes to the HTML parser in version 0.15.0 so first thing to check would be that.

Tibiritabara commented 1 month ago

Thank you so much for your follow-up. I am familiar with the latest changes, as I went through the code differences while debugging locally. I am currently using v0.15.0. The issue is that there is no exception handling for the case where etree returns None. The only exception handler retries the etree HTML parsing, which fails again.
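
In the meantime, a batch of fragments can be protected on the caller side so that one bad table does not abort the rest. This is a minimal sketch under assumptions: partition_all is a hypothetical helper, the partitioning function is passed in as a callable, and the except clause is deliberately broad because the exact exception type raised is not stated in this thread.

```python
from typing import Callable, Iterable, List, Tuple


def partition_all(
    fragments: Iterable[str],
    partition: Callable[[str], list],
) -> Tuple[List[list], List[Tuple[str, Exception]]]:
    """Partition each fragment; collect failures instead of aborting the batch."""
    results = []
    failures = []
    for fragment in fragments:
        try:
            results.append(partition(fragment))
        except Exception as exc:
            # Broad on purpose (assumption): the concrete exception type
            # is not given in the report, so every failure is recorded.
            failures.append((fragment, exc))
    return results, failures
```

It could be called as partition_all(tables, lambda html: partition_html(text=html)), where tables is the list of text_as_html strings; the failures list then shows which fragments need attention.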

scanny commented 1 month ago

@Tibiritabara I'm not getting the sequence of events here.

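Are you saying:

  • You are partitioning DOCX documents that contain tables.
  • The Table elements produced include .metadata.text_as_html fields that are sometimes invalid.
  • You re-partition these HTML snippets and it fails?

If so, we should take a look at how invalid HTML is produced for .text_as_html, although perhaps improvements to partition_html() when the HTML can't be parsed are possible.

Can you clarify, and can you provide an example document (the shorter the better) that reproduces this behavior?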

huangpan2507 commented 1 month ago

Hi @Tibiritabara, which function and which version accept these parameters? Do they belong to the API? Thanks.

Tibiritabara commented 1 month ago

Dear Team,

@scanny answering your questions:

  • You are partitioning DOCX documents that contain tables.

Yes, that is the case

  • The Table elements produced include .metadata.text_as_html fields that are sometimes invalid.

Indeed

  • You re-partition these HTML snippets and it fails?

Yes.

My intention is to use the text_as_html field from the metadata to create dataframes. Since I know that text_as_html might not always be well-formed, or might include noise, I re-partition it with partition_html.

Regarding an example document: unfortunately, I am unable to share the source document for this issue. I will try to create a new document, free of confidential data, that reproduces the issue and share it with you.


@huangpan2507 answering your questions:

I am using the unstructured API to extract the nodes from the documents. The API version I am currently using is v0.0.73. This is the Docker image I am pulling:

docker pull quay.io/unstructured-io/unstructured-api:0.0.73

To re-partition, I am using unstructured 0.15.0.


Thank you so much everyone for your support.