Open Tibiritabara opened 4 months ago
@Tibiritabara What version of unstructured are you using? There were some very recent changes to the HTML parser in version 0.15.0, so the first thing to check would be that.
Thank you so much for your follow-up. I am familiar with the latest changes, as I also went through the code differences when debugging locally. I am currently using v0.15.0. The issue is that there is no exception handling when etree returns a NoneType. The only exception handling retries the etree HTML parsing, which fails again.
@Tibiritabara I'm not getting the sequence of events here.
Are you saying:
- You are partitioning DOCX documents that contain tables.
- The Table elements produced include .metadata.text_as_html fields that are sometimes invalid.
- You re-partition these HTML snippets and it fails?
If so we should take a look at how invalid HTML is produced for .text_as_html, although perhaps improvements to partition_html() when the HTML can't be parsed are possible.
Can you clarify and can you provide an example document (shorter the better) that reproduces this behavior?
I am currently partitioning a docx file with unstructured, using the following input params:
    {
      "filename": "document.docx",
      "response_type": "application/json",
      "coordinates": false,
      "encoding": "utf-8",
      "hi_res_model_name": null,
      "include_page_breaks": false,
      "ocr_languages": null,
      "pdf_infer_table_structure": true,
      "skip_infer_table_types": ["pdf"],
      "strategy": "auto",
      "xml_keep_tags": false,
      "languages": null,
      "extract_image_block_types": null,
      "unique_element_ids": false,
      "chunking_strategy": "by_title",
      "combine_under_n_chars": null,
      "max_characters": 500,
      "multipage_sections": true,
      "new_after_n_chars": null,
      "overlap": 0,
      "overlap_all": false,
      "starting_page_number": null
    }
This returns a set of documents that I further process to extract the HTML tables found into dataframes, using the partition_html function:
    from unstructured.partition.html import partition_html

    elements = partition_html(
        text=html_table,
        **self.partitioning_parameters,  # type: ignore
    )
On some occasions, unstructured returns invalid HTML for the table, e.g.:
text = ' </td></tr>\n</tbody>\n</table>'
When this happens, the partition_html method fails, because this call returns NoneType:
document_tree = etree.fromstring(html_text.encode("utf-8"), html_parser)
There should be exception handling for the case where chunking produces incorrect HTML and this method fails.
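Until something like that exists in the library, a caller-side guard is one possible workaround. The sketch below is only an illustration: `looks_like_complete_table` and `safe_partition` are hypothetical helper names, not part of unstructured. The partitioning function is passed in as a parameter (in practice this would presumably be `partition_html`), and because the exact exception raised when the tree is `None` is not shown in this thread, the sketch catches `Exception` broadly.

```python
import logging
from html.parser import HTMLParser
from typing import Any, Callable, List

logger = logging.getLogger(__name__)


class _TableTagCounter(HTMLParser):
    """Counts opening and closing <table> tags in an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.opened = 0
        self.closed = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.opened += 1

    def handle_endtag(self, tag):
        if tag == "table":
            self.closed += 1


def looks_like_complete_table(html_text: str) -> bool:
    """Cheap pre-check: at least one <table>, and as many closes as opens."""
    counter = _TableTagCounter()
    counter.feed(html_text)
    counter.close()
    return counter.opened > 0 and counter.opened == counter.closed


def safe_partition(
    html_text: str, partition: Callable[..., List[Any]], **kwargs: Any
) -> List[Any]:
    """Call `partition(text=...)`, returning [] for fragments that cannot be parsed.

    `partition` is an assumption here: any callable with a `text` keyword,
    such as unstructured's partition_html.
    """
    if not looks_like_complete_table(html_text):
        logger.warning("Skipping truncated table fragment: %r", html_text[:60])
        return []
    try:
        return partition(text=html_text, **kwargs)
    except Exception:  # broad on purpose; the concrete exception type is unknown
        logger.exception("Partitioning failed for fragment: %r", html_text[:60])
        return []
```

With this guard, the fragment `' </td></tr>\n</tbody>\n</table>'` from above is skipped before it reaches the parser, since it closes a table it never opened.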
Hi @Tibiritabara, I want to know which function and which version had these parameters. Do they belong to the API function? Thanks
Dear Team,
@scanny answering your questions:
- You are partitioning DOCX documents that contain tables.
Yes, that is the case
- The Table elements produced include .metadata.text_as_html fields that are sometimes invalid.
Indeed
- You re-partition these HTML snippets and it fails?
Yes.
My intention is to use the text_as_html field from the metadata to create dataframes. Since I know that text_as_html might not always be well-formed, or might include noise, I repartition it with partition_html.
Regarding a possible document, I am unfortunately unable to share the source document for this issue. I will try to create a new document clean of confidential data and reproduce the issue to share it with you.
@huangpan2507 answering your questions:
I am using unstructured API to extract the nodes from the documents. The current API version I am using is v0.0.73. This is the docker image I am pulling:
docker pull quay.io/unstructured-io/unstructured-api:0.0.73
To repartition, I am using unstructured 0.15.0
Thank you so much everyone for your support.