Closed scanny closed 2 months ago
I confirm that #3218 ignores forms, but the revamp in https://github.com/Unstructured-IO/unstructured/pull/3257 does not. Perhaps add a flag to partition_html to control whether forms are ignored? Best regards,
@heralight #3257 adds the new parser but does not install it, so partition_html() still uses the old parser in #3257. This was just to make #3218 more reviewable when rebased onto it, since it will contain a lot of test changes that reflect the fixed behaviors. That rebase should be happening in the next hour or two.
Regarding including <form> content: can you describe the use case you have where you'd like to include that? In general, adding parameters is pretty expensive, so we'd need to want it pretty badly to add a parameter specific to that.
Expensive not so much in initial development cost as in maintenance and overall system complexity: each parameter needs to be added to the broader set supported by the API and SDK, and each one makes the system harder for folks to learn.
We could consider maybe a bitmask parameter that allowed changing default inclusion behaviors, like exclude_html=EH_FORM | EH_HEADER | EH_FOOTER | EH_FIGURE
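To make the bitmask idea concrete, here is a minimal sketch using Python's standard `enum.IntFlag`. The `ExcludeHtml` class, the `TAG_BY_FLAG` mapping, and the `excluded_tags()` helper are hypothetical names for illustration only; nothing like them exists in unstructured today, and only the `EH_*` names come from the comment above.

```python
from enum import IntFlag


class ExcludeHtml(IntFlag):
    """Hypothetical flags; each one occupies its own bit so they can be OR-ed."""

    NONE = 0
    FORM = 1
    HEADER = 2
    FOOTER = 4
    FIGURE = 8


# Module-level aliases matching the names floated in the comment above.
EH_FORM = ExcludeHtml.FORM
EH_HEADER = ExcludeHtml.HEADER
EH_FOOTER = ExcludeHtml.FOOTER
EH_FIGURE = ExcludeHtml.FIGURE

# Assumed mapping from each flag to the HTML tag it would exclude.
TAG_BY_FLAG = {
    EH_FORM: "form",
    EH_HEADER: "header",
    EH_FOOTER: "footer",
    EH_FIGURE: "figure",
}


def excluded_tags(mask: ExcludeHtml) -> list[str]:
    """Return the tag names selected by `mask`, in declaration order."""
    return [tag for flag, tag in TAG_BY_FLAG.items() if flag & mask]


# Usage: exclude forms and footers, but keep headers and figures.
selected = excluded_tags(EH_FORM | EH_FOOTER)
```

A single integer parameter keeps the signature small, at the cost of being limited to the tags that have a predefined flag.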
@scanny Generally you want to parse an HTML page but ignore inputs such as forms, validation texts, etc. For example: https://www.pap.fr/annonces/appartement-bordeaux-33800-r449601705
There you want to ignore parts like the contact sidebar and the bottom section. If you don't, you get text items such as "incorrect email", which adds too much noise. Also, the current parser doesn't link parents correctly in this case, nor does it record the DOM type to allow post-filtering.
From my point of view, we could replace the skip_headers_and_footers argument with something like skip_dom_types, a list of strings naming the DOM types to skip. It would be up to the user to specify them correctly, e.g. "form". It would be called like:
    if self._opts.skip_dom_types:
        etree.strip_elements(root, *self._opts.skip_dom_types, with_tail=False)
Hmm. I like that idea. That gives a lot more flexibility, like strip headers but not footers, also ignore nav elements maybe, etc. :)
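As a rough illustration of the skip_dom_types proposal, the sketch below calls lxml's `etree.strip_elements()` on a small fragment. The `strip_dom_types()` helper and the sample markup are made up for this example and are not part of partition_html(). Passing `with_tail=False` keeps the text that follows a removed element.

```python
from lxml import etree, html

# Made-up fragment: a listing paragraph, a contact form, trailing text, a nav bar.
SAMPLE = (
    "<div><p>Flat for rent.</p>"
    "<form><label>Email</label><input name='e'/></form> Call us today."
    "<nav><a href='/'>Home</a></nav></div>"
)


def strip_dom_types(html_text, skip_dom_types):
    """Parse `html_text` and drop every element whose tag is in `skip_dom_types`."""
    root = html.fromstring(html_text)
    if skip_dom_types:
        # with_tail=False removes the element and its children but keeps
        # the text that follows the element's closing tag.
        etree.strip_elements(root, *skip_dom_types, with_tail=False)
    return root


root = strip_dom_types(SAMPLE, ["form", "nav"])
# Collapse whitespace so the remaining text is easy to inspect.
text = " ".join(root.text_content().split())
```

In this sketch the form label and the nav link disappear from `text`, while "Call us today.", which trails the form, is kept.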
Summary
partition_html() produces document elements for the contents of <form> elements, including form controls like <textarea>.
To Reproduce
Expected:
Actual:
Additional context
Fixed by #3218. Recorded here to explain ingest test output changes and to inform CHANGELOG.