Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug(html): form and form controls are not ignored #3247

Closed scanny closed 2 months ago

scanny commented 3 months ago

Summary: partition_html() produces document elements for the contents of <form> elements, including form controls like <textarea>.

To Reproduce

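from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json

html_text = """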
<body>
  <form>
    <p>
      <label for="filename">Filename</label><br />
      <input type="text" name="filename" value="testfile.txt" id="filename" />
    </p>
    <p>
      <label for="data">File Contents</label><br />
      <textarea cols="60" rows="10" name="data" id="data">
        Whatever you put in this text box will be downloaded and saved in the file.
        If you leave it blank, no file will be downloaded.
      </textarea>
    </p>
    <p id="downloadify">You must have Flash 10 installed to download this file.</p>
  </form>
</body>
"""

elements = partition_html(text=html_text)
print(elements_to_json(elements, indent=2))

Expected:

[]

Actual:

[
  {
    "element_id": "9f4025fc69bda08e8b0f69052d615c6a",
    "metadata": {
      "category_depth": 0,
      "filetype": "text/html",
      "languages": [
        "eng"
      ]
    },
    "text": "Filename",
    "type": "Title"
  },
  {
    "element_id": "c5de569ff8ea14341c6835380ae221d3",
    "metadata": {
      "category_depth": 0,
      "filetype": "text/html",
      "languages": [
        "eng"
      ]
    },
    "text": "File Contents",
    "type": "Title"
  },
  {
    "element_id": "97e875b3ace86b3b5c23dd0facaec339",
    "metadata": {
      "filetype": "text/html",
      "languages": [
        "eng"
      ],
      "parent_id": "c5de569ff8ea14341c6835380ae221d3"
    },
    "text": "Whatever you put in this text box will be downloaded and saved in the file.\n        If you leave it blank, no file will be downloaded.",
    "type": "NarrativeText"
  },
  {
    "element_id": "af8b3b8e50388aefd0ed98c0d74b66aa",
    "metadata": {
      "filetype": "text/html",
      "languages": [
        "eng"
      ],
      "parent_id": "c5de569ff8ea14341c6835380ae221d3"
    },
    "text": "You must have Flash 10 installed to download this file.",
    "type": "NarrativeText"
  }
]

Additional context: Fixed by #3218. Recorded here to explain ingest test output changes and to inform CHANGELOG.

heralight commented 3 months ago

I confirm that #3218 ignores forms, but the revamp in https://github.com/Unstructured-IO/unstructured/pull/3257 does not. Perhaps add a flag to partition_html() to control whether forms are ignored? Best regards,

scanny commented 3 months ago

@heralight #3257 adds the new parser but does not install it. So partition_html() still uses the old parser in #3257. This was just to make #3218 more reviewable when rebased onto it since it will contain a lot of test changes that reflect the fixed behaviors. That rebase should be happening in the next hour or two.

Regarding including <form> content: can you describe the use case you have where you'd like to include that? In general, adding parameters is pretty expensive, so we'd need to want it pretty bad to add a parameter specific to that.

Expensive not so much in the sense of initial development cost. Each new parameter needs to be added to the broader set supported by the API and SDK, and each one makes the system harder to learn, so the cost is in maintenance and overall system complexity.

We could maybe consider a bitmask parameter that allows changing the default inclusion behaviors, like exclude_html=EH_FORM | EH_HEADER | EH_FOOTER | EH_FIGURE
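For illustration, here is a minimal sketch of how such a bitmask could be modeled in Python with enum.IntFlag. The ExcludeHtml class, its member names, the tag mapping, and tags_to_exclude() are all hypothetical; none of them exist in unstructured, and only the exclude_html idea comes from the comment above.

```python
from enum import IntFlag


class ExcludeHtml(IntFlag):
    """Hypothetical flags mirroring EH_FORM | EH_HEADER | ... (not in unstructured)."""

    FORM = 1
    HEADER = 2
    FOOTER = 4
    FIGURE = 8


# Assumed mapping from each flag to the HTML tag it would exclude.
_TAG_BY_FLAG = {
    ExcludeHtml.FORM: "form",
    ExcludeHtml.HEADER: "header",
    ExcludeHtml.FOOTER: "footer",
    ExcludeHtml.FIGURE: "figure",
}


def tags_to_exclude(flags: ExcludeHtml) -> list[str]:
    """Return the tag names selected by the combined flags, in declaration order."""
    return [tag for flag, tag in _TAG_BY_FLAG.items() if flag & flags]


# Combine flags with | exactly as in the proposed call signature.
selected = tags_to_exclude(ExcludeHtml.FORM | ExcludeHtml.FOOTER)
```

A bitmask keeps the option to a single parameter, but every new excludable tag needs a new flag, which is the maintenance cost described above.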

heralight commented 3 months ago

@scanny generally you want to parse an HTML page but ignore inputs such as forms, validation texts, etc. For example: https://www.pap.fr/annonces/appartement-bordeaux-33800-r449601705

You want to ignore parts like the contact sidebars and the bottom section. If you don't, you will get an "incorrect email" text item, for example, and it adds too much noise. Also, the current parser didn't link parents correctly in this case, nor did it expose the DOM type to allow post-filtering.

From my point of view, we could replace the skip_headers_and_footers argument with something like skip_dom_types, a list of strings naming the DOM types to skip. It is up to the user to specify them correctly, like "form". It would be called like:

if self._opts.skip_dom_types:
    etree.strip_elements(root, *self._opts.skip_dom_types, with_tail=False)
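As a standalone illustration of that idea, the sketch below applies lxml's etree.strip_elements() to a small document. The strip_dom_types() helper and the skip_dom_types argument name are hypothetical and taken from the proposal; this is not how partition_html() is implemented.

```python
from lxml import etree


def strip_dom_types(root, skip_dom_types):
    """Remove every element whose tag is listed in skip_dom_types.

    with_tail=False keeps the text that follows a removed element,
    since that text belongs to the surrounding content.
    """
    if skip_dom_types:
        etree.strip_elements(root, *skip_dom_types, with_tail=False)
    return root


html_text = "<body><p>Keep me.</p><form><p>Drop me.</p></form>after</body>"
root = etree.fromstring(html_text, etree.HTMLParser())

# Tags that do not occur in the document (here "nav") are simply ignored.
strip_dom_types(root, ["form", "nav"])
remaining_text = "".join(root.itertext())
```

Because the caller passes plain tag names, the same helper covers cases like stripping headers but not footers without adding a new parameter for each tag.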

scanny commented 2 months ago

Hmm. I like that idea. That gives a lot more flexibility, like strip headers but not footers, also ignore nav elements maybe, etc. :)