cloudflare / lol-html

Low output latency streaming HTML parser/rewriter with CSS selector-based API
https://crates.io/crates/lol-html
BSD 3-Clause "New" or "Revised" License

How to extract inner HTML of an element? #212

Open Jared-Sprague opened 7 months ago

Jared-Sprague commented 7 months ago

Hello!

I'm trying to use lol_html as an HTML parser to extract all the content within an element, i.e. its inner HTML. However, I haven't figured out the right methods for this yet. Here is what I want to do, given the following HTML snippet:

<article id="main-content">
  <div>
    <p>foo</p>
  </div>
</article>

I want to extract all the content within the element matching the CSS selector article#main-content and store it in a String named content. After this runs, the value of content should be equal to:

  <div>
    <p>foo</p>
  </div>

What would be perfect is a method on the element called get_inner_content(). There seem to be methods for manipulating the inner content, such as set_inner_content and remove_and_keep_content, but none for actually getting it.

Maybe I'm thinking about this wrong. Any help with just getting the inner content of an element would be so helpful!

bglw commented 7 months ago

I'm just a user, not a maintainer here, but I can at least answer and say that there isn't a good way to achieve this.

Everything in lol-html revolves around it being a streaming HTML transformer, and as a result it doesn't hold on to that stream for very long as it passes through. set_inner_content is trivial, since it can just ignore its stream for a while and use the provided content, and likewise remove_and_keep_content doesn't block the stream since it just removes some of the values passing through.

Anything like get_inner_content though would require buffering an arbitrary amount of data from the stream, which would inherently stop it from streaming. If you called get_inner_content on the html element itself, lol-html would cease to be a streaming parser and would have to store the whole document.

There are some solutions for inner text, by adding a text! handler and appending the text to a buffer yourself, but there isn't an equivalent for HTML as far as I'm aware. You would have to create an element! handler and re-serialize the tag/attributes yourself, alongside the text! handler for the text nodes.
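To make the "re-serialize the tag/attributes yourself" step concrete, here is a minimal sketch using only the standard library. The helper names serialize_start_tag and escape_attr are hypothetical and not part of lol-html; the sketch assumes an element handler has already collected the tag name and the attribute name/value pairs as strings. Closing tags would still need separate handling.

```rust
/// Escapes the characters that must not appear raw inside a
/// double-quoted HTML attribute value.
/// (Hypothetical helper, not part of lol-html.)
fn escape_attr(value: &str) -> String {
    // `&` must be replaced first so the `&quot;` entities added
    // afterward are not escaped a second time.
    value.replace('&', "&amp;").replace('"', "&quot;")
}

/// Rebuilds a start tag such as `<article id="main-content">` from a tag
/// name and a list of (attribute name, attribute value) pairs.
/// (Hypothetical helper, not part of lol-html.)
fn serialize_start_tag(name: &str, attrs: &[(&str, &str)]) -> String {
    let mut tag = format!("<{name}");
    for (key, value) in attrs {
        tag.push_str(&format!(" {key}=\"{}\"", escape_attr(value)));
    }
    tag.push('>');
    tag
}
```

Appending the result of serialize_start_tag to the same buffer that the text handler writes into would interleave tags and text in document order, assuming the handlers fire in that order.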

The only viable path I can think of would be to use the el.prepend() and el.append() methods to insert some delimiters into the stream, and then process the output yourself afterward to extract the innerHTML between the delimiters.
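The post-processing half of the delimiter idea can be sketched with the standard library alone. This assumes the rewriter has already inserted two marker strings just inside the matched element; the comment markers used in the usage note below and the function name extract_between are hypothetical choices, not anything lol-html provides.

```rust
/// Returns the slice of `output` between the first `start` marker and the
/// next `end` marker that follows it, or `None` if either marker is missing.
/// (Hypothetical helper, not part of lol-html.)
fn extract_between<'a>(output: &'a str, start: &str, end: &str) -> Option<&'a str> {
    // Index of the first byte after the start marker.
    let from = output.find(start)? + start.len();
    // Offset of the end marker, relative to `from`.
    let len = output[from..].find(end)?;
    Some(&output[from..from + len])
}
```

For example, calling extract_between(&rewritten, "<!--inner-start-->", "<!--inner-end-->") on the rewritten document would return the inner HTML of the marked element. This only works cleanly when the selector matches a single, non-nested element (as with article#main-content), and the markers need to be strings that cannot occur elsewhere in the document.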

For anything more involved, I'd look at using kuchiki(ki) instead.

mwcz commented 7 months ago

Related #40 #78