Is it possible to read an element's contents when it's encountered?

cloudflare / lol-html

Low output latency streaming HTML parser/rewriter with CSS selector-based API

https://crates.io/crates/lol-html

BSD 3-Clause "New" or "Revised" License

Is it possible to read an element's contents when it's encountered? #140

Closed by gjtorikian 2 years ago

gjtorikian commented 2 years ago

I am guessing, because of the way the parser is architected, that this is a big No (I've been poking at it for a week or so without much luck).

Here's my use case. Say I encounter this piece of HTML:

<h1>A heading!</h1>

I want my element_handler to call element.prepend to transform this into:

<a id="a-heading"></a><h1>A heading!</h1>

so that I can link to the heading as part of a TOC (#a-heading). But in order to do this, I would need to get the text contents of the h1 element first.

Like I said I'm fairly certain this is not possible given the parser's design, but I'm hoping to be wrong!
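For illustration, deriving the id from the heading text might look like the sketch below. The slugify helper is hypothetical (it is not part of lol-html), and the rule it applies (lowercase ASCII letters and digits, everything else collapsed into hyphens) is only an assumption chosen to turn "A heading!" into "a-heading".

```rust
/// Hypothetical helper: turn heading text into an anchor id.
/// Keeps ASCII letters and digits (lowercased) and collapses every run of
/// other characters into a single hyphen.
fn slugify(text: &str) -> String {
    let mut slug = String::new();
    let mut pending_hyphen = false;
    for ch in text.chars() {
        if ch.is_ascii_alphanumeric() {
            // Emit a separator only between two kept characters, so the
            // slug never starts or ends with a hyphen.
            if pending_hyphen && !slug.is_empty() {
                slug.push('-');
            }
            slug.push(ch.to_ascii_lowercase());
            pending_hyphen = false;
        } else {
            pending_hyphen = true;
        }
    }
    slug
}

fn main() {
    println!("{}", slugify("A heading!")); // prints "a-heading"
}
```

The point is that slugify needs the complete heading text as input, which is exactly what is not yet available when the h1 start tag is seen.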

gjtorikian commented 2 years ago

Humble bump here 🙏

jyn514 commented 2 years ago

You can read the text contents, and you can modify the a element, but you can't read the text contents before you see the a element. Doing so would require us to buffer arbitrarily large amounts of HTML.

gjtorikian commented 2 years ago

Is there any way for the text_handler (or I suppose, each text chunk) to know its parent element, though?

jyn514 commented 2 years ago

No, that also requires arbitrary buffering because there can be multiple text chunks within an element.

It may be useful to read our blog post about why lol-html was originally written: https://blog.cloudflare.com/html-parsing-1/ The basic idea is that any incoming data can be delayed arbitrarily, and we want to buffer as little as possible (both to save memory and to decrease latency for the client), so we only expose information that can be computed independently of other incoming packets.
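To make the buffering point concrete, here is a small stdlib-only model. It does not use lol-html: the Event enum and HeadingBuffer type are hypothetical stand-ins for a stream of parser events. It shows that the full text of an h1 is only known at the end tag, after every chunk has arrived.

```rust
/// Hypothetical streaming events; these are not lol-html types.
#[derive(Clone, Copy)]
enum Event<'a> {
    StartTag(&'a str),
    TextChunk(&'a str),
    EndTag(&'a str),
}

/// Accumulates the text of an h1 and releases it only at the end tag.
#[derive(Default)]
struct HeadingBuffer {
    open: bool,
    text: String,
}

impl HeadingBuffer {
    /// Returns Some(full text) only when the closing h1 tag arrives.
    fn feed(&mut self, event: Event<'_>) -> Option<String> {
        match event {
            Event::StartTag("h1") => {
                // Nothing is known yet: no text chunk has been seen.
                self.open = true;
                self.text.clear();
                None
            }
            Event::TextChunk(chunk) if self.open => {
                // One text node may be split across several chunks.
                self.text.push_str(chunk);
                None
            }
            Event::EndTag("h1") if self.open => {
                self.open = false;
                Some(std::mem::take(&mut self.text))
            }
            _ => None,
        }
    }
}

fn main() {
    let events = [
        Event::StartTag("h1"),
        Event::TextChunk("A hea"),
        Event::TextChunk("ding!"),
        Event::EndTag("h1"),
    ];
    let mut buffer = HeadingBuffer::default();
    for event in events {
        if let Some(text) = buffer.feed(event) {
            println!("{}", text); // prints "A heading!"
        }
    }
}
```

In this model, anything that must be written before the h1 start tag would have to wait until the end tag, so the whole element would need to be held back in memory.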

jyn514 commented 2 years ago

If you need this functionality I suggest using html5ever and kuchiki instead; they have substantially worse performance but a much more flexible API.

bglw commented 2 years ago

+1 for kuchiki if you really need to make updates to a document as a whole based on later contents. If you don't need to update the parent element, and you just need the text handler to know its parent element:

I have had luck with an element handler that pushes some info to a stack, an end tag handler that pops the stack, and a text handler that can then infer what element it must be reading from. Essentially tracking a linear selector path alongside lol-html parsing.
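A minimal sketch of that stack idea, written against a hypothetical event stream and not against lol-html's actual handler API (the Event enum and attribute_text function are made up for illustration): start tags push, end tags pop, and each text chunk is labelled with whatever path is currently open.

```rust
/// Hypothetical streaming events; these are not lol-html types.
#[derive(Clone, Copy)]
enum Event<'a> {
    StartTag(&'a str),
    Text(&'a str),
    EndTag,
}

/// Pairs every text chunk with the path of elements open when it arrived.
fn attribute_text(events: &[Event<'_>]) -> Vec<(String, String)> {
    let mut stack: Vec<&str> = Vec::new();
    let mut out = Vec::new();
    for event in events {
        match *event {
            // "Element handler": remember that this element is now open.
            Event::StartTag(name) => stack.push(name),
            // "End tag handler": the innermost element has closed.
            Event::EndTag => {
                stack.pop();
            }
            // "Text handler": the top of the stack is the parent element.
            Event::Text(chunk) => out.push((stack.join(" > "), chunk.to_string())),
        }
    }
    out
}

fn main() {
    // Models: <article><h1>A <em>heading</em>!</h1></article>
    let events = [
        Event::StartTag("article"),
        Event::StartTag("h1"),
        Event::Text("A "),
        Event::StartTag("em"),
        Event::Text("heading"),
        Event::EndTag,
        Event::Text("!"),
        Event::EndTag,
        Event::EndTag,
    ];
    for (path, chunk) in attribute_text(&events) {
        println!("{}: {:?}", path, chunk);
    }
}
```

Only the names of the open elements are kept, so memory grows with nesting depth and not with document size.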

gjtorikian commented 2 years ago

Ohhh, I'll try this before pursuing another crate. Thanks y'all.

bglw commented 2 years ago

@gjtorikian ping me here or elsewhere if you want some code samples — my implementation is a little bit of a pile-on mess but contains the meat of it. Specifically this bit might catch you out: https://github.com/CloudCannon/pagefind/blob/main/pagefind/src/fossick/parser.rs#L256-L259

(Adding an end tag handler for a void element such as img, which never has an end tag, will error, so you need to either skip pushing it to the stack or pop it immediately.)
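A rough illustration of the "skip pushing" option, again using a hypothetical event model and not lol-html's API. The VOID_ELEMENTS list, is_void, and parents_of_text are illustrative names, and tag names are assumed to be lowercase already.

```rust
/// Void elements per the HTML standard; they never produce an end tag.
const VOID_ELEMENTS: [&str; 13] = [
    "area", "base", "br", "col", "embed", "hr", "img", "input", "link", "meta",
    "source", "track", "wbr",
];

/// Assumes `name` is already lowercase.
fn is_void(name: &str) -> bool {
    VOID_ELEMENTS.iter().any(|v| *v == name)
}

/// Hypothetical streaming events; these are not lol-html types.
#[derive(Clone, Copy)]
enum Event<'a> {
    StartTag(&'a str),
    Text(&'a str),
    EndTag,
}

/// Pairs each text chunk with its parent element, skipping void elements.
fn parents_of_text(events: &[Event<'_>]) -> Vec<(String, String)> {
    let mut stack: Vec<&str> = Vec::new();
    let mut out = Vec::new();
    for event in events {
        match *event {
            Event::StartTag(name) => {
                // A void element would never be popped, so never push it.
                if !is_void(name) {
                    stack.push(name);
                }
            }
            Event::EndTag => {
                stack.pop();
            }
            Event::Text(chunk) => {
                let parent = stack.last().copied().unwrap_or("");
                out.push((parent.to_string(), chunk.to_string()));
            }
        }
    }
    out
}

fn main() {
    // Models: <p><img>caption</p>
    let events = [
        Event::StartTag("p"),
        Event::StartTag("img"),
        Event::Text("caption"),
        Event::EndTag,
    ];
    for (parent, chunk) in parents_of_text(&events) {
        println!("{}: {}", parent, chunk); // prints "p: caption"
    }
}
```

Without the is_void check, img would stay on the stack forever and every later text chunk would be attributed to it.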

gjtorikian commented 2 years ago

Reporting back to say that keeping a stack of elements observed works great for my case. Thanks for the suggestion, @bglw !