Closed gjtorikian closed 2 years ago
Humble bump here 🙏
You can read the text contents, and you can modify an element, but you can't read its full text contents before you see its end tag. Doing so would require us to buffer arbitrarily large amounts of HTML.
Is there any way for the `text_handler` (or I suppose, each text chunk) to know its parent element, though?
No, that also requires arbitrary buffering because there can be multiple text chunks within an element.
It may be useful to see our blog post about why lol-html was originally written: https://blog.cloudflare.com/html-parsing-1/ The basic idea is that any incoming data can have arbitrary delay, and we want to buffer as little as possible (both to save memory and to decrease latency for the client), so we only expose information that can be computed independently of other incoming packets.
If you need this functionality I suggest using `html5ever` and `kuchiki` instead; they have substantially worse performance but a much more flexible API.
+1 for `kuchiki` if you really need to make updates to a document as a whole based on later contents. If you don't need to update the parent element, and you just need the text handler to know its parent element:
I have had luck with an element handler that pushes some info to a stack, an end tag handler that pops the stack, and a text handler that can then infer what element it must be reading from. Essentially tracking a linear selector path alongside `lol-html` parsing.
Ohhh, I'll try this before pursuing another crate. Thanks y'all.
@gjtorikian ping me here or elsewhere if you want some code samples — my implementation is a little bit of a pile-on mess but contains the meat of it. Specifically this bit might catch you out: https://github.com/CloudCannon/pagefind/blob/main/pagefind/src/fossick/parser.rs#L256-L259
(Adding an end tag handler for something like an img will error, so you need to either skip pushing that to the stack or immediately pop)
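The stack idea above, including the caveat about elements like `img`, can be sketched without any parser at all. This is a minimal illustration, not lol-html's API: the `Event` enum is a hypothetical stand-in for the element, text, and end tag handlers firing in document order, and `text_with_parents` and the `VOID_ELEMENTS` list are names made up for this example.

```rust
/// Hypothetical stand-in for the callbacks a streaming rewriter fires.
/// This is NOT lol-html's API; it only models the order of events.
#[derive(Debug, Clone)]
pub enum Event {
    Start(&'static str),
    Text(&'static str),
    End(&'static str),
}

/// Void elements never produce an end tag, so they are never pushed.
/// (Assumed list, taken from the HTML void element set.)
const VOID_ELEMENTS: &[&str] = &[
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
];

/// Pairs every text chunk with the path of elements that were open
/// when the chunk arrived, e.g. ("article > h1", "A ").
pub fn text_with_parents(events: &[Event]) -> Vec<(String, String)> {
    let mut stack: Vec<&str> = Vec::new();
    let mut seen = Vec::new();
    for event in events {
        match event {
            // "Element handler": remember the element unless it is void.
            Event::Start(tag) => {
                if !VOID_ELEMENTS.contains(tag) {
                    stack.push(*tag);
                }
            }
            // "Text handler": the current stack is the parent path.
            Event::Text(chunk) => {
                seen.push((stack.join(" > "), (*chunk).to_string()));
            }
            // "End tag handler": pop only if the tag matches the top.
            Event::End(tag) => {
                if stack.last() == Some(tag) {
                    stack.pop();
                }
            }
        }
    }
    seen
}
```

Feeding it the events for an `h1` split into two text chunks yields two entries with the same `article > h1` path, which is all the text handler needs to know where it is. Skipping void elements at push time is one of the two options mentioned above; popping immediately would work equally well.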
Reporting back to say that keeping a stack of elements observed works great for my case. Thanks for the suggestion, @bglw !
I am guessing, because of the way the parser is architected, that this is a big No (I've been poking at it for a week or so without much luck).
Here's my use case. Say I encounter this piece of HTML:
I want my `element_handler` to call `element.prepend` to transform this into:
so that I can link to the heading as part of a TOC (`#a-heading`). But in order to do this, I would need to get the text contents of the `h1` element first.
Like I said, I'm fairly certain this is not possible given the parser's design, but I'm hoping to be wrong!
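To make the ordering problem concrete, here is a small sketch of why the anchor is only known late. It assumes a heading whose text is `A heading` (chosen to match the `#a-heading` link above) and a simple slug rule; `slugify` and `HeadingBuffer` are hypothetical helpers for illustration, not part of lol-html.

```rust
/// Turns heading text into an anchor slug for a TOC link.
/// Assumed rule: ASCII letters and digits are lowercased, and every
/// run of other characters collapses into a single hyphen.
pub fn slugify(text: &str) -> String {
    let mut slug = String::new();
    for ch in text.chars() {
        if ch.is_ascii_alphanumeric() {
            slug.push(ch.to_ascii_lowercase());
        } else if !slug.is_empty() && !slug.ends_with('-') {
            slug.push('-');
        }
    }
    // Drop a hyphen left behind by trailing punctuation or spaces.
    slug.trim_end_matches('-').to_string()
}

/// Collects the text chunks of one heading as they stream in.
#[derive(Default)]
pub struct HeadingBuffer {
    text: String,
}

impl HeadingBuffer {
    /// Called once per text chunk; a heading may arrive in several.
    pub fn push_chunk(&mut self, chunk: &str) {
        self.text.push_str(chunk);
    }

    /// Called at the end tag: only now is the full text available.
    pub fn finish(self) -> String {
        slugify(&self.text)
    }
}
```

The slug can only be computed in `finish`, that is, after the last chunk, whereas the element handler runs at the start tag before any chunk has been seen. That gap is the buffering the maintainers describe.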