cloudflare / lol-html

Low output latency streaming HTML parser/rewriter with CSS selector-based API
https://crates.io/crates/lol-html
BSD 3-Clause "New" or "Revised" License
1.47k stars 82 forks source link

Fails to rewrite simple html image tag #184

Closed lmtr0 closed 9 months ago

lmtr0 commented 1 year ago

Hello there, I've setted up a simple rewrite of src for imgs. However, the parser does not rewrite it. I've noticed that it's not on all websites. One of the websites I'm rewriting is https://www.brybelly.com/home. I can simplify my code to:

// html is the webpage from https://www.brybelly.com/home
 let new_html = rewrite_str(&html, Settings {
        element_content_handlers: vec![
            element!("img", |el| {
                let Some(src) = el.get_attribute("src") else {
                    log::error!("img: Skiping image, no src attribute");
                    return Ok(())
                };
                log::error!("img[src] {src} being processed");
            })
        ],
// .. other configs
})

It does not print anything for the img.

lmtr0 commented 1 year ago

here is the pastebin of the html file: https://pastebin.com/eHej5bG2

lmtr0 commented 1 year ago

it seems to be because the img elements are inside a noscript tag. Why is that?

bglw commented 9 months ago

It looks like lol-html currently parses the <noscript> element as RawText, along with the usual candidates like <style> and <iframe>. That's the correct behavior when parsing/tokenizing HTML with the scripting flag enabled.

It begs a question whether lol-html should be parsing with the scripting flag enabled or disabled. I could see arguments either way, and it would at least be a useful option to have.

I don't know if there would be any real ramifications to turning the scripting flag on — from what I can see it only affects tokenization/parsing of the <noscript> element. The main downside would be breaking/confusing the parser if the <noscript> contains garbage — but since it should contain HTML in the case where scripting is disabled, valid HTML should be a reasonable expectation here.

lmtr0 commented 9 months ago

Got it, thx for the explanation :)