causal-agent / scraper

HTML parsing and querying with CSS selectors
https://docs.rs/scraper
ISC License

selector can't match 'body' #79

Status: Closed (TsuITOAR closed this issue 2 years ago)

TsuITOAR commented 2 years ago

Hi, I'm trying to match elements in markup like this:

<article class="out">
    <body>
        <article class="in">

        </article>
    </body>
</article>

with this code:

use scraper::{Html, Selector};

type Result<O> = std::result::Result<O, Box<dyn std::error::Error>>;

const HTML: &str = r#"
<article class="out">
    <body>
        <article class="in">

        </article>
    </body>
</article>
"#;

fn main() -> Result<()> {
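    let html = Html::parse_document(HTML);
    // Select an <article> whose parent is a <body> nested inside another <article>.
    let selector = Selector::parse("article > body > article").unwrap();
    // Print the serialized HTML of every match.
    for element in html.select(&selector) {
        println!("\n-----\n{}\n-----\n", element.html());
    }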

    Ok(())
}

The selector 'article > body > article' doesn't match anything,

but 'article > article' outputs:

-----
<article class="in">
        </article>
-----

parse_document and parse_fragment give the same results.

causal-agent commented 2 years ago

<body> isn't valid anywhere except as the second child of <html>, so the HTML parser is dropping it.

TsuITOAR commented 2 years ago

> <body> isn't valid anywhere but as the second child of <html> so the HTML parser is dropping it

Maybe the selector could handle this case when it is built from a str? The content is the response to an API call made while loading a dynamic web page, which should be quite a common situation.

causal-agent commented 2 years ago

Neither is up to scraper. It's behaving as a browser would.
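
Because the parser follows browser behavior here, one possible workaround is to rewrite the stray tags before the text reaches the parser. The following is only a minimal sketch under stated assumptions: rename_body_tags is a hypothetical helper (not part of scraper), the data-tag="body" placeholder attribute is invented for illustration, and the payload is assumed to use lowercase <body> and </body> tags that never appear inside scripts, comments, or attribute values.

```rust
/// Hypothetical helper (NOT part of scraper): rewrites literal `<body ...>` and
/// `</body>` tags into `<div data-tag="body" ...>` and `</div>`, so an HTML5
/// parser keeps that nesting level instead of dropping the stray `<body>`.
///
/// Assumption: plain text substitution is acceptable for the payload. It does
/// not understand comments, scripts, or attribute values containing "<body".
fn rename_body_tags(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    let mut rest = html;

    // Walk from one '<' to the next, copying everything else verbatim.
    while let Some(pos) = rest.find('<') {
        out.push_str(&rest[..pos]);
        let tail = &rest[pos..];

        if tail.starts_with("</body>") {
            // Closing tag: swap it for the placeholder's closing tag.
            out.push_str("</div>");
            rest = &tail["</body>".len()..];
        } else if tail.starts_with("<body>") || tail.starts_with("<body ") {
            // Opening tag: replace only the tag name, keep any attributes.
            out.push_str("<div data-tag=\"body\"");
            rest = &tail["<body".len()..];
        } else {
            // Any other '<' (other tags, or text such as "a < b") is kept.
            out.push('<');
            rest = &tail[1..];
        }
    }

    out.push_str(rest);
    out
}
```

The rewritten string could then be passed to Html::parse_fragment and queried with a selector such as 'article > div[data-tag="body"] > article', which should match the inner article because a div is allowed inside an article.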