James-LG / Skyscraper

Rust library for scraping HTML using XPath expressions
MIT License
31 stars, 4 forks

Cannot parse most sites in PHP #9

Closed nerdunit closed 2 years ago

nerdunit commented 2 years ago

The HTML parser prints "Unknown HTML symbol / Unknown HTML symbol / Unknown HTML symbol / Unknown HTML symbol / Unknown HTML symbol / Unknown HTML symbol / Unknown HTML symbol / Unknown HTML symbol / Unknown HTML symbol / Unknown HTML symbol / Unknown HTML symbol / Unknown HTML symbol /" on most PHP-based websites. Example code:

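    use skyscraper::html;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let url = "https://www.ladybirdeducation.co.uk/";
        let client = reqwest::blocking::Client::new();
        // Fetch the page and read the response body as text.
        let body = client.get(url).send()?.text()?;
        println!("body:{}", body);
        // Parsing is the step that prints "Unknown HTML symbol".
        let _document = html::parse(&body)?;
        println!("parsed");
        Ok(())
    }

James-LG commented 2 years ago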

PHP is rendered into HTML server-side, so your error must come from something else. Regardless, I will investigate using the URL you gave.

James-LG commented 2 years ago

I believe your issue was actually related to a < symbol in a [...]
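To check whether a fetched page contains this kind of `<`, a quick scan of the raw body can help before handing it to the parser. The sketch below uses only the standard library. `find_stray_lt` is a hypothetical helper, not part of Skyscraper, and it assumes that a "stray" `<` is one not followed by a letter, `/`, or `!`. Whether this matches the exact condition that triggers "Unknown HTML symbol" is not confirmed in this thread.

```rust
/// Hypothetical helper (not part of Skyscraper): returns the byte offsets of
/// every `<` that does not look like the start of a tag, closing tag,
/// comment, or doctype.
fn find_stray_lt(html: &str) -> Vec<usize> {
    let bytes = html.as_bytes();
    let mut offsets = Vec::new();
    for (i, &b) in bytes.iter().enumerate() {
        if b != b'<' {
            continue;
        }
        // Assumption: a tag-like `<` is followed by a letter, `/`, or `!`.
        let looks_like_tag = match bytes.get(i + 1) {
            Some(&next) => next.is_ascii_alphabetic() || next == b'/' || next == b'!',
            None => false,
        };
        if !looks_like_tag {
            offsets.push(i);
        }
    }
    offsets
}

fn main() {
    // The `<` in "1 < 2" is plain text, not markup.
    let sample = "<p>1 < 2</p>";
    println!("{:?}", find_stray_lt(sample)); // prints [5]
}
```

This heuristic will miss a `<` that is directly followed by a letter (for example `a<b` inside a script), so an empty result does not prove the page is free of such symbols.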
