James-LG / Skyscraper

Rust library for scraping HTML using XPath expressions
MIT License

Fail to parse html #49

Open NightBlaze opened 2 months ago

NightBlaze commented 2 months ago

While comparing different libraries for parsing HTML, I found that Skyscraper fails in some cases where others (sxd_html, or one written in Swift) work fine.

let link = "https://livejournal.com/";
let response = reqwest::blocking::get(link).expect("load url error");
let html_text = response.text().expect("get html text");
let document = skyscraper::html::parse(&html_text).expect("parse html");

The last line panics with: parse html: EndTagMismatch { end_name: "svg", open_name: "symbol" }

James-LG commented 2 months ago

I'm currently working on a rewrite of the HTML module. It will follow the official HTML standard as defined by https://html.spec.whatwg.org/multipage/parsing.html. Hopefully that will solve your issues.

It's a lot of work, though, so I don't really have an ETA. It depends on how much free time I get.
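For readers following along, here is a minimal sketch of why a strict parser can report an error like the one above while a more forgiving one recovers. It is not Skyscraper's code: `Token`, `check_strict`, `check_lenient`, and the local `EndTagMismatch` struct are hypothetical names made up for illustration, and the input `<svg><symbol></svg>` is only an assumed shape for the markup that triggers the error. The lenient variant is loosely inspired by the end-tag recovery described in the HTML standard, heavily simplified (it ignores "special" elements and the separate rules for foreign content such as SVG).

```rust
/// Simplified tag events; a real tokenizer would produce these from markup.
#[derive(Debug)]
enum Token {
    Open(&'static str),
    Close(&'static str),
}

/// Hypothetical error type, shaped like the error reported in this issue.
#[derive(Debug, PartialEq)]
struct EndTagMismatch {
    end_name: String,
    open_name: String,
}

/// Strict check: every end tag must match the innermost open element.
fn check_strict(tokens: &[Token]) -> Result<(), EndTagMismatch> {
    let mut stack: Vec<&str> = Vec::new();
    for token in tokens {
        match token {
            Token::Open(name) => stack.push(*name),
            Token::Close(name) => match stack.pop() {
                Some(open) if open == *name => {}
                // Mismatch, or an end tag with nothing open (open_name is empty).
                other => {
                    return Err(EndTagMismatch {
                        end_name: (*name).to_string(),
                        open_name: other.unwrap_or("").to_string(),
                    });
                }
            },
        }
    }
    Ok(())
}

/// Lenient check: search the stack for a matching open element. If one is
/// found, close everything above it and then the element itself; otherwise
/// ignore the end tag. Returns notes describing each repair.
fn check_lenient(tokens: &[Token]) -> Vec<String> {
    let mut stack: Vec<&str> = Vec::new();
    let mut repairs = Vec::new();
    for token in tokens {
        match token {
            Token::Open(name) => stack.push(*name),
            Token::Close(name) => {
                if let Some(pos) = stack.iter().rposition(|open| open == name) {
                    // Close the elements nested inside the match, innermost first.
                    for implied in stack.drain(pos + 1..).rev() {
                        repairs.push(format!("implicitly closed <{}>", implied));
                    }
                    stack.pop(); // close the matching element itself
                } else {
                    repairs.push(format!("ignored stray </{}>", name));
                }
            }
        }
    }
    repairs
}

fn main() {
    // Assumed input shape: <svg><symbol></svg>, with </symbol> missing.
    let tokens = [Token::Open("svg"), Token::Open("symbol"), Token::Close("svg")];
    println!("{:?}", check_strict(&tokens));
    println!("{:?}", check_lenient(&tokens));
}
```

On the assumed input, the strict check stops at `</svg>` because `symbol` is still the innermost open element, while the lenient check closes `symbol` implicitly and carries on. Browsers and spec-following parsers generally behave like the second variant, which would explain why other libraries accept the same page.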