KWARC / rust-libxml

Rust wrapper for libxml2
https://crates.io/crates/libxml
MIT License

No error on malformed html? #14

Bertus-W closed this issue 6 years ago

Bertus-W commented 7 years ago

Is it right that the library doesn't give an error on malformed HTML? I was trying to scrape a website, but I wasn't able to find my elements with XPath. I did succeed on the same website using Python's BeautifulSoup in combination with lxml.

dginev commented 7 years ago

I think you have hit a part of the wrapper that is still unstable (as most of it is), and there is a comment in the source that describes exactly what you have seen: https://github.com/KWARC/rust-libxml/blob/master/src/parser.rs#L145

There is a method you can call to check for validity, but currently no error is returned when parsing malformed input. You are most welcome to file a PR with your desired behavior; help is very welcome!
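To make the "desired behavior" more concrete, here is a minimal, self-contained sketch of the check-then-parse pattern: run a validity check first and turn a failure into an `Err` instead of silently recovering. Everything named here (`StrictParseError`, `tags_balanced`, `parse_strict`) is hypothetical and not part of the crate. In particular, `tags_balanced` is a toy stand-in that only checks tag nesting; it is not how libxml2 decides whether input is well-formed, and it ignores void elements, comments containing `<`, and similar cases.

```rust
/// Hypothetical error type for this sketch; not part of the libxml crate.
#[derive(Debug, PartialEq)]
enum StrictParseError {
    Malformed(String),
}

/// Toy stand-in for a validity check: only verifies that tags are nested
/// and closed consistently. Assumption: no void elements such as <br>.
fn tags_balanced(input: &str) -> Result<(), String> {
    let mut stack: Vec<String> = Vec::new();
    let mut rest = input;
    while let Some(start) = rest.find('<') {
        let after = &rest[start + 1..];
        let end = match after.find('>') {
            Some(e) => e,
            None => return Err("unterminated tag".to_string()),
        };
        let tag = after[..end].trim();
        rest = &after[end + 1..];
        // Skip doctype/comment openers, processing instructions, self-closing tags.
        if tag.starts_with('!') || tag.starts_with('?') || tag.ends_with('/') {
            continue;
        }
        if let Some(name) = tag.strip_prefix('/') {
            let name = name.trim().to_ascii_lowercase();
            match stack.pop() {
                Some(open) if open == name => {}
                Some(open) => return Err(format!("expected </{}>, found </{}>", open, name)),
                None => return Err(format!("unexpected </{}>", name)),
            }
        } else {
            // Keep only the element name, dropping any attributes.
            let name = tag.split_whitespace().next().unwrap_or("").to_ascii_lowercase();
            stack.push(name);
        }
    }
    match stack.pop() {
        Some(open) => Err(format!("unclosed <{}>", open)),
        None => Ok(()),
    }
}

/// Run the check first; only call the (lenient) parser if the check passes.
/// `parse` is whatever parsing call the caller would normally make.
fn parse_strict<T>(input: &str, parse: impl FnOnce(&str) -> T) -> Result<T, StrictParseError> {
    tags_balanced(input).map_err(StrictParseError::Malformed)?;
    Ok(parse(input))
}

fn main() {
    // The closure is a placeholder for a real parse call.
    let good = parse_strict("<html><body><p>ok</p></body></html>", |s| s.len());
    let bad = parse_strict("<div><p>oops</div>", |s| s.len());
    println!("{:?}", good);
    println!("{:?}", bad);
}
```

With something along these lines, a scraper would see an explicit error for the malformed page instead of an XPath query that quietly matches nothing. Whether strict failure or lenient recovery should be the default is a design choice left open here.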

dginev commented 6 years ago

Hi @Catman155, I am trying to figure out what to do with this issue.

Did you try running the same code after you passed the page through BeautifulSoup? That is, did you compare the Rust libxml wrapper against Python's lxml on the same cleaned-up input?

Libxml can be peculiar in how it decides to recover from errors, and as mentioned above, the current parser settings are rather minimal: we have not fleshed out the full API there yet.

I am unsure what to do here, since we don't really have a roadmap planned for the crate; it is largely volunteer-driven. If you are no longer interested in this a year later, I will just close it as "wontfix" for now.

dginev commented 6 years ago

Closing here, but feel free to leave a comment/reopen!