causal-agent / scraper

HTML parsing and querying with CSS selectors
https://docs.rs/scraper
ISC License

Selector::parse transforms &gt; etc. into > etc. #78

Temeez closed this issue 2 years ago

Temeez commented 2 years ago

The following code prints the output shown below it.

    use scraper::{Html, Selector};

    fn main() {
        let data = "<html><body><p>Foo bar &lt;&gt; Cake ! Heh &le;</p></body></html>";
        let html = Html::parse_document(data);
        let selector = Selector::parse("p").unwrap();
        // Debug-print the raw input, then the first matching <p> element.
        println!("{:?}", data);
        println!("{:?}", html.select(&selector).next());
    }

Output:

    "<html><body><p>Foo bar &lt;&gt; Cake ! Heh &le;</p></body></html>"
    Some(ElementRef { node: NodeRef { id: NodeId(5), tree: Tree { vec: [Node { parent: None, prev_sibling: None, next_sibling: None, children: Some((NodeId(2), NodeId(2))), value: Document }, Node { parent: Some(NodeId(1)), prev_sibling: None, next_sibling: None, children: Some((NodeId(3), NodeId(4))), value: Element(<html>) }, Node { parent: Some(NodeId(2)), prev_sibling: None, next_sibling: Some(NodeId(4)), children: None, value: Element(<head>) }, Node { parent: Some(NodeId(2)), prev_sibling: Some(NodeId(3)), next_sibling: None, children: Some((NodeId(5), NodeId(5))), value: Element(<body>) }, Node { parent: Some(NodeId(4)), prev_sibling: None, next_sibling: None, children: Some((NodeId(6), NodeId(6))), value: Element(<p>) }, Node { parent: Some(NodeId(5)), prev_sibling: None, next_sibling: None, children: None, value: Text("Foo bar <> Cake ! Heh ≤") }] }, node: Node { parent: Some(NodeId(4)), prev_sibling: None, next_sibling: None, children: Some((NodeId(6), NodeId(6))), value: Element(<p>) } } })

The issue is this part: Text("Foo bar <> Cake ! Heh ≤"). I expected Text("Foo bar &lt;&gt; Cake ! Heh &le;").

Is this a bug?

adumbidiot commented 2 years ago

This is probably HTML entity decoding: https://developer.mozilla.org/en-US/docs/Glossary/Entity

causal-agent commented 2 years ago

The entities are being parsed, as they should be. If you want the text serialized back to HTML, try .inner_html().

Temeez commented 2 years ago

Hmm, yes, .inner_html() gives "Foo bar &lt;&gt; Cake ! Heh ≤". It didn't handle the &le;, though, so maybe that's a bug?

Anyway, I figured out what to do, so thank you!

causal-agent commented 2 years ago

Working as intended
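
The behaviour discussed in this thread can be sketched without the scraper crate. The parser turns character references into the characters they stand for. Serializing back to HTML re-escapes only the few characters that must be escaped in text content (`&`, U+00A0, `<`, `>`), so `≤` stays a literal character instead of becoming `&le;` again. The helper names `decode_entities` and `escape_text` below are hypothetical and exist only for illustration. They are not part of scraper or html5ever, and the decoder handles only the three references from the sample input.

```rust
// Hypothetical helpers for illustration only; not the scraper API.

/// Decode the three character references that appear in the issue's sample.
/// (Assumption: a real HTML parser covers the full named-reference table.)
fn decode_entities(input: &str) -> String {
    input
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&le;", "\u{2264}") // U+2264 is '≤'
}

/// Escape text-node content as HTML serialization does:
/// only `&`, U+00A0 (no-break space), `<` and `>` are replaced.
fn escape_text(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '\u{00A0}' => out.push_str("&nbsp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            other => out.push(other), // '≤' and everything else pass through
        }
    }
    out
}

fn main() {
    let source = "Foo bar &lt;&gt; Cake ! Heh &le;";
    let parsed = decode_entities(source); // what a text node would hold
    let serialized = escape_text(&parsed); // what serialization would emit
    println!("{}", parsed);
    println!("{}", serialized);
}
```

The round trip is not an identity on the source bytes, but both strings describe the same text once parsed, which is consistent with the "working as intended" answer above.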