causal-agent / scraper

HTML parsing and querying with CSS selectors
https://docs.rs/scraper
ISC License

Save source code `ElementRef::html()` #143

Closed BlackSoulHub closed 10 months ago

BlackSoulHub commented 10 months ago

The string returned by html() is equivalent to the source as HTML markup, but it is not identical to it as a string. Example:

<table cellspacing="1" border="0" width="100%" class="inf">
// content
</table>

and

<table border="0" class="inf" cellspacing="1" width="100%">
// content
</table>

These two snippets are equivalent as HTML markup, but they do not compare equal as strings (or as hashes). Is it possible to change this? Or is there a good way to handle it (preferably without using regex)?

adamreichold commented 10 months ago

The internal TreeSink by which this crate integrates with the html5ever HTML parser does not transport the actual source text and/or positions, so we cannot reproduce them.

You can enable the deterministic feature of this crate to retain the order of the attributes, but this is not the only thing that could change compared to the source text, so attribute order alone may not be enough to ensure that comparing/hashing strings yields stable results.
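If the goal is only to decide whether two elements carry the same tag and attributes, one option is to avoid comparing serialized strings and compare a canonical key instead. The following is a minimal sketch using only the standard library. `canonical_key` is a hypothetical helper, not part of scraper; it assumes the tag name and the (name, value) attribute pairs have already been extracted, for example from `ElementRef::value().name()` and `ElementRef::value().attrs()`.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper (not part of scraper): builds an order-independent
/// key for one element from its tag name and (name, value) attribute pairs.
fn canonical_key<'a>(
    tag: &'a str,
    attrs: impl IntoIterator<Item = (&'a str, &'a str)>,
) -> (&'a str, BTreeMap<&'a str, &'a str>) {
    // A BTreeMap stores its keys in sorted order, so the order in which
    // the attributes were written in the source no longer matters.
    (tag, attrs.into_iter().collect())
}

fn main() {
    // The two <table> start tags from the example above.
    let a = canonical_key(
        "table",
        [("cellspacing", "1"), ("border", "0"), ("width", "100%"), ("class", "inf")],
    );
    let b = canonical_key(
        "table",
        [("border", "0"), ("class", "inf"), ("cellspacing", "1"), ("width", "100%")],
    );

    // Same tag and same attribute set, so the keys compare equal.
    assert_eq!(a, b);
}
```

The key implements `Hash` as well as `PartialEq`, so it can also be fed to a hasher. This sketch only covers a single element's tag and attributes; child nodes and text would need the same kind of treatment if the whole subtree has to match.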

BlackSoulHub commented 10 months ago

> You can enable the deterministic feature of this crate

Thank you! I'll try to do that later.

> but this is not the only thing that could change compared to the source text

Could you give me a hint? What else can change, and what should I pay attention to?

BlackSoulHub commented 10 months ago

This can be considered closed. The answer above worked for me.