causal-agent / scraper

HTML parsing and querying with CSS selectors
https://docs.rs/scraper
ISC License
1.81k stars 100 forks source link

Use AHash instead of SipHash for storing attributes and classes. #117

Closed adamreichold closed 1 year ago

adamreichold commented 1 year ago

This has better performance while still being DoS resistant.

This PR is the hasher change broken out from #91 as that part has no interactions with #101 and appears straight forward enough.

j-mendez commented 1 year ago

@adamreichold I use these changes on another project and it works solid. Changes lg2m too.

j-mendez commented 1 year ago

Hashbrown could fit well too depending on env (not the core from std) @adamreichold . That’s what I use with ahash. For the project using one hash made more sense due to the allocators.

adamreichold commented 1 year ago

Hashbrown could fit well too depending on env (not the core from std) @adamreichold . That’s what I use with ahash. For the project using one hash made more sense due to the allocators.

There is little difference here in the end as std's HashMap is Hashbrown and Hashbrown uses AHash as its default hasher. So in the end this would result in the same code being used, only in possibly slightly different versions. That being the case, I chose AHash as it is the smaller dependency compared to Hashbrown (switching just the hasher instead of adding another copy of the same hash table implementation).

cfvescovo commented 1 year ago

LGTM. What should we do with #91?

adamreichold commented 1 year ago

LGTM. What should we do with #91?

Decide whether we want all or parts of #101, then if classes are made to be collected lazily drop the switch to String or if we keep collecting them eagerly merge it.

j-mendez commented 1 year ago

Hashbrown could fit well too depending on env (not the core from std) @adamreichold . That’s what I use with ahash. For the project using one hash made more sense due to the allocators.

There is little difference here in the end as std's HashMap is Hashbrown and Hashbrown uses AHash as its default hasher. So in the end this would result in the same code being used, only in possibly slightly different versions. That being the case, I chose AHash as it is the smaller dependency compared to Hashbrown (switching just the hasher instead of adding another copy of the same hash table implementation).

Hashbrown could fit well too depending on env (not the core from std) @adamreichold . That’s what I use with ahash. For the project using one hash made more sense due to the allocators.

There is little difference here in the end as std's HashMap is Hashbrown and Hashbrown uses AHash as its default hasher. So in the end this would result in the same code being used, only in possibly slightly different versions. That being the case, I chose AHash as it is the smaller dependency compared to Hashbrown (switching just the hasher instead of adding another copy of the same hash table implementation).

Valid, great explanation 🙌.