ameenmaali / urldedupe

Pass in a list of URLs with query strings, get back a unique list of URLs and query string combinations
MIT License
331 stars 53 forks source link

Move to hashing instead of generating URL keys #17

Closed larskraemer closed 12 months ago

larskraemer commented 4 years ago

Instead of producing a string for each URL to use as a key, it is sensible to produce a hash. I pulled in the SpookyV2 hash library, which is reasonably compact and public domain, and produces 128-bit hashes. We definitely need a 128-bit or greater hash, as a 64-bit hash will probably collide after about 2^32 inputs, which would be about 256 GB of URLs with an average length of 64 bytes. This may have been fine, but to be safe, a 128-bit hash function lets us process ~1'000'000 PB of URLs, which is definitely fine. I also changed the is_asset and is_number functions to accept string_views, since that will work with both strings and string_views.

This PR depends on my last one. That is my mistake, and I don't know how to fix it.. Let me know if you need me to do something about that.

larskraemer commented 4 years ago

I updated this with a better input method which requires fewer allocations. @ameenmaali

ameenmaali commented 4 years ago

Hey @larskraemer, thanks so much for the work here :D - I have been super busy/offline the past couple weeks as you've seen. I'll take a look at this when I get back to it in a few days. As a side note/first glance, is there any solution that doesn't require 3rd party code? That was one of my goals building this (hence why a lot of stuff could probably be easily done in an existing library). I've checked out the std::hash lib a bit, but not sure if it's got some issues. No worries if you think this is better, this looks pretty solid to me at the moment.

larskraemer commented 4 years ago

@ameenmaali I noticed :D As for not using third party libraries; As mentioned above, we probably want a 128 bit hash. I don't think the standard library provides one, so we're stuck with either writing our own, or using somebody else's solution. As for writing your own hash function, that's hard, and the result will probably be pretty bad (not because you or I are dumb, but because, as mentioned, it is hard to write a good hash function). I get that this is a learning project, but in this case, I think the best lesson is probably "Don't write your own hash function" :D