ameenmaali / urldedupe

Pass in a list of URLs with query strings, get back a unique list of URLs and query string combinations
MIT License
331 stars 53 forks source link

Simplified and optimized the parsers; eliminate decoding for now #8

Closed larskraemer closed 4 years ago

larskraemer commented 4 years ago

Right, this is a pretty big PR. Firstly, I deleted the decoding logic in the parsers, since decoding the URLs seems to create more bugs than it solves. decode is now also a static member of Url, since this allows us to use it to decode any part of the URL, when and if we decide to do so. Also, I removed the is_encoded function, since checking if a string is encoded is not much faster than actually decoding it. encode is gone, too, as it probably should have been since #2. We should never need to re-encode a URL, because we saved the original version of it. Also, encoding a substring of the URL depends on where in the URL it is located, and I feel doing this properly is going to be a lot of code for functionality we do not expect to use. The new parse function is just a bit of a cleaner (and faster, thanks to std::string_view) version of the original one. I used ~4000 links pulled from Wikipedia (for a good mix of links containing any combination of query strings, paths and fragments) to test it against regex_parse, and except for some pathological examples, both methods yield the same results.

ameenmaali commented 4 years ago

Thanks @larskraemer! I'll test this out later. Looks good at first glance, but can you keep the encode() and is_encoded() methods? Maybe just as static methods as you did with decode. There's really no need for it in this project as you've mentioned, but I'd like to keep in for now in case needed in the future, or when extracting the Url class potentially for other purposes. Thanks!

larskraemer commented 4 years ago

No problem. I modified encode a bit to make it a little more... C++