Hello Ben,
Thanks for the PR, it looks good! I'm unsure at the moment if it feels right to have this in the package. One thing is the helper flags `FlagsAllGreedy` and `FlagsAllNonGreedy`; those were poorly thought out and kind of prevent adding new flags in a sensible way (either you break the old behaviour of those flags by or-ing in the new one, or you break the semantics by leaving it out - it's not "All" anymore).
Also, I would think a common need may be to remove some well-known, referral-type query strings while keeping others? The new flag wouldn't help in that case. I'm thinking out loud here, but maybe all that's missing is a `FlagCustom`, or a new `NormalizeURLFunc(*url.URL, NormalizationFlags, func(*url.URL))` that calls the func after applying the provided normalization flags, so you can do what you want with the `*url.URL` before returning the normalized string.
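For what it's worth, such a helper would be tiny. A minimal sketch, written as if it were added inside the purell package itself; the `NormalizeURLFunc` name and signature are just the suggestion above, not an existing API:

```go
package purell // hypothetical: sketched as if it lived in the package

import "net/url"

// NormalizeURLFunc applies the normalization flags, then hands the
// *url.URL to a caller-supplied callback for custom munging before
// producing the final normalized string. Sketch only, not a real API.
func NormalizeURLFunc(u *url.URL, f NormalizationFlags, fn func(*url.URL)) string {
	NormalizeURL(u, f) // the existing func; it modifies u in place
	if fn != nil {
		fn(u)
	}
	return u.String()
}
```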
I'll think about that a bit more, and feel free to let me know what you think about this.
Martin
Instead of custom flags or a `NormalizeURLFunc()` with a callback, how about a nice simple:

`func Normalize(u *url.URL, f NormalizationFlags) *url.URL`

which returns a `*url.URL`, upon which the caller can perform their own further custom munging (with a nearby note explaining why they should use `urlesc.Escape()` rather than the `net/url` equivalent :-)

I know you're keen to keep the API minimal, so I think that'd be the most orthogonal way to handle custom munging...
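To make the proposal concrete, here is a rough sketch of how a caller might use it. `Normalize` with this signature doesn't exist in purell, so it is defined here as a thin wrapper over the real `NormalizeURL` (which modifies its argument in place) purely for illustration:

```go
package main

import (
	"fmt"
	"net/url"

	"github.com/PuerkitoBio/purell"
)

// Normalize is the proposed helper, not a real purell function; it is
// sketched on top of the existing NormalizeURL for illustration only.
func Normalize(u *url.URL, f purell.NormalizationFlags) *url.URL {
	purell.NormalizeURL(u, f) // normalizes u in place
	return u
}

func main() {
	u, err := url.Parse("http://example.com/news/some-slug?from=rssfeed")
	if err != nil {
		panic(err)
	}
	n := Normalize(u, purell.FlagsAllGreedy)
	n.RawQuery = ""         // the caller's own custom munging: drop the query
	fmt.Println(n.String()) // http://example.com/news/some-slug
}
```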
I get what you mean about `FlagsAllGreedy` and `FlagsAllNonGreedy`. Actually, I think that's as good a reason as any to reject incompatible flags. A bit arbitrary, maybe, but hey.
Regarding `FlagRemoveQuery` vs a query-filtering blacklist/whitelist... I can only really talk about my own use case, which is dealing with news and blog sites in bulk:

Yes, there are some well-known referral-type strings (like the Google `utm_*` query tags), but sooooo many sites define their own ones too, which screw up my URL-comparison code. Maybe 1 in 100 sites have article URLs with significant query data (eg `http://example.com?article_id=12345`), but for the other 99, the query part is always extraneous and needs to be stripped. So, for my needs, I always start from the assumption that the query part of the URL is just noise.
> Instead of custom flags or a `NormalizeURLFunc()` with a callback, how about a nice simple: `func Normalize(u *url.URL, f NormalizationFlags) *url.URL`
Very true, in fact the existing `NormalizeURL` func already works for this, even though it returns the normalized string. The `*url.URL` value is modified in place, so the URL passed in is normalized after the call, and you can then remove the query string (`u.RawQuery = ""`) if you want to ignore that part. The `urlesc` package was temporary, and that call has not been required since Go 1.5 (see #14).
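A minimal sketch of the pattern described above, using the current `NormalizeURL` and relying on its documented in-place modification (the example URL is made up):

```go
package main

import (
	"fmt"
	"net/url"

	"github.com/PuerkitoBio/purell"
)

func main() {
	u, err := url.Parse("http://Example.com/news/some-slug/?from=rssfeed")
	if err != nil {
		panic(err)
	}
	// NormalizeURL returns the normalized string, but it also modifies
	// u in place, so u can be munged further after the call.
	normalized := purell.NormalizeURL(u, purell.FlagsAllGreedy)
	fmt.Println(normalized) // query still present; host lowercased, etc.

	u.RawQuery = ""         // drop the query entirely
	fmt.Println(u.String()) // http://example.com/news/some-slug
}
```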
So with that said, given that the proposed flag is a one-liner that can be done quite easily with the current API, I think I'll pass on the PR.
Thanks, Martin
I won't be at all offended if you think this flag has no place in purell, but it's one I would use almost exclusively :-)
Context: I'm dealing with lots of news article URLs, where the unique part of the URL is almost always a slug and the query is used only for tracking (eg `?from=rssfeed`). Stripping the query is always my first step when normalizing URLs for comparison. eg (apologies for the dailymail link):
http://www.dailymail.co.uk/news/article-3656841/It-s-Democrats-end-25-hour-sit-Congress-gun-control-walk-cheering-crowd-without-vote-demanded.html?ITO=1490&ns_mchannel=rss&ns_campaign=1490
Just one of the many frustrations when trying to decide if two URLs are referring to the same article - there are loads more. Newspaper sites can commit some real atrocities when dealing with URLs!