Hello Ben,
Thanks for the PR, it looks good! I'm unsure at the moment if it feels right to have this in the package. One thing is the helper flags `FlagsAllGreedy` and `FlagsAllNonGreedy`; those were poorly thought out and kind of prevent adding new flags in a sensible way (either you break the old behaviour of those flags by or-ing in the new one, or you break the semantics by leaving it out - it's not "All" anymore).
Also, I would think a common need may be to remove some well-known, referral-type query strings while keeping others? The new flag wouldn't help in that case. I'm thinking out loud here, but maybe all that's missing is a `FlagCustom`, or a new `NormalizeURLFunc(*url.URL, NormalizationFlags, func(*url.URL))` that calls the func after applying the provided normalization flags, so you can do what you want with the `*url.URL` before returning the normalized string.
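For what it's worth, such a helper would be tiny. A minimal sketch, written as if it were added inside the purell package itself; the `NormalizeURLFunc` name and signature are just the suggestion above, not an existing API:

```go
package purell // hypothetical: sketched as if it lived in the package

import "net/url"

// NormalizeURLFunc applies the normalization flags, then hands the
// *url.URL to a caller-supplied callback for custom munging before
// producing the final normalized string. Sketch only, not a real API.
func NormalizeURLFunc(u *url.URL, f NormalizationFlags, fn func(*url.URL)) string {
	NormalizeURL(u, f) // the existing func; it modifies u in place
	if fn != nil {
		fn(u)
	}
	return u.String()
}
```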
I'll think about that a bit more, and feel free to let me know what you think about this.
Martin
Instead of custom flags or a `NormalizeURLFunc()` with a callback, how about a nice simple:

`func Normalize(u *url.URL, f NormalizationFlags) *url.URL`

which returns a `*url.URL`, upon which the caller can perform their own further custom munging (with a nearby note explaining why they should use `urlesc.Escape()` rather than the `net/url` equivalent :-)

I know you're keen to keep the API minimal, so I think that'd be the most orthogonal way to handle custom munging...
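To make the proposal concrete, here is a rough sketch of how a caller might use it. `Normalize` with this signature doesn't exist in purell, so it is defined here as a thin wrapper over the real `NormalizeURL` (which modifies its argument in place) purely for illustration:

```go
package main

import (
	"fmt"
	"net/url"

	"github.com/PuerkitoBio/purell"
)

// Normalize is the proposed helper, not a real purell function; it is
// sketched on top of the existing NormalizeURL for illustration only.
func Normalize(u *url.URL, f purell.NormalizationFlags) *url.URL {
	purell.NormalizeURL(u, f) // normalizes u in place
	return u
}

func main() {
	u, err := url.Parse("http://example.com/news/some-slug?from=rssfeed")
	if err != nil {
		panic(err)
	}
	n := Normalize(u, purell.FlagsAllGreedy)
	n.RawQuery = ""         // the caller's own custom munging: drop the query
	fmt.Println(n.String()) // http://example.com/news/some-slug
}
```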
I get what you mean about `FlagsAllGreedy` and `FlagsAllNonGreedy`. Actually, I think that's as good a reason as any to reject incompatible flags. A bit arbitrary, maybe, but hey.
Regarding `FlagRemoveQuery` vs a query-filtering blacklist/whitelist... I can only really talk about my own use case, which is dealing with news and blog sites in bulk:

Yes, there are some well-known referral-type strings (like the Google `utm_*` query tags), but sooooo many sites define their own ones too, which screw up my URL-comparison code. Maybe 1 in 100 sites have article URLs with significant query data (eg `http://example.com?article_id=12345`), but for the other 99, the query part is always extraneous and needs to be stripped. So, for my needs, I always start from the assumption that the query part of the URL is just noise.
> Instead of custom flags or a `NormalizeURLFunc()` with a callback, how about a nice simple: `func Normalize(u *url.URL, f NormalizationFlags) *url.URL`
Very true, in fact the existing `NormalizeURL` func already works for this, even though it returns the normalized string. The `*url.URL` value is modified in place, so the URL passed in is normalized after the call, and you can then remove the query string (`u.RawQuery = ""`) if you want to ignore that part. The `urlesc` package was temporary, and that call has not been required since Go 1.5 (see #14).
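A minimal sketch of the pattern described above, using the current `NormalizeURL` and relying on its documented in-place modification (the example URL is made up):

```go
package main

import (
	"fmt"
	"net/url"

	"github.com/PuerkitoBio/purell"
)

func main() {
	u, err := url.Parse("http://Example.com/news/some-slug/?from=rssfeed")
	if err != nil {
		panic(err)
	}
	// NormalizeURL returns the normalized string, but it also modifies
	// u in place, so u can be munged further after the call.
	normalized := purell.NormalizeURL(u, purell.FlagsAllGreedy)
	fmt.Println(normalized) // query still present; host lowercased, etc.

	u.RawQuery = ""         // drop the query entirely
	fmt.Println(u.String()) // http://example.com/news/some-slug
}
```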
So with that said, given that the proposed flag is a one-liner that can be done quite easily with the current API, I think I'll pass on the PR.
Thanks, Martin
I won't be at all offended if you think this flag has no place in purell, but it's one I would use almost exclusively :-)
Context: I'm dealing with lots of news article URLs, where the unique part of the URL is almost always a slug and the query is used only for tracking (eg `?from=rssfeed`). Stripping the query is always my first step when normalizing URLs for comparison. eg (apologies for the dailymail link):
http://www.dailymail.co.uk/news/article-3656841/It-s-Democrats-end-25-hour-sit-Congress-gun-control-walk-cheering-crowd-without-vote-demanded.html?ITO=1490&ns_mchannel=rss&ns_campaign=1490
Just one of the many frustrations when trying to decide if two URLs are referring to the same article - there are loads more. Newspaper sites can commit some real atrocities when dealing with URLs!