gamache / fuzzyurl.rb

A Ruby Gem for non-strict parsing, manipulation, and wildcard matching of URLs.
MIT License

Fuzzyurl blows up when the url includes a trailing space #2

Open fuzzygroup opened 7 years ago

fuzzygroup commented 7 years ago

I was curious about replacing my internal URL handling libraries with Fuzzyurl, so I ran through 88,208 URLs out of a 250K URL sample, and it died on this URL:

```
2.3.1 :002 > p = Page2016Q2.find(89100)
 => #<Page2016Q2 id: 89100, created_at: "2016-06-21 22:56:11", updated_at: "2016-06-21 22:56:11", date_created_at: "2016-06-21", title: " News, Sports, Jobs - The Intelligencer / Wheeling...", url: "http://www.theintelligencer.net ", url_no_www: "theintelligencer.net ", url_no_www_sha1: "978dfa9737dfcb6164a0118ffb30a31cce10af3f", site_id: 87177, crawl_id: 76958, page_body_id: 89100, long_url_id: nil, html_size: 76453, ascii_size: 0, link_id: 2645550>
2.3.1 :003 > puts "|#{p.url}|"
|http://www.theintelligencer.net |
 => nil
```

The issue is that a trailing space slipped into the URL, and it blows up the parsing:

```
2.3.1 :004 > fu = Fuzzyurl.from_string(p.url)
ArgumentError: Couldn't parse url string: http://www.theintelligencer.net 
	from /var/www/apps/banks/shared/bundle/ruby/2.3.0/gems/fuzzyurl-0.9.0/lib/fuzzyurl/strings.rb:32:in `from_string'
	from /var/www/apps/banks/shared/bundle/ruby/2.3.0/gems/fuzzyurl-0.9.0/lib/fuzzyurl.rb:169:in `from_string'
	from (irb):4
	from /var/www/apps/banks/shared/bundle/ruby/2.3.0/gems/railties-4.2.7/lib/rails/commands/console.rb:110:in `start'
	from /var/www/apps/banks/shared/bundle/ruby/2.3.0/gems/railties-4.2.7/lib/rails/commands/console.rb:9:in `start'
	from /var/www/apps/banks/shared/bundle/ruby/2.3.0/gems/railties-4.2.7/lib/rails/commands/commands_tasks.rb:68:in `console'
	from /var/www/apps/banks/shared/bundle/ruby/2.3.0/gems/railties-4.2.7/lib/rails/commands/commands_tasks.rb:39:in `run_command!'
	from /var/www/apps/banks/shared/bundle/ruby/2.3.0/gems/railties-4.2.7/lib/rails/commands.rb:17:in `<top (required)>'
	from bin/rails:4:in `require'
	from bin/rails:4:in `'
```
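Until the library handles this itself, a caller-side workaround is to trim the string before parsing. A minimal sketch using plain `String#strip`; the `Fuzzyurl.from_string` call is left commented out since it requires the gem to be installed:

```ruby
url = "http://www.theintelligencer.net "  # trailing space, as stored in the DB
clean = url.strip                         # removes leading/trailing whitespace
# clean == "http://www.theintelligencer.net"

# With the gem available, parsing the cleaned string should then succeed:
# fu = Fuzzyurl.from_string(clean)
```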

gamache commented 7 years ago

Thanks for raising this issue! I will have to think on this a bit, because I do not consider "http://theintelligencer.net" (with trailing space) a valid URL -- spaces in hostnames are illegal per RFC 952. I see potential value in stripping whitespace from URLs before parsing, but I want to think of potential drawbacks to the approach before adopting it.

gamache commented 7 years ago

I didn't have to think at all; a committee did it for me! From RFC 1738, Uniform Resource Locators:

> Characters can be unsafe for a number of reasons.  The space
> character is unsafe because significant spaces may disappear and
> insignificant spaces may be introduced when URLs are transcribed or
> typeset or subjected to the treatment of word-processing programs.
> The characters "<" and ">" are unsafe because they are used as the
> delimiters around URLs in free text; the quote mark (""") is used to
> delimit URLs in some systems.  The character "#" is unsafe and should
> always be encoded because it is used in World Wide Web and in other
> systems to delimit a URL from a fragment/anchor identifier that might
> follow it.  The character "%" is unsafe because it is used for
> encodings of other characters.  Other characters are unsafe because
> gateways and other transport agents are known to sometimes modify
> such characters. These characters are "{", "}", "|", "\", "^", "~",
> "[", "]", and "`".
>
> All unsafe characters must always be encoded within a URL. For
> example, the character "#" must be encoded within URLs even in
> systems that do not normally deal with fragment or anchor
> identifiers, so that if the URL is copied into another system that
> does use them, it will not be necessary to change the URL encoding.
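As an aside, the RFC's encoding rule can be demonstrated with Ruby's standard library. A sketch using `URI::RFC2396_Parser#escape`, which percent-encodes characters outside its safe set, including the space and "#" characters called out above (the input string is just an illustration):

```ruby
require 'uri'

parser = URI::RFC2396_Parser.new
# Percent-encode RFC 1738 "unsafe" characters such as space and "#":
parser.escape("a b#c")  # => "a%20b%23c"
```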

Good enough for me. I will strip whitespace in the next release.

Are you interested in having that corpus of URLs integrated into the Fuzzyurl project as test data?

fuzzygroup commented 7 years ago

Excellent, excellent. Thank you so much, Pete. Let me talk to the founder here, and I'll see about getting this or an even larger subset of URLs released. There are some absolute stinkers in this project.