URL pattern matching doesn't match every URL

Another user hijacked Issue 26 with an improvement to the URL pattern matching:

// private static final Pattern URL_REGEX =
//     Pattern.compile("https?://[-_.?&~;+=/#0-9A-Za-z]{1,2076}");
private static final Pattern URL_REGEX =
    Pattern.compile("https?://[-_.,?&~;+=/#0-9A-Za-z]{1,2076}");

I think it's possible to do better because there are a number of issues:

1. The host part doesn't use a separate regular expression.  Hosts can't 
contain "?", "&", ";" and so forth, so matching the host separately would let 
the regular expression reject non-matches more quickly.
2. There are more URL schemes than just "http" and "https".
3. Some URL schemes are more structured than others.  For instance, "mailto" 
doesn't actually have any of the slashes (all "opaque" URL schemes are like 
this).
4. It might be good to match international URLs too, but this pattern only 
matches ASCII ones.
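
To make the four points concrete, here is a rough sketch of what a pattern along these lines might look like. It is not a proposed patch: the class name UrlPatternSketch, the helper findUrls, the choice of "ftp" as an extra scheme, the simplified mailto form and the length bounds are all assumptions made for illustration, and the pattern is far from a full RFC 3986 matcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the library's actual code.
final class UrlPatternSketch {
    // Point 4: \p{L} and \p{N} accept non-ASCII letters and digits.
    // Point 1: a host label must start with a letter or digit and may not
    // contain "?", "&", ";" etc., so bad hosts are rejected early.
    private static final String LABEL = "[\\p{L}\\p{N}][-\\p{L}\\p{N}]{0,62}";
    private static final String HOST = LABEL + "(?:\\." + LABEL + ")*";

    // Optional port, then an optional path/query/fragment.
    // The 2000-character bound is arbitrary (an assumption).
    private static final String TAIL =
        "(?::\\d{1,5})?(?:[/?#][-_.,?&~;+=/#%:@\\p{L}\\p{N}]{0,2000})?";

    // Point 2: more than one hierarchical scheme ("ftp" is just an example).
    private static final String HIERARCHICAL = "(?:https?|ftp)://" + HOST + TAIL;

    // Point 3: an opaque scheme has no "//"; simplified mailto form.
    private static final String OPAQUE =
        "mailto:[-_.+%\\p{L}\\p{N}]{1,64}@" + HOST;

    static final Pattern URL_REGEX =
        Pattern.compile("(?:" + HIERARCHICAL + "|" + OPAQUE + ")");

    // Returns every substring of text that the sketch pattern matches.
    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_REGEX.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```

With this sketch, findUrls("see http://example.com/a?b=1&c=2 now") returns just the URL, "mailto:user@example.org" is matched even though it has no slashes, and "http://?query" is not matched at all because "?" cannot start a host.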

Original issue reported on code.google.com by trejkaz on 19 Oct 2011 at 9:50
