StephenNi / language-detection

Automatically exported from code.google.com/p/language-detection

URL pattern matching doesn't match every URL #27

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Another user hijacked Issue 26 with an improvement to the URL pattern matching:

//private static final Pattern URL_REGEX = Pattern.compile("https?://[-_.?&~;+=/#0-9A-Za-z]{1,2076}");
private static final Pattern URL_REGEX = Pattern.compile("https?://[-_.,?&~;+=/#0-9A-Za-z]{1,2076}");

I think it's possible to do better because there are a number of issues:

1. The host part doesn't use a separate regular expression.  Hosts can't 
contain "?", "&", ";" and so forth, so matching the host separately would let 
the regular expression reject non-matches more quickly.
2. There are more URL schemes than just "http" and "https".
3. Some URL schemes are more structured than others.  For instance, "mailto" 
doesn't actually have any of the slashes (all "opaque" URL schemes are like 
this).
4. It might be good to match international URLs too, but this pattern only 
matches ASCII ones.
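
As a rough illustration of points 1 to 3, here is a minimal sketch and not a proposed patch. The class name, the method name, both pattern names, the scheme list and the length limits are all assumptions made for this example; none of them come from the library. It uses a narrower character class for the host than for the rest of the URL, and a separate pattern for the opaque "mailto" scheme. Point 4 (non-ASCII URLs) is not covered.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of language-detection.
class UrlPatternSketch {
    // Hierarchical schemes: scheme, "://", host, optional port, optional rest.
    // The host class deliberately excludes "?", "&", ";" and similar characters.
    static final Pattern HIERARCHICAL_URL = Pattern.compile(
        "(?:https?|ftp)://[-.0-9A-Za-z]{1,253}(?::[0-9]{1,5})?"
        + "(?:[/?#][-_.,?&~;+=/#%0-9A-Za-z]{0,2076})?");

    // Opaque schemes such as "mailto" have no "//" after the colon.
    static final Pattern OPAQUE_URL = Pattern.compile(
        "mailto:[-_.+%0-9A-Za-z]{1,64}@[-.0-9A-Za-z]{1,253}");

    // Replaces every match of either pattern with a single space.
    static String removeUrls(String text) {
        String withoutHierarchical = HIERARCHICAL_URL.matcher(text).replaceAll(" ");
        return OPAQUE_URL.matcher(withoutHierarchical).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(removeUrls(
            "see http://example.com/a?b=c,d or mailto:someone@example.com now"));
    }
}
```

Because the host class allows ".", a sentence-ending period directly after a bare host is removed along with the URL, which should be harmless when the goal is only to strip URLs before detection.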

Original issue reported on code.google.com by trejkaz on 19 Oct 2011 at 9:50

GoogleCodeExporter commented 8 years ago
The purpose of this filtering is not to extract mail addresses or URLs but to 
REMOVE them to improve detection accuracy.
In real text, many such strings do not follow the rules anyway.
So I consider a strict pattern unnecessary.

I also expect each application to remove more specific representations 
itself, if necessary.

Original comment by nakatani.shuyo on 20 Oct 2011 at 6:35