URLs plugin - Better link extraction

rakiru commented 10 years ago

At the moment, it finds links by searching for https?:// and reading from there until it hits a space. A minor issue is that this matches even when the protocol sits at the end of a word (e.g. derphttp://), which is somewhat weird and could become a problem if issue #62 is implemented. The more major issue is that links are sometimes wrapped in quotes, parentheses, etc., and those characters are currently treated as part of the link. Obviously, it could be a bit dodgy to try parsing something like Lorem ipsum (dolor sit amet http://example.com/), but plain (http://example.com/) or "http://example.com/" should be easy enough. Perhaps take a look at how popular IRC clients such as HexChat do it, and try to match that functionality.
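
For illustration, the behaviour described here can be approximated in a few lines of Python. This is only a sketch: the plugin's actual code is not shown in this issue, and `NAIVE_URL` and `naive_links` are made-up names.

```python
import re

# Assumed approximation of the current behaviour: "https?://" followed by
# everything up to the next whitespace character.
NAIVE_URL = re.compile(r"https?://\S+")


def naive_links(text):
    """Return every naive match found in text."""
    return NAIVE_URL.findall(text)


print(naive_links("derphttp://example.com/"))        # ['http://example.com/'] (matched mid-word)
print(naive_links('see "http://example.com/" now'))  # ['http://example.com/"'] (closing quote kept)
print(naive_links("(http://example.com/)"))          # ['http://example.com/)'] (closing paren kept)
```

In this sketch the opening quote or parenthesis is never captured, because the search only starts at "http"; it is the closing character that ends up glued to the link.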

gdude2002 commented 9 years ago

Lel. I had some free time and was bored, so I came up with this.

(?P<bracket>\(|)(?P<prefix>[^a-zA-Z\n]|[\(\)]|)(?P<protocol>[a-zA-Z0-9]+)://(?P<domain>[^/:\n\s]+)(?P<port>:[0-9]+|)(?:(?(bracket)(?P<url>/[^\)\n\s]+))|)(?P<end>(?(prefix)(?P=prefix)))

Expanded (with the x flag, i.e. verbose mode), this looks like:

(?P<bracket>\(|)
(?P<prefix>[^a-zA-Z\n]|[\(\)]|)
(?P<protocol>[a-zA-Z0-9]+)
://
(?P<domain>[^/:\n\s]+)
(?P<port>:[0-9]+|)
(?:
  (?(bracket)
    (?P<url>/[^\)\n\s]+)
  )|
)
(?P<end>
  (?(prefix)
    (?P=prefix)
  )
)

I tested it with the input below, and it matches everything, with two problems:

The port also contains the ":" and I'm not sure how to fix that.
This doesn't support URLs with usernames/passwords in them for basic auth.

1  | http://google.com
2  | http://google.com:22
3  | http://google.com/some/url.html
4  | "http://google.com/lerp/merp"
5  | (http://google.com/herp/derp.html)
6  | ftp://ivy.gserv.me
7  | steam://stuff/stuff/more_stuff+search%20
8  | sftp://ivy.gserv.me
9  | http://google.com"
10 | 'http://af.aewrw/arew'
11 | #irc://irc.esper.net:1234/archives#

Results:

1  | {"bracket": "", "prefix": "", "protocol": "http", "domain": "google.com"}
2  | {"bracket": "", "prefix": "", "protocol": "http", "domain": "google.com", "port": ":22"}
3  | {"bracket": "", "prefix": "", "protocol": "http", "domain": "google.com", "port": "", "url": "/some/url.html"}
4  | {"bracket": "", "prefix": "\"", "protocol": "http", "domain": "google.com", "port": "", "url": "/lerp/merp", "end": "\""}
5  | {"bracket": "(", "prefix": "", "protocol": "http", "domain": "google.com", "port": "", "url": "/herp/derp.html"}
6  | {"bracket": "", "prefix": "", "protocol": "ftp", "domain": "ivy.gserv.me"}
7  | {"bracket": "", "prefix": "", "protocol": "steam", "domain": "stuff", "port": "", "url": "/stuff/more_stuff+search%20"}
8  | {"bracket": "", "prefix": "", "protocol": "sftp", "domain": "ivy.gserv.me"}
9  | {"bracket": "", "prefix": "", "protocol": "http", "domain": "google.com\""}
10 | {"bracket": "", "prefix": "\'", "protocol": "http", "domain": "af.aewrw", "port": "", "url": "/arew", "end": "\'"}
11 | {"bracket": "", "prefix": "#", "protocol": "irc", "domain": "irc.esper.net", "port": ":1234", "url": "/archives", "end": "#"}
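
One possible way around the first problem, sketched on a cut-down fragment rather than the full pattern: keep the ":" in a non-capturing group so that only the digits land in the port group. `HOST_PORT` and `split_host` are hypothetical names used just for this example.

```python
import re

# Hypothetical fragment, not the full pattern from this comment:
# the ":" sits in a non-capturing group, so "port" holds digits only.
HOST_PORT = re.compile(r"(?P<domain>[^/:\n\s]+)(?::(?P<port>[0-9]+))?")


def split_host(text):
    """Return (domain, port) for a host string such as 'google.com:22'."""
    match = HOST_PORT.match(text)
    return match.group("domain"), match.group("port")


print(split_host("google.com:22"))  # ('google.com', '22')
print(split_host("google.com"))     # ('google.com', None)
```

The trade-off is that a missing port now shows up as None instead of an empty string, so any code reading the group would need to allow for that.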

Zarthus commented 9 years ago

I personally use the following regex plus Python code:


import re

# This series of regexes is meant to work together. Alone they might lead to mismatches.
# The entire regex looks like this: (?:([\w\-_]+?)(?::\/\/))?([\w\-_\.]+)\.([a-z]{2,16})(?::(\+?[1-9]\d{1,4}))?(\/[^ ]+)?
RE_PROTOCOL = r"(?:([\w\-_]+?)(?::\/\/))?"  # (1) match the protocol, strip :// e.g. https or ssh
RE_DOMAIN = r"([\w\-_\.]+)"  # (2) match the domain, and any subdomains. e.g. www.google, google, or maps.google
RE_TLD = r"\.([a-z]{2,16})"  # (3) The top level domain, does not account for unicode. e.g. com, info, org
RE_PORT = r"(?::(\+?[1-9]\d{1,4}))?"  # (4) The port, two to five digits. e.g. google.com:100 extracts 100
RE_PATH = r"(\/[^ ]+)?"  # (5) Get everything after the top level domain / port starting with a slash. e.g. /hi.html
URL_REGEX = re.compile(RE_PROTOCOL + RE_DOMAIN + RE_TLD + RE_PORT + RE_PATH, re.I)

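    # in class context
    def _parse(self):
        """Parse a URL, filling all variables with their respective values."""
        match = URL_REGEX.match(self.url)

        if not match:
            return False

        groups = match.groups()
        # groups[1] holds the domain plus any subdomains, e.g. "maps.google"
        subdom = groups[1].split('.')
        # The last label is the domain itself; what remains are the subdomains.
        # pop() avoids removing an identically named subdomain by mistake.
        dom = subdom.pop()

        self.protocol = groups[0]
        self.subdomains = subdom
        self.domain = dom
        self.tld = groups[2]
        self.port = groups[3]
        self.path = groups[4]
        # Return True on success so the result is consistent with the False above
        return True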

It doesn't entirely match what you'd like, but that's how I went about detecting URLs.
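
To make the group numbering concrete, here is a small sketch of what the combined pattern captures for one sample URL. The regex pieces are copied from the code above; the sample URL is made up for the example.

```python
import re

# Pieces copied from the code above.
RE_PROTOCOL = r"(?:([\w\-_]+?)(?::\/\/))?"
RE_DOMAIN = r"([\w\-_\.]+)"
RE_TLD = r"\.([a-z]{2,16})"
RE_PORT = r"(?::(\+?[1-9]\d{1,4}))?"
RE_PATH = r"(\/[^ ]+)?"
URL_REGEX = re.compile(RE_PROTOCOL + RE_DOMAIN + RE_TLD + RE_PORT + RE_PATH, re.I)

# Made-up sample input.
match = URL_REGEX.match("https://maps.google.com:8080/hi.html")
print(match.groups())
# ('https', 'maps.google', 'com', '8080', '/hi.html')
```

The domain group keeps the subdomains attached ("maps.google"), which is why _parse splits it on "." afterwards.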

gdude2002 commented 9 years ago

Simplified what I had a bit.

(?P<prefix>[^\w\s\n]|)(?P<protocol>[\w]+)://(?P<domain>[^/:\n\s]+)(?P<port>:[0-9]+|)(?P<url>[^\s\n]+|)

(?P<prefix>[^\w\s\n]|)
(?P<protocol>[\w]+)
://
(?P<domain>[^/:\n\s]+)
(?P<port>:[0-9]+|)
(?P<url>[^\s\n]+|)

This seems fine - we can check the prefix and see if it's a bracket that's been matched, and then remove the paired one at the end - otherwise, remove the prefix from the end (both in code).
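
The post-processing described here could look roughly like the sketch below. `strip_wrapping` and the `PAIRS` table are hypothetical names, and the set of bracket pairs is an assumption. It works on the whole matched text rather than on the url group, because with no path the closing character ends up in the domain group instead.

```python
import re

# Simplified pattern from this comment, compiled in verbose mode.
URL = re.compile(r"""
    (?P<prefix>[^\w\s\n]|)
    (?P<protocol>[\w]+)
    ://
    (?P<domain>[^/:\n\s]+)
    (?P<port>:[0-9]+|)
    (?P<url>[^\s\n]+|)
""", re.VERBOSE)

# Assumed bracket pairs; any other prefix is expected to close with itself.
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}


def strip_wrapping(match):
    """Return the matched link without its prefix and matching suffix."""
    prefix = match.group("prefix")
    link = match.group(0)[len(prefix):]
    # A bracket is closed by its partner, anything else by the same character.
    closer = PAIRS.get(prefix, prefix)
    if closer and link.endswith(closer):
        link = link[:-len(closer)]
    return link


print(strip_wrapping(URL.search("(http://example.com/)")))  # http://example.com/
print(strip_wrapping(URL.search('"http://example.com/"')))  # http://example.com/
print(strip_wrapping(URL.search("(http://example.com)")))   # http://example.com
```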

The only problem now is username:password@.

gdude2002 commented 9 years ago

This one matches basic auth.

(?P<prefix>[^\w\s\n]|)
(?P<protocol>[\w]+)
://
(?P<basic>[\w]+:[\w]+|)(?:@|)
(?P<domain>[^/:\n\s]+)
(?P<port>:[0-9]+|)
(?P<url>/[^\s\n]+|)

The only thing that needs to be thought about now is Unicode.

gdude2002 commented 9 years ago

Finally, this one matches things like steam URLs properly.

(?P<prefix>[^\w\s\n]|)
(?P<protocol>[\w]+)
://
(?P<basic>[\w]+:[\w]+|)(?:@|)
(?P<domain>[^/:\n\s]+\.[^/:\n\s]+|)
(?P<port>:[0-9]+|)
(?P<url>[^\s\n]+|)

Any ideas about Unicode?

gdude2002 commented 9 years ago

Forgot about more than one prefix char.

(?P<prefix>[^\w\s\n]+|)
(?P<protocol>[\w]+)
://
(?P<basic>[\w]+:[\w]+|)
(?:@|)
(?P<domain>[^/:\n\s]+\.[^/:\n\s]+|)
(?P<port>:[0-9]+|)
(?P<url>[^\s\n]+|)
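
As a rough check of this last pattern, it can be compiled in verbose mode and run against a basic-auth URL and a steam-style URL. The sample basic-auth URL and the names `FINAL` and `parts` are assumptions made for the example.

```python
import re

# Pattern copied from this comment, compiled in verbose mode.
FINAL = re.compile(r"""
    (?P<prefix>[^\w\s\n]+|)
    (?P<protocol>[\w]+)
    ://
    (?P<basic>[\w]+:[\w]+|)
    (?:@|)
    (?P<domain>[^/:\n\s]+\.[^/:\n\s]+|)
    (?P<port>:[0-9]+|)
    (?P<url>[^\s\n]+|)
""", re.VERBOSE)


def parts(text):
    """Return the named groups for the first match at the start of text."""
    return FINAL.match(text).groupdict()


auth = parts("http://user:pass@example.com:8080/path")
print(auth["basic"], auth["domain"], auth["port"], auth["url"])
# user:pass example.com :8080 /path

steam = parts("steam://stuff/stuff/more_stuff+search%20")
print(repr(steam["domain"]), steam["url"])
# '' stuff/stuff/more_stuff+search%20
```

For the steam URL the domain group stays empty because there is no "." before the first "/", so everything after "://" falls into the url group.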

gdude2002 commented 9 years ago

Closed in favour of #69

UltrosBot / Ultros

URLs plugin - Better link extraction #61