coala / coala-bears

Bears for coala
https://coala.io/
GNU Affero General Public License v3.0

URLBear: use library to extract links #1342

Open jayvdb opened 7 years ago

jayvdb commented 7 years ago

As we hit more corner cases for url extraction, our regex approach will be a limiting factor, and we should switch to using an external library. We would need to import our regression tests into the external library.

https://pypi.python.org/pypi/urlextract might be useful.

Or we may find a 'recursive' web scraper which handles file types other than .html, i.e. there may be a function in https://scrapy.org which handles plain text documents.
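
For illustration, a minimal sketch of how a library like urlextract could be wired in, assuming its find_urls API and iterating per line so the bear can still report line numbers (the helper name extract_urls_per_line is made up here):

    from urlextract import URLExtract

    _extractor = URLExtract()

    def extract_urls_per_line(file_lines):
        """Yield (line_number, url) pairs found by urlextract."""
        for line_number, line in enumerate(file_lines, start=1):
            for url in _extractor.find_urls(line):
                yield line_number, url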

sils commented 7 years ago

@maweki once proposed making a proper parser for this instead of a regex. Using/improving an external library sounds gorgeous; we should make sure all our existing test cases pass.

maweki commented 7 years ago

The problem the regex parser had (and still has) is that it probably can't parse all valid urls 100% correctly according to the RFC I am too lazy to link here.

On the other hand, the InvalidLinkBear should, depending on both context and content, ignore certain urls. There was the example that the host might be example.com or localhost and can be ignored. Any url whose scheme is not one of http/https/ftp, or whatever we generally support, cannot be checked at all (this depends on the protocol support our user-agent, e.g. coala, has). And the initial discussion @sils refers to came about because urls within xml definitions or owl datasources need not be physically accessible or represent any actual content that could be retrieved; they are just persistent and unique identifiers for some thing or idea.
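
A rough sketch of what such a context filter could look like, using only the standard library; the ignore sets, the supported schemes and the name should_check are purely illustrative, not what the bear actually uses:

    from urllib.parse import urlparse

    IGNORED_HOSTS = {'example.com', 'localhost'}    # illustrative defaults only
    SUPPORTED_SCHEMES = {'http', 'https', 'ftp'}    # what our user-agent can actually fetch

    def should_check(url):
        """Return True if the url is worth an accessibility check at all."""
        parsed = urlparse(url)
        if parsed.scheme and parsed.scheme not in SUPPORTED_SCHEMES:
            return False                 # e.g. urn:, mailto:, file: are skipped
        if parsed.hostname in IGNORED_HOSTS:
            return False                 # placeholder / local hosts are skipped
        return True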

I think we had the most problems with including context within a single regex that is just supposed to check content. So maybe the whole structure of the InvalidLinkBear needs to be rethought, to separate content from context.

Also, content isn't that easy either. If we had something like "please go to www.example.com/www.example2.com", we couldn't tell whether two separate domains without any document are meant, or one domain with a document.

(Edit: thanks github for illustrating my point)

So I basically think that the InvalidLinkBear should do the following: find the largest possible url (this was tricky with whitespace being allowed in quoted urls, if I remember correctly). Build up some context, like which kind of document or area of code this occurs in. Split the match into sensible shorter urls according to that context (are quoted whitespaced strings allowed in markdown or just in anything html-ish? Are we in an xml definition? Is this a context where we're just interested in domains?) and check all of them (if the context tells us we don't want to check at all, this set would simply be empty). If all of them fail, inform the user of the tried variants and let the bear fail. If there are multiple possible variants and only some fail, let the bear succeed, but in debug/verbose mode show all tried variants.

But yeah, this needs reworking to include the context, which seems to be a hard barrier for the regex approach.
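
Very roughly, that flow could look like this; examine, candidates_for and check_url are placeholder names for this sketch, not existing code:

    def examine(largest_match, candidates_for, check_url):
        """Try every context-dependent variant of a greedy url match.

        candidates_for and check_url stand in for the context-aware
        splitting and for the actual accessibility check.
        Returns (ok, tried) so the bear can report the tried variants.
        """
        candidates = candidates_for(largest_match)
        if not candidates:
            return True, []              # context says: do not check this match
        tried = []
        for candidate in candidates:
            tried.append(candidate)
            if check_url(candidate):
                return True, tried       # one working variant is enough
        return False, tried              # all variants failed; report them all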

maweki commented 7 years ago

To take the example from https://github.com/coala/coala-bears/issues/1338, http://harelba.github.io/q/](http://harelba.github.io/q/) is basically the largest url (because all of this is allowed), but the sensible shorter urls in markdown context, where we would split at (, ), [ and ], would be the set:

  1. http://harelba.github.io/q/](http://harelba.github.io/q/)
  2. http://harelba.github.io/q/](http://harelba.github.io/q/
  3. http://harelba.github.io/q/](
  4. http://harelba.github.io/q/]
  5. http://harelba.github.io/q/

Also we would possibly get a second overlapping hit for http://harelba.github.io/q/), which would be split according to the same rules.
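
A toy illustration of that splitting; the delimiter set and the name markdown_candidates are made up for this sketch:

    import re

    MARKDOWN_DELIMITERS = r'[()\[\]]'    # markdown-ish split points

    def markdown_candidates(greedy_match):
        """Return the greedy match plus the shorter candidates obtained
        by cutting it just before each markdown delimiter, longest first."""
        cut_points = [m.start()
                      for m in re.finditer(MARKDOWN_DELIMITERS, greedy_match)]
        candidates = [greedy_match]
        for position in sorted(cut_points, reverse=True):
            prefix = greedy_match[:position]
            if prefix and prefix not in candidates:
                candidates.append(prefix)
        return candidates

For the example above this yields the full greedy match plus the three prefixes that stop just before the ), the ( and the ] respectively.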

ankitxjoshi commented 6 years ago

If we intend to use urlextract, this issue should be blocked until the following issues are resolved upstream:

  1. https://github.com/lipoja/URLExtract/issues/13
  2. https://github.com/lipoja/URLExtract/issues/14
  3. https://github.com/lipoja/URLExtract/issues/15
jayvdb commented 6 years ago

Great, but we should also look at other libraries. It can be hard to find obscure libraries like this, but often very rewarding.

maweki commented 6 years ago

The problem sometimes is that the parsers work as intended (here https://github.com/lipoja/URLExtract/issues/13 - this is allowed; a lot is allowed in the path part and even more in the query part of the url). Some of the stuff we parse is correctly a url, but depending on the context it would be more sensible to use one of the smaller matches. A regex-based parser is possibly (probably) not the best choice for the many overlapping matches.

ankitxjoshi commented 6 years ago

Yes, but they aren't allowed in the termination as far as I know. And yes, I totally agree that a perfect regex parser is a difficult task. At best we can just limit our tolerance for the definition of a valid url.

ankitxjoshi commented 6 years ago

Scrapy as of now doesn't provide any function capable of extracting urls from raw text. I had a discussion on their channel regarding this and they suggested using other 3rd-party libraries instead.

maweki commented 6 years ago

but they aren't allowed in the termination as far as I know

Please consider reading the RFC. https://tools.ietf.org/html/rfc3986#section-3.4

Here are the definitions (from RFC 3986):

    query       = *( pchar / "/" / "?" )
    fragment    = *( pchar / "/" / "?" )
    pchar       = unreserved / pct-encoded / sub-delims / ":" / "@"
    sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                / "*" / "+" / "," / ";" / "="

A perfectly valid url with a query part and without a fragment can basically end with anything but "#" / "[" / "]". The fragment could additionally end with a question mark.

jayvdb commented 6 years ago

URLExtract has been fixing lots of bugs, so it looks like it will be a viable solution soon.