URLs plugin rewrite - Githubissues

gdude2002 commented 9 years ago

This ticket goes hand-in-hand with #62 and #61 - The URLs plugin needs rewritten.

I've already started this, in the URLs branch. URLs needs to be made cleaner, more modular, and more efficient, and there are several steps to get to that stage.

[x] Firstly, we need to look at how URLs are detected. As detailed in #61, I'll be implementing my latest regex for this, unless we come up with a better idea.
[ ] After that, we need to implement the functionality. I believe @rakiru wants to use Treq and deferreds instead of threads and urllib, and I would totally agree with this use.
[ ] Finally, we need to make sure it all ties together, and make sure that any plugins using the URLs API as it is don't break or are updated.

rakiru commented 9 years ago

perhaps we should even have catching, parsing and dispatching to handlers as separate modules

What? That doesn't even make sense. Aside from the fact that the whole point of the plugin is to catch and parse URLs, whatever part does this will have to dispatch it in some sense wherever else would do the "dispatching".

gdude2002 commented 9 years ago

Well hey, this is why we have tickets.

Yeah, I guess that makes sense.

gdude2002 commented 9 years ago

Okay, so I've been thinking about this, while we can easily support matching chars before and after the URL, what paired chars do we need to support?

As an example, here's what I came up with.

()
[]
{}
<>

gdude2002 commented 9 years ago

Additionally, does this seem sane, @rakiru?

class Handler(object):
    """
    URL handler. Subclass this!

    You'll want to override both the `call` method and the *criteria* dict.
    The former is called if the criteria matches the URL.

    In the criteria dict, you're expected to provide values to test equality
    for. However, there are a few things to be aware of.

    * Leave a key out if you don't care about matching it - None will be
      matched against.
    * You can use plugins.urls.utils.notnone.NotNone when you only want to be
      sure something exists.
    * You may provide a compiled regular expression to test against as well.
    * Finally, you may provide a callable (function or class), which will be
      run for the comparison instead, and should return either True or False.

    >>> criteria = {
    ...     # No protocol here, since we don't care about it
    ...     "auth": NotNone,
    ...     "domain": re.compile("[a-zA-Z]+"),
    ...     "port": lambda x: x > 8080,
    ...     "path": lambda x: len(x) > 10
    ... }
    >>> 
    """

    plugin = None

    criteria = {
        "protocol": None,
        "auth": None,
        "domain": None,
        "port": None,
        "path": None
    }

    def __init__(self, plugin):
        self.plugin = plugin

    def call(self, url):
        raise NotImplementedError()

rakiru commented 9 years ago

" and ' would be wise to support too.

NotNone does not sound sane, no. I'm not sure the use-case for needing any-value-but-None is great enough to warrant not simply using lambda x: x is not None. The rest of it sounds good though, although I'm wondering if there will ever be a case where something relies on more than one criteria in a linked way.

gdude2002 commented 9 years ago

Yeah, that's fair actually, can just use a lambda.

Also, " and ' aren't paired, are they? I meant stuff we'd need to transform as they'd logically be paired with a different char.

rakiru commented 9 years ago

What? They're paired with themselves.

gdude2002 commented 9 years ago

Yeah, exactly. As I said, they're paired with the same char, so they're easy to support and won't need specific handling. As I said, ( and ) are examples of what I meant :P

rakiru commented 9 years ago

Oh right, I didn't realise we were going to support arbitrary identical surrounding characters.

rakiru commented 9 years ago

On a completely different note, feature request: ability to provide cookies in the config.

gdude2002 commented 9 years ago

Yep, been thinking about that one, it's a possibility. I might also possibly add the ability to assign a cookie jar to a site and have the plugin do a few requests on startup to populate it, but that'll be a much later thing if I do that.

gdude2002 commented 9 years ago

Alright, is this good?

gdude2002 commented 8 years ago

I have /not/ been keeping this up to date. Things left to do (as of my memory):

[ ] Check portscan on redirect in website handler
[ ] Port URL-Tools
[ ] Better API for accessing different parts of the configuration
- [ ] See below

rakiru commented 8 years ago

To add to that last point:

[ ] get_accept_language(domain)
[ ] get_spoofing(domain)
[ ] is_blacklisted(domain)

Edit: check_blacklist(_url, context) already exists. It makes sense to do this on the whole URL (regex-based), but _url should be changed to url. I don't understand the context thing well enough to know if that needs changed too. If context is necessary, perhaps make it optional and fall back to self.config for context["config"].

rakiru commented 8 years ago

Perhaps website.py should use the lazy request and the stuff like portscan protection and spoofing should be moved into there? I can see things wanting to do basically what website.py does already except scrape the page for data instead of grabbing it from the title. This seems like the most sensible way (an alternative being to move the data processing stuff in website.py to its own function and have similar handlers subclass it), but it has a problem in the edge case of a handler wanting a page bigger than max_read_size (although the default will probably be 8 or 16KB, which seems plenty?).

gdude2002 commented 8 years ago

Well, I kinda don't want portscan protection to be in the main handling stuff since someone might want to write a handler to do special things on localhost/local domains, although in that case, it might just be better to have a switch to disable it?

Also I'm sure there are plenty of pages bigger than that.

rakiru commented 8 years ago

For a handler doing things on local domains, I'd assume it'd probably do its own request, but I guess that's a fair point. I guess it depends on the lazy request use case. Leaving it blank for now seems fair, as long as no overly-generic handler uses it without protection.

gdude2002 commented 8 years ago

[ ] Clean up code - Remove "config" from context, make it clear where handlers should be getting urls plugin config from, remove context params where they aren't necessary

rakiru commented 8 years ago

Should probably add a configurable timeout on requests for website handler. That's just passing a timeout argument to get(), and the config option could be connection.timeout.

rakiru commented 8 years ago

Done #75

UltrosBot / Ultros

URLs plugin rewrite #69