Closed gdude2002 closed 8 years ago
perhaps we should even have catching, parsing and dispatching to handlers as separate modules
What? That doesn't even make sense. Aside from the fact that the whole point of the plugin is to catch and parse URLs, whatever part does this will have to dispatch it in some sense wherever else would do the "dispatching".
Well hey, this is why we have tickets.
Yeah, I guess that makes sense.
Okay, so I've been thinking about this, while we can easily support matching chars before and after the URL, what paired chars do we need to support?
As an example, here's what I came up with.
()
[]
{}
<>
Additionally, does this seem sane, @rakiru?
class Handler(object):
"""
URL handler. Subclass this!
You'll want to override both the `call` method and the *criteria* dict.
The former is called if the criteria matches the URL.
In the criteria dict, you're expected to provide values to test equality
for. However, there are a few things to be aware of.
* Leave a key out if you don't care about matching it - None will be
matched against.
* You can use plugins.urls.utils.notnone.NotNone when you only want to be
sure something exists.
* You may provide a compiled regular expression to test against as well.
* Finally, you may provide a callable (function or class), which will be
run for the comparison instead, and should return either True or False.
>>> criteria = {
... # No protocol here, since we don't care about it
... "auth": NotNone,
... "domain": re.compile("[a-zA-Z]+"),
... "port": lambda x: x > 8080,
... "path": lambda x: len(x) > 10
... }
>>>
"""
plugin = None
criteria = {
"protocol": None,
"auth": None,
"domain": None,
"port": None,
"path": None
}
def __init__(self, plugin):
self.plugin = plugin
def call(self, url):
raise NotImplementedError()
"
and '
would be wise to support too.
NotNone does not sound sane, no. I'm not sure the use-case for needing any-value-but-None is great enough to warrant not simply using lambda x: x is not None
. The rest of it sounds good though, although I'm wondering if there will ever be a case where something relies on more than one criteria in a linked way.
Yeah, that's fair actually, can just use a lambda.
Also, "
and '
aren't paired, are they? I meant stuff we'd need to transform as they'd logically be paired with a different char.
What? They're paired with themselves.
Yeah, exactly. As I said, they're paired with the same char, so they're easy to support and won't need specific handling. As I said, (
and )
are examples of what I meant :P
Oh right, I didn't realise we were going to support arbitrary identical surrounding characters.
On a completely different note, feature request: ability to provide cookies in the config.
Yep, been thinking about that one, it's a possibility. I might also possibly add the ability to assign a cookie jar to a site and have the plugin do a few requests on startup to populate it, but that'll be a much later thing if I do that.
I have /not/ been keeping this up to date. Things left to do (as of my memory):
To add to that last point:
get_accept_language(domain)
get_spoofing(domain)
is_blacklisted(domain)
Edit: check_blacklist(_url, context)
already exists. It makes sense to do this on the whole URL (regex-based), but _url
should be changed to url
. I don't understand the context thing well enough to know if that needs changed too. If context
is necessary, perhaps make it optional and fall back to self.config
for context["config"]
.
Perhaps website.py should use the lazy request and the stuff like portscan protection and spoofing should be moved into there? I can see things wanting to do basically what website.py does already except scrape the page for data instead of grabbing it from the title. This seems like the most sensible way (an alternative being to move the data processing stuff in website.py to its own function and have similar handlers subclass it), but it has a problem in the edge case of a handler wanting a page bigger than max_read_size
(although the default will probably be 8 or 16KB, which seems plenty?).
Well, I kinda don't want portscan protection to be in the main handling stuff since someone might want to write a handler to do special things on localhost/local domains, although in that case, it might just be better to have a switch to disable it?
Also I'm sure there are plenty of pages bigger than that.
For a handler doing things on local domains, I'd assume it'd probably do its own request, but I guess that's a fair point. I guess it depends on the lazy request use case. Leaving it blank for now seems fair, as long as no overly-generic handler uses it without protection.
"config"
from context
, make it clear where handlers should be getting urls plugin config from, remove context
params where they aren't necessaryShould probably add a configurable timeout on requests for website handler. That's just passing a timeout
argument to get()
, and the config option could be connection.timeout
.
Done #75
This ticket goes hand-in-hand with #62 and #61 - The URLs plugin needs rewritten.
I've already started this, in the URLs branch. URLs needs to be made cleaner, more modular, and more efficient, and there are several steps to get to that stage.