Possible improvements to linker

pythonesque commented 10 years ago

I noticed that there are some things about linker (the user hook into the bbcode parser that activates on link detection) that could be made a bit nicer. The biggest one is:

Give linker access to the context like render() has. Without this I'm not sure there's an idiomatic way of getting this information to the linker, so I am getting around it by overriding large swathes of the Parser class.

It would be convenient (and quite easy, I suspect) to make the following changes, but I can work around all of them by overriding the URL formatter so they're not critical:

linker is used by default in _render_url, not just _link_replace (there are only very slight differences between them as it is).
Allow different schemes than http:// for link replacement (https:// is the obvious one that comes to mind).
For _domain_re, allow a custom list of matchable domains (there are probably clever ICANN related things you could do to get and cache an official list but that's probably a library in itself).

From a perf standpoint:

Have you tested whether it's faster to not capture the https?:// the first time around (rather than using a noncapturing group, or doing a full regex replace)? I.e., change _url_re to

r'(?im)\b((?:((https?://)|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\([^\s()<>]+\))+(?:\([^\s()<>]+\)|[^\s`!()\[\]{};:\'".,<>?]))

and then use the new capture group to decide whether it was necessary to check for prepending http(s)? Obviously it's very dependent on the regex implementation but this technique wouldn't require rescanning the URL again, and would also make the (I suspect) common task of matching all links to a given domain faster.

(Also, I feel like domain_re would be, if not faster, at least more correct if it was explicitly anchored to the beginning of the string with ^--any reason it isn't?)

dcwatson commented 10 years ago

Thanks for the suggestions. It seems most of them could be done by just implementing your own [url] renderer that calls the same linker function you provide, and does whatever domain matching and http/https replacement you want. Not that adding customization hooks into the default renderer is a bad idea.

I think passing the context to a linker function is a good idea, but will need to think about backwards compatibility at this point. Trying to call self.linker(url, context) will break existing linkers that just take the url (and no args/*kwargs).

As for performance, I haven't done any serious benchmarking. Are you experiencing performance problems, or do you have any numbers that indicate your suggestions would be an improvement? For my own usage patterns, shaving a few microseconds off a regular expression match won't matter, but I'd be happy to accept patches that are demonstrably faster and pass the tests.

pythonesque commented 10 years ago

Yeah, most of those suggestions are just convenience things, no big deal not to add them :)

As far as context goes, I understand where backwards compatibility would be a concern. I was hoping that since it was relatively undocumented (at least, it's not explicitly described on readthedocs, even though it's commented in the source code) that the interface wasn't stable yet. I just really can't think of any other (thread-safe) way to actually pass in context :( Creating a v2 linker or doing some sort of introspection just seems ugly. Maybe make it a module-level option whether context gets passed in or not, a la the thread safe extensions to yacc?

I'll try benchmarking and see what I come up with. Based on the limited tests I've run your module is already pretty optimized, so it may not make much of a difference.

dcwatson commented 10 years ago

I'm wondering if some simple introspection would be that bad. Something like:

if linker.func_code.co_argcount == 2 or linker.func_code.co_flags & 4 != 0: linker(url, context)
elif linker.func_code.co_flags & 8 != 0: linker(url, **context)
else: linker(url)

dcwatson commented 10 years ago

Also, is there a reason you can't get around this by doing something like:

def linker(url, context=None):
    ...
parser = bbcode.Parser(linker=functools.partial(linker, context=whatever))

pythonesque commented 10 years ago

I'll answer your second question first: I can't do that, because the context is different each time I parse (I'm using information from the request header in Django).

First question, your call. I'm generally not a fan of reflection, but it's your project :)

dcwatson commented 10 years ago

This is merged. I'm working on a release, should have something soon-ish. Thanks for the feedback and pull request!

dcwatson commented 10 years ago

FYI, I just pushed out version 1.0.16 to PyPI (and added some docs on this and other things at http://bbcode.readthedocs.org/en/latest/). Sorry for the delay, and thanks for the contribution!

pythonesque commented 10 years ago

Awesome, thanks! On May 5, 2014 7:10 PM, "Dan Watson" notifications@github.com wrote:

FYI, I just pushed out version 1.0.16 to PyPI (and added some docs on this and other things at http://bbcode.readthedocs.org/en/latest/). Sorry for the delay, and thanks for the contribution!

— Reply to this email directly or view it on GitHubhttps://github.com/dcwatson/bbcode/issues/11#issuecomment-42261032 .

dcwatson / bbcode

Possible improvements to linker #11