coleifer / micawber

a small library for extracting rich content from urls
http://micawber.readthedocs.org/
MIT License
634 stars 91 forks source link

performance suggestion #73

Closed jvanasco closed 7 years ago

jvanasco commented 7 years ago

I'm considering migrating to micawber from a custom oembed consumer, and wanted to suggest a performance improvement that I am willing to generate a PR for.

I'd like to extend the ProviderRegistry with a secondary internal register that nests providers under domain names.

this would allow users to optionally avoid a regex match against every provider and only test the domain.

some light tests on a quick mockup showed the lookups to run in 30% the time -- including the overhead of parsing the domain name from a url, but about 5% of the time if you have the domain already.

we would be using this on a high volume indexer, so this performance is a need.

coleifer commented 7 years ago

I would suggest subclassing the existing ProviderRegistry and overriding the register and unregister methods in your subclass. You can parse out the hostname using urlparse(url).netloc. The bootstrap_* methods all accept a ProviderRegistry instance so you can provide an instance of your subclass and get everything registered easily.

jvanasco commented 7 years ago

ah, subclassing is a good idea. i'll redo the PR I made.

coleifer commented 7 years ago

I was suggesting that you subclass it in your own application -- I don't plan to merge the optimization into the codebase in the interests of keeping things simple and extensible.

jvanasco commented 7 years ago

That's fair.

Would you be interested at all in this reorg of migrating the providers into a dict? You can strip out the 'domain' key if you want.

https://github.com/coleifer/micawber/compare/master...jvanasco:feature-library

fyi, urlparse(url).netloc is the hostname + port (if present in a url). you can use urlparse(url).hostname to ensure only a hostname. some of the endpoints serve subdomains though, so a library like tldextract would be needed to parse out the actual "registered" domain.