altmetric / embiggen

A Ruby library to expand shortened URLs
https://rubygems.org/gems/embiggen
MIT License

[FIX] Set up the regexp to match the domain name once, a ~50% speed-up #16

Closed AvnerCohen closed 8 years ago

AvnerCohen commented 8 years ago

Before:

› for i in {1..5}; do bundle exec rspec | grep Finished ; done | cut -d" " -f 3
0.14006
0.13448
0.15444
0.16009
0.1727

After:

› for i in {1..5}; do bundle exec rspec | grep Finished ; done | cut -d" " -f 3
0.06168
0.05892
0.06377
0.06042
0.06269

All tests seem to pass. Cases such as "nobitly.com" or "bitl.c" are no longer matched, because the check is now an exact match against the URL list.
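
In other words, the check becomes a single Set membership test per URL. A minimal sketch of the idea (assuming a plain whitelist of host strings; not the actual diff):

require 'set'
require 'uri'

# Exact host lookup: one O(1) Set membership test per URL,
# with no Regexp compiled or scanned per call.
DOMAINS = Set.new(%w(bit.ly t.co tinyurl.com))

def shortened?(uri)
  DOMAINS.include?(uri.host)
end

shortened?(URI('http://bit.ly/1PKenll'))  # => true
shortened?(URI('http://nobitly.com/foo')) # => false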

mudge commented 8 years ago

Is there a risk here of incorrectly classifying links to www.bit.ly, etc. as unshortened, since they no longer exactly match a host in the whitelist?

AvnerCohen commented 8 years ago

I don't think so. See here, for example: I just tacked a www subdomain onto a bit.ly URL:

› curl -I -s "http://bit.ly/1PKenll" | grep Location
Location: http://www.israbirding.com/

ruby-2.1.2 in embiggen/ on use_set_check
› curl -I -s "http://www.bit.ly/1PKenll" | grep Location
Location: http://bit.ly/1PKenll

But regardless, I don't see a risk here. Shorteners are trying to make things short, so an unused subdomain that could be dropped seems wasteful; if it's there, it's a mistake that can safely be ignored.

That's my view. It could also be implemented behind a flag for a "fuzzy" mode, but I doubt the added complexity is worth it.

mudge commented 8 years ago

That's fair, and shorteners do tell you exactly which URL to use, so someone using www.t.co or www.bit.ly can be argued to be incorrect.

mudge commented 8 years ago

Discussing this with @mattmacleod, he suggested that the cost might be in the creation of the Regexps. If we instead do this once (rather than on every call to include?), we see similar speed-ups.

Can you see if the following is faster for you?

require 'forwardable'
require 'set'

module Embiggen
  class ShortenerList
    extend Forwardable
    include Enumerable

    attr_reader :domains

    # Compile every host pattern once, up front, instead of building a
    # Regexp on every call to include?
    def initialize(domains)
      @domains = Set.new(domains.map { |domain| host_pattern(domain) })
    end

    def include?(uri)
      domains.any? { |domain| uri.host =~ domain }
    end

    def +(other)
      self.class.new(domains + other)
    end

    def <<(domain)
      domains << host_pattern(domain)

      self
    end

    def delete(domain)
      domains.delete(host_pattern(domain))
    end

    def_delegators :domains, :size, :empty?, :each

    def host_pattern(domain)
      # Pass compiled patterns through untouched (+ re-initializes the list
      # with a mixture of Regexps and new domain strings) and escape literal
      # dots so that bit.ly does not also match e.g. bitxly.
      return domain if domain.is_a?(Regexp)

      /\b#{Regexp.escape(domain)}\z/i
    end
  end
end

Ideally this both gives us a speed-up and means we can add this spec:

it 'returns true if a URL host without a subdomain is on the whitelist' do
  list = described_class.new(%w(bit.ly))

  expect(list).to include(URI('http://www.bit.ly/foo'))
end

AvnerCohen commented 8 years ago

@mudge Using this approach works, and definitely helps speed things up:

› for i in {1..5}; do bundle exec rspec | grep Finished ; done | cut -d" " -f 3
0.0607
0.06928
0.06216
0.06402
0.06059

That being said, this is still an O(N) solution: we loop through all the entries and don't really take advantage of the Set.

I also think it adds some complexity to the code for an edge case that could perhaps be fixed simply by stripping subdomains from the input URI?

But I guess it's not super critical :). If you want me to go ahead with that approach, let me know and I can complete it end to end.

mudge commented 8 years ago

I think it's a worthwhile first step to speed things up a bit without changing behaviour (though you are right that we're not changing the complexity of the lookup).

Re stripping subdomains: I'm not sure how we could do that reliably, as URLs can have multiple subdomains (e.g. www.foo.bar.com) as well as multiple parts to their TLD (e.g. .co.uk), so it's not as simple as domain.split('.').last(2) or domain.split('.').drop(1).
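
One approach that would sidestep TLD parsing entirely (just a sketch, not something I'm proposing here) is to keep exact Set membership but test every dot-separated suffix of the host:

require 'set'
require 'uri'

# Check each dot-separated suffix of the host against the whitelist, so
# no knowledge of TLD structure is needed. Cost is one O(1) Set lookup
# per label instead of one Regexp scan per whitelisted domain.
def shortened?(uri, domains)
  labels = uri.host.split('.')
  labels.each_index.any? { |i| domains.include?(labels[i..-1].join('.')) }
end

domains = Set.new(%w(bit.ly))
shortened?(URI('http://www.bit.ly/foo'), domains) # => true
shortened?(URI('http://nobit.ly/foo'), domains)   # => false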

AvnerCohen commented 8 years ago

@mudge

  1. That's a fair point; however, I am not convinced that domain patterns such as www.* or *.co.* are relevant for URL shorteners. These guys want it as short as possible.
  2. Moreover, the existing "fuzzy" logic is wrong in a way: tinyurl.com, for example, uses preview.tinyurl.com for specific functionality, and maybe there is an app.tinyurl.com; neither is a straight "shortened" URL, but under the existing logic both will be counted as such.
  3. Lastly, if a user of the library wants www.bit.ly to be treated the same as bit.ly, it is super easy to just add it to the file or extend the list using the + operator (see the sketch below).
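
For example, with the ShortenerList shown above (just a sketch; URI comes from requiring 'uri'):

list = Embiggen::ShortenerList.new(%w(bit.ly t.co))
list += %w(www.bit.ly) # + returns a new list with the extra domain

list.include?(URI('http://www.bit.ly/foo')) # => true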

There, I said it all :)

Regardless, I've updated the PR with your code... which feels weird... :)

mudge commented 8 years ago

You make a compelling case (particularly re tinyurl.com using different subdomains for entirely different functionality). That said, I think it's worth merging this to improve the performance for the current behaviour and release it as a minor upgrade.

Then we can change the behaviour to no longer include arbitrary subdomains in a separate release (as it might be considered a breaking change).