Closed AvnerCohen closed 8 years ago
Is there a risk here of incorrectly classifying links to www.bit.ly, etc. as unshortened, since they don't exactly match the host in the whitelist?
I don't think so. See here, for example: I just tacked a www subdomain onto a bit.ly URL:

```
$ curl -I -s "http://bit.ly/1PKenll" | grep Location
Location: http://www.israbirding.com/

$ curl -I -s "http://www.bit.ly/1PKenll" | grep Location
Location: http://bit.ly/1PKenll
```
But regardless, I don't see a risk here: shorteners are trying to make things short, so an unused subdomain that could be omitted seems wasteful. If it's there, it's a mistake that can safely be ignored.

That's my view. It could also be implemented behind a flag for a "fuzzy" mode, but I doubt the added complexity is worth it.
That's fair, and shorteners do tell you exactly which URL to use, so someone using www.t.co or www.bit.ly can be argued to be incorrect.
Discussing this with @mattmacleod, he suggested that the cost might be in the creation of the `Regexp`s. If instead we create them once (rather than on every call to `include?`), we see similar speed-ups.

Can you check whether the following is faster for you?
```ruby
require 'forwardable'
require 'set'

module Embiggen
  class ShortenerList
    extend Forwardable
    include Enumerable

    attr_reader :domains

    def initialize(domains)
      @domains = Set.new(domains.map { |domain| host_pattern(domain) })
    end

    def include?(uri)
      domains.any? { |domain| uri.host =~ domain }
    end

    def +(other)
      self.class.new(domains + other)
    end

    def <<(domain)
      domains << host_pattern(domain)
      self
    end

    def delete(domain)
      domains.delete(host_pattern(domain))
    end

    def_delegators :domains, :size, :empty?, :each

    def host_pattern(domain)
      # Escape the domain so "." matches a literal dot, not any character
      /\b#{Regexp.escape(domain)}\z/i
    end
  end
end
```
Ideally this gives us both a speed-up and means we can add this spec:
```ruby
it 'returns true if a URL host without a subdomain is on the whitelist' do
  list = described_class.new(%w(bit.ly))

  expect(list).to include(URI('http://www.bit.ly/foo'))
end
```
@mudge Using this approach works, and definitely helps speed things up:
```
$ for i in {1..5}; do bundle exec rspec | grep Finished; done | cut -d" " -f 3
0.0607
0.06928
0.06216
0.06402
0.06059
```
That being said, we still have an O(n) solution, looping through all entries rather than taking advantage of the Set's constant-time lookup.

I also think it adds some complexity to the code for an edge case that could perhaps be fixed simply by stripping subdomains from the input URI.

But I guess it's really not critical :) If you want me to go ahead with that approach, let me know and I can complete it end to end.
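(As a sketch of that point, and purely hypothetical for this PR: storing the whitelist as plain strings in a `Set` and probing the host plus each of its parent-domain suffixes gives k set lookups per call, where k is the number of labels in the host, instead of scanning every pattern. The `SuffixList` name and shape below are illustrative, not from the PR.)

```ruby
require 'set'

# Hypothetical alternative, not part of the PR: keep the whitelist as
# plain strings and probe the host plus each parent-domain suffix.
class SuffixList
  def initialize(domains)
    @domains = Set.new(domains.map(&:downcase))
  end

  # For "www.bit.ly" this probes "www.bit.ly", then "bit.ly", then "ly".
  def include?(host)
    labels = host.downcase.split('.')
    (0...labels.length).any? { |i| @domains.include?(labels[i..-1].join('.')) }
  end
end

list = SuffixList.new(%w(bit.ly t.co))
p list.include?('www.bit.ly')      # => true
p list.include?('israbirding.com') # => false
```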
I think it's a worthwhile first step to speed things up a bit without changing behaviour (though you're right that we're not changing the complexity of the lookup).
Re stripping subdomains: I'm not sure how we could do that reliably, as URLs can have multiple subdomains (e.g. `www.foo.bar.com`) as well as multiple parts to their TLD (e.g. `.co.uk`), so it's not as simple as `domain.split('.').last(2)` or `domain.split('.').drop(1)`.
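(To make that concrete: any fixed split falls over on multi-part TLDs, which is why a real fix would need something like the Public Suffix List. The toy sketch below is illustrative only; `TWO_PART_TLDS` is a tiny made-up sample, nowhere near complete.)

```ruby
require 'set'

# Toy illustration only: a real implementation would consult the full
# Public Suffix List rather than a hand-picked sample of two-part TLDs.
TWO_PART_TLDS = Set.new(%w(co.uk com.au co.jp))

def registrable_domain(host)
  labels = host.downcase.split('.')
  tld_labels = TWO_PART_TLDS.include?(labels.last(2).join('.')) ? 2 : 1
  labels.last(tld_labels + 1).join('.')
end

p registrable_domain('www.foo.bar.com')   # => "bar.com"
p registrable_domain('www.example.co.uk') # => "example.co.uk"
# A naive host.split('.').last(2) would have returned "co.uk" here.
```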
@mudge
There, I said it all :)
Regardless, I've updated the PR with your code... which feels weird... :)
You make a compelling case (particularly re tinyurl.com using different subdomains for entirely different functionality). That said, I think it's worth merging this to improve the performance for the current behaviour and release it as a minor upgrade.
Then we can change the behaviour to no longer include arbitrary subdomains in a separate release (as it might be considered a breaking change).
(Before/after benchmark screenshots omitted.)
All tests pass. Cases such as "nobitly.com" or "bitl.c" are not matched, because the pattern anchors the whitelisted domain at a word boundary and at the end of the host.
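To see the matching behaviour concretely, here is a small sketch of the host pattern in the style of `host_pattern` above (the domain is `Regexp`-escaped here so the dot matches literally):

```ruby
# Word boundary before the domain, anchored to the end of the host.
pattern = /\b#{Regexp.escape('bit.ly')}\z/i

p 'bit.ly'.match?(pattern)      # => true
p 'www.bit.ly'.match?(pattern)  # => true  ("." before "bit" is a word boundary)
p 'nobitly.com'.match?(pattern) # => false (host does not end in "bit.ly")
p 'bitl.c'.match?(pattern)      # => false
p 'ambit.ly'.match?(pattern)    # => false (no word boundary inside "ambit")
```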