altmetric / embiggen

A Ruby library to expand shortened URLs
https://rubygems.org/gems/embiggen
MIT License
124 stars 6 forks source link

Is it possible to detect malformed URLs? #22

Closed krngrvr09 closed 5 years ago

krngrvr09 commented 8 years ago

Embiggen::URI("http://t.").expand returns http://t/. I think it should give an error BadShortenedURI.

krngrvr09 commented 8 years ago

The official syntax for a URL says that http://t. is a valid url.

So it is not a malformed URL. The reason it returns that value is that Embiggen::URI("http://t.").shortened? returns false after checking if http://t/ is present in Configuration.shorteners.

One solution is to add

Embiggen.configure do |config|
    config.shorteners += %w(t)
end

But this would not catch other similar errors, in the future. For example, I just found that I am getting http:/ in my list of urls and that too gets expanded as http:/

mudge commented 8 years ago

Hi @krngrvr09,

Yeah, as you say, those are valid URIs according to Addressable:

Addressable::URI.parse('http://t.').normalize
=> #<Addressable::URI:0x3fe4a32c7924 URI:http://t/>
Addressable::URI.parse('http:/')
=> #<Addressable::URI:0x3fe4a32892f0 URI:http:/>

Would it be possible to validate URIs according to your criteria before passing them to Embiggen? e.g.

potential_uris.each do |potential_uri|
  uri = URI(potential_uri)
  next unless uri.host && uri.host.split('.').length > 1

  expanded_uri = Embiggen::URI(uri).expand

  # ...
end

I'm unsure about adding such logic to Embiggen itself as you may legitimately want to expand links with unusual but valid URIs (e.g. if talking to an internal DNS).

krngrvr09 commented 8 years ago

Yeah, I agree. It does not make sense to add that logic to the library itself.

I was wondering why have you taken the approach where you check the given url against a list of short domains. Can't we just follow the redirects, and the final URL will be the expanded URL. Like implemented here

mudge commented 8 years ago

It's a good question: the reason we pre-filter URLs before attempting to expand them is down to the volume of URLs we process (e.g. we consume the Twitter firehose).

If you want to change this behaviour, you could try a different config.shorteners that returns true for all URLs as given in the README:

Embiggen.configure do |config|
  config.shorteners = Module.new do
    def self.include?(_)
      true
    end
  end
end