NET-A-PORTER / scala-uri

Simple scala library for building and parsing URIs

Support for tld #108

Closed dkjhanitt closed 8 years ago

dkjhanitt commented 8 years ago

Hi, I'm looking for a utility to extract the TLD from a URL. For example, for url = https://www.google.co.uk, the TLD should be uk and the SLD should be co.uk. Is there a way to achieve this?

theon commented 8 years ago

Hi @dkjhanitt,

I think this is a good candidate for an enhancement to scala-uri. I have added a first attempt and published it as version 0.4.12-SNAPSHOT for you to try out.

It works by using the list of public suffixes from publicsuffix.org as there is no way to do this algorithmically.
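To illustrate why a list is needed, here is a minimal sketch of a list-based lookup. It is not scala-uri's implementation: the object name, the method names and the three-entry suffix set are all made up for the example, and the wildcard and exception rules of the real public suffix list are ignored.

```scala
object PublicSuffixSketch {
  // Tiny hand-written subset for illustration only (assumption);
  // the real list from publicsuffix.org has thousands of entries.
  val suffixes: Set[String] = Set("com", "uk", "co.uk")

  /** All known public suffixes that match the end of `host`, longest first. */
  def publicSuffixes(host: String): List[String] = {
    val labels = host.toLowerCase.split('.').toList
    // `tails` yields every trailing run of labels, e.g. a.b.c, b.c, c
    labels.tails
      .filter(_.nonEmpty)
      .map(_.mkString("."))
      .filter(suffixes.contains)
      .toList
  }

  /** The longest matching public suffix, if any. */
  def publicSuffix(host: String): Option[String] =
    publicSuffixes(host).headOption
}
```

Nothing in the string "www.google.co.uk" itself says that co.uk is a registry-controlled suffix while google.com is not; only the contents of the list decide that, which is why the lookup cannot be purely algorithmic.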

Let me know if 0.4.12-SNAPSHOT works for you. There are a few things I'd like to do before cutting a proper release:

theon commented 8 years ago

To use the SNAPSHOT version, you will probably have to add a resolver to your SBT build. See: https://github.com/NET-A-PORTER/scala-uri#latest-snapshot-builds

theon commented 8 years ago

For usage see here: http://github.com/NET-A-PORTER/scala-uri#public-suffixes

dkjhanitt commented 8 years ago

Hi @theon, thanks for the response. I tried to use the

    "com.netaporter" %% "scala-uri" % "0.4.12-SNAPSHOT"

version, but I am running into the following error:

    Exception in thread "main" java.io.FileNotFoundException: src/main/resources/public_suffix_trie.json (No such file or directory)

Alternatively, I came across Google Guava, which also takes its data from publicsuffix.org.

build.sbt dependency:

    "com.google.guava" % "guava" % "16.0"

Usage:

    import com.google.common.net.InternetDomainName

    // Host names to inspect (not full URLs)
    val host = "mail.google.com"
    val host1 = "mail.google.co.uk"

    val id = InternetDomainName.from(host)
    val id1 = InternetDomainName.from(host1)

    // Print the registrable domain, the individual labels and the public suffix
    println(id.topPrivateDomain(), id.parts(), id.publicSuffix())
    println(id1.topPrivateDomain(), id1.parts(), id1.publicSuffix())

Which prints:

    (google.com, [mail, google, com], com)
    (google.co.uk, [mail, google, co, uk], co.uk)

dkjhanitt commented 8 years ago

Hi @theon, I created a pull request. Please review it and see if you can merge it into master: https://github.com/NET-A-PORTER/scala-uri/pull/109

theon commented 8 years ago

Hi @dkjhanitt,

Thanks for getting back. The FileNotFoundException should be fixed now for 0.4.12-SNAPSHOT, sorry about that. I'll comment on the PR over there.

theon commented 8 years ago

I ran some benchmarks and am happy with the runtime characteristics. The ScalaMeter tests come out at about 10 nanoseconds per uri.publicSuffix call. A crappy homemade benchmark comes out at 0 nanoseconds, probably because the call takes less than the resolution of System.nanoTime. Calling uri.publicSuffix to get a .com suffix should result in five .get() calls to five small maps (about 0-30 items), so I guess 10 nanoseconds sounds about right?
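For readers unfamiliar with the approach, a trie lookup built from nested maps can be sketched as below. This is an illustrative assumption, not the data structure scala-uri ships: the `Node` type, the label-per-level layout and all names are hypothetical, and the real trie may be keyed differently (the five `.get()` calls mentioned above suggest it is).

```scala
object SuffixTrieSketch {
  // Each node maps the next label (reading the host right to left) to a child.
  // `isSuffix` marks nodes where a complete public suffix ends.
  final case class Node(children: Map[String, Node] = Map.empty,
                        isSuffix: Boolean = false)

  // Insert one suffix, given as its labels in right-to-left order.
  def insert(node: Node, labels: List[String]): Node = labels match {
    case Nil => node.copy(isSuffix = true)
    case label :: rest =>
      val child = node.children.getOrElse(label, Node())
      node.copy(children = node.children.updated(label, insert(child, rest)))
  }

  def build(suffixes: Seq[String]): Node =
    suffixes.foldLeft(Node()) { (root, s) =>
      insert(root, s.split('.').reverse.toList)
    }

  /** Longest suffix stored in the trie that matches the end of `host`. */
  def longestSuffix(root: Node, host: String): Option[String] = {
    @annotation.tailrec
    def walk(node: Node, remaining: List[String],
             taken: List[String], best: Option[String]): Option[String] =
      remaining match {
        case label :: rest =>
          node.children.get(label) match { // one small map lookup per label
            case Some(child) =>
              val path = label :: taken
              val newBest = if (child.isSuffix) Some(path.mkString(".")) else best
              walk(child, rest, path, newBest)
            case None => best
          }
        case Nil => best
      }
    walk(root, host.toLowerCase.split('.').reverse.toList, Nil, None)
  }
}
```

With a layout like this, a lookup costs one map access per matching label plus one miss, so a handful of lookups in small maps per call is plausible.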

Memory-wise, the trie takes about 1.7MB of heap, which isn't great, but that memory should only be consumed for users who call .publicSuffix, so existing users should be unaffected. We can look at options to reduce memory usage if it becomes an issue for anyone.

[Image: scala-uri public suffixes heap usage]

Based on this, I will cut version 0.4.12 this evening.

theon commented 8 years ago

0.4.12 has been released with this change.