URL-Detector / URL-Detector

A Java library to detect and normalize URLs in text

URL normalization mangles some URLs #10

Open redsk opened 4 years ago

redsk commented 4 years ago

Nowadays, some URLs use emojis (e.g. - careful, NSFW! - 💄💃💏.ga). While the library detects them correctly, normalization nullifies them, resulting in http://null/.

redsk commented 4 years ago

The problem here is that HostNormalizer uses java.net.IDN and java.net.IDN.toASCII(u) fails with

java.lang.IllegalArgumentException: java.text.ParseException: An unassigned code point was found in the input

as it only supports IDNA2003. For emoji, IDNA2008 is needed, and the icu4j library [1] supports it:

import com.ibm.icu.text.IDNA
val uts46 = IDNA.getUTS46Instance(IDNA.DEFAULT)

val u = "i❤.ws/" // this is safe for work :)
val punycodedDomain = uts46.nameToASCII(u, new java.lang.StringBuilder(), new IDNA.Info()).toString
// punycodedDomain == "xn--i-7iq.ws/"

Would you be interested in a PR that uses icu4j?

[1] https://mvnrepository.com/artifact/com.ibm.icu/icu4j
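
For reference, the failure can be reproduced with the JDK alone. The sketch below is a minimal illustration, not the library's HostNormalizer code; the class name Idna2003Probe and the helper toAsciiOrEmpty are made up for this example. It wraps java.net.IDN.toASCII and returns an empty result when the IDNA2003 implementation rejects the host.

```java
import java.net.IDN;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of URL-Detector.
final class Idna2003Probe {

    /** Returns the ASCII form of the host, or empty if java.net.IDN rejects it. */
    static Optional<String> toAsciiOrEmpty(String host) {
        try {
            // java.net.IDN implements IDNA2003 and, without the ALLOW_UNASSIGNED flag,
            // rejects code points that are unassigned in its Unicode 3.2 tables.
            return Optional.of(IDN.toASCII(host));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // U+1F484 U+1F483 U+1F48F, written as surrogate pairs to avoid source-encoding issues
        String emojiHost = "\uD83D\uDC84\uD83D\uDC83\uD83D\uDC8F.ga";
        System.out.println(toAsciiOrEmpty("example.com")); // plain ASCII passes through
        System.out.println(toAsciiOrEmpty(emojiHost));     // rejected: emoji are unassigned in IDNA2003
    }
}
```

A normalizer that swallows this exception without a fallback would end up with no host at all, which would be consistent with the http://null/ output reported above. That link is an assumption; the thread does not show the HostNormalizer code.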

cohendekel commented 4 years ago

@redsk can you please share your thoughts about the PR for fixing the issue? #11

redsk commented 4 years ago

Sure, I only noticed it after posting my last comment.

pgalbraith commented 4 years ago

@redsk can you confirm if this is fixed in the 1.0.0-SNAPSHOT build?

redsk commented 4 years ago

@pgalbraith can you provide a link to that build? AFAIK, URLs with emojis should be in the tests now. That said, the problem is more complex, as detailed here. I believe the final solution is in #12

pgalbraith commented 4 years ago

@redsk there is a snapshot build at https://oss.sonatype.org/service/local/repositories/snapshots/content/io/github/url-detector/url-detector/0.1.23-SNAPSHOT/url-detector-0.1.23-20200404.223402-1.jar

pgalbraith commented 4 years ago

Sorry, wrong link ... the new build is https://oss.sonatype.org/service/local/repositories/snapshots/content/io/github/url-detector/url-detector/1.0.0-SNAPSHOT/url-detector-1.0.0-20200527.164700-1.jar

redsk commented 4 years ago

@pgalbraith I can confirm that URLs with emojis are normalized correctly, namely:

💄💃💏.ga -> xn--ir8hb8a.ga

However, it still does not work for www.rößner.de, which is not normalized to the correct, non-transitional www.xn--rner-vna1l.de but to the incorrect, transitional "http://www.xn--rssner-wxa.de/".
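
The transitional form comes from mapping ß to ss before Punycode encoding, which is also what IDNA2003 does. As a minimal sketch, the JDK's IDNA2003-only java.net.IDN reproduces that mapping; the class and method names below are hypothetical, and this is not the library's code.

```java
import java.net.IDN;

// Hypothetical demo class for illustration only; not part of URL-Detector.
final class SharpSMappingSketch {

    /** Converts a host name to ASCII using the JDK's IDNA2003 implementation. */
    static String idna2003ToAscii(String host) {
        return IDN.toASCII(host);
    }

    public static void main(String[] args) {
        // "www.rößner.de" with ö (U+00F6) and ß (U+00DF) written as escapes
        String host = "www.r\u00F6\u00DFner.de";
        // IDNA2003 folds ß to "ss" before encoding, so the ß is lost in the result.
        System.out.println(idna2003ToAscii(host)); // prints www.xn--rssner-wxa.de
    }
}
```

With icu4j, non-transitional processing is requested through option flags such as IDNA.NONTRANSITIONAL_TO_ASCII passed to IDNA.getUTS46Instance instead of IDNA.DEFAULT. Whether URL-Detector does so is not shown in this thread.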