Warning/error on invalid/non-ASCII characters in URL

Is your feature request related to a problem? Please describe. We use Unirest in our application, and recently I encountered an issue that was very hard to troubleshoot. We configure a UnirestInstance with a default base URL, and then call individual endpoints on this base URL using calls like unirestInstance.get("/api/...") and unirestInstance.post("/api/...").

All pretty standard, however for one of these invocations we ran into a java.net.UnknownHostException, even though that request was using the same default base URL as some of the other requests, and copying/pasting the host name shown in the exception message into an nslookup command didn't reveal any issues.

Even a breakpoint on InetAddress::getAllByName seemed to show the exact same hostname for both successful requests and the failing request. Only after doing a host.getBytes() and comparing byte contents, I noticed that the failing host name had 3 extra bytes at the end: -30 -128 -117

From a Unirest perspective, these 3 bytes didn't come from the default base URL (as that would have caused other requests to fail as well), but from the endpoint passed in one of the get/post methods. This endpoint URI was likely copied/pasted from a Swagger page, and apparently this resulted in these 3 invisible bytes being present at the start of the endpoint URI. Being before the first slash, apparently these 3 bytes were appended to the host name, eventually causing the UnknownHostException.

Describe the solution you'd like Not sure whether Unirest should simply reject URL's containing invalid/non-ASCII characters (which I guess could break some existing applications; maybe have this configurable and disabled by default), log a warning, ...?

Also, I guess Unirest shouldn't have appended those 3 bytes to the host name; these 3 bytes should have been considered part of the path.

Describe alternatives you've considered None

Additional context N/A

host.getBytes() is not the right thing to call to see what the system would actually call on the network. a URL in java can in fact contain non-ascii characters, but they must be converted to ascii when used on the wire with URI::toASCIIString

So I would expect that if you dug down into Apache that the host it tried to resolve had legit ASCII chars, but then when you get the UnknownHostException it has the "human readable" version. Which I get is confusing, but it's not technically wrong or impossible.

Could unirest try and get even more strict than Java's own URI class? Sure, but we would probably break some legit use. Could we make it a config? Sure but, that adds complexity and all we would end up doing is throwing a different exception from the UnknownHostException you are already getting. So it seems not of much value. I would recommend if this is a concern for your use that you could add an interceptor or pre-processor to validate your own expectations on URIs

Kong / unirest-java

Warning/error on invalid/non-ASCII characters in URL #465