cmunk / utopiaintel

Sample scripts to parse pages from utopia-game.com

Apparently invalid URL encoding for mail, war forums, and kingdom forums #5

Open Volcanon- opened 7 years ago

Volcanon- commented 7 years ago

I haven't dug too deeply, but this code:

import java.net.URLDecoder;
URLDecoder.decode(newIntel, "UTF-8");

Produces this error:

java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "u2"

I have to assume that the Java library's URLDecoder does the job correctly, unless perhaps the input isn't UTF-8?

cwm22 commented 7 years ago

Did you ever get this working? I am in the same boat.

Volcanon- commented 7 years ago

Nope. I don't ingest any of them, since the in-game versions seem to work well enough. I just wrapped the call in a try/catch and skipped these pages.

cwm22 commented 7 years ago

I actually have this problem on every page when using the JavaScript decodeURI functions, so I used the Node.js querystring module instead.

Volcanon- commented 7 years ago

Oh that's interesting.

The querystring module worked?

Here's my implementation: https://bitbucket.org/fredrik_yttergren/lucidbot/commits/d6e4f32b63cb296616fc7a674bfde0b07c86ba04#chg-Utopia-WS/src/web/resources/UtopiaInGameParserResource.java

cwm22 commented 7 years ago

I have not tried the forums yet, but it worked like a charm on other pages.

Volcanon- commented 6 years ago

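This issue was caused by characters that don't fit in a single-byte %XX escape. Essentially, when the encoder hits one of those, it encodes it as %uXXXX, which is not valid percent-encoding, so URLDecoder tries to read "u2" as a hex pair and fails (hence the error above).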

REF: https://msdn.microsoft.com/en-us/library/h3607h29(v=vs.84).aspx
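To illustrate, here is a minimal reproduction sketch. The class name `PercentUReproduction`, the helper `isRejected`, and both sample strings are hypothetical; the original intel string is not shown in this thread, so `it%u2019s` only stands in for input containing a %uXXXX sequence.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper class, only meant to reproduce the failure.
final class PercentUReproduction {

    /** Returns true if URLDecoder rejects the input with an IllegalArgumentException. */
    static boolean isRejected(String encoded) {
        try {
            URLDecoder.decode(encoded, "UTF-8");
            return false;
        } catch (IllegalArgumentException e) {
            // Thrown when a '%' is not followed by two hex digits.
            return true;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Standard UTF-8 percent-encoding of U+2019 decodes fine.
        System.out.println(isRejected("it%E2%80%99s")); // prints false
        // The %uXXXX form of the same character is rejected.
        System.out.println(isRejected("it%u2019s"));    // prints true
    }
}
```

The exact wording of the exception message differs between JDK versions, so the sketch only checks that the exception is thrown.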

I implemented a fix on my side, though it's a bit hacky:

https://bitbucket.org/fredrik_yttergren/lucidbot/commits/acbff649d5b660e2d76e007e01f073ad9442bc3f#LUtopia-WS/src/web/resources/UtopiaInGameParserResource.javaT60
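For anyone who lands here later, one possible way to handle the %uXXXX sequences described above is sketched below. This is not the code from the linked commit. The class name `PercentUDecoder` and its methods are hypothetical, and the sketch assumes every %u is followed by exactly four hex digits holding one UTF-16 code unit. It decodes the %uXXXX escapes itself and hands only the text between them to URLDecoder, so a decoded character such as `%` or `+` is never decoded a second time.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a sketch under the assumptions stated above.
final class PercentUDecoder {

    // Matches the non-standard %uXXXX form (exactly four hex digits).
    private static final Pattern PERCENT_U = Pattern.compile("%u([0-9A-Fa-f]{4})");

    /** Decodes a string that mixes standard %XX escapes with %uXXXX escapes. */
    static String decode(String encoded) {
        Matcher m = PERCENT_U.matcher(encoded);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Text before the match contains only standard escapes.
            out.append(decodeStandard(encoded.substring(last, m.start())));
            // Each %uXXXX is one UTF-16 code unit; adjacent surrogates pair up naturally.
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(decodeStandard(encoded.substring(last)));
        return out.toString();
    }

    private static String decodeStandard(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String decoded = decode("it%u2019s%20here");
        System.out.println(decoded.equals("it\u2019s here")); // prints true
    }
}
```

Note that URLDecoder also turns a literal `+` into a space. If the pages contain unescaped plus signs that should stay as-is, the standard segments would need a different decoder.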