Open amnaredo opened 2 years ago
My workaround was to use org.apache.commons.text.StringEscapeUtils (the deprecated version is org.apache.commons.lang3.StringEscapeUtils) to convert the JSON string (the output of the render/write/toString methods) into a string without Unicode escape codes.
Either method, unescapeJava or unescapeJson, worked for my example.
import upickle.default._
import org.apache.commons.text.StringEscapeUtils

object TestJsonEncoding {
  def main(args: Array[String]): Unit = {
    val lang: String = "ES"
    val entityTypeGroupDisplayName: String = "Movilidad eléctrica"
    val entityTypeGroupDisplayNameJson: ujson.Js = ujson.read("""{"PT": "Mobilidade Eléctrica"}""")
    val expectedJSONString: String = """{"PT":"Mobilidade Eléctrica","ES":"Movilidad eléctrica"}"""

    assert(entityTypeGroupDisplayName == "Movilidad eléctrica")

    // Add the new "ES" entry to the existing JSON object
    val langsDict = entityTypeGroupDisplayNameJson.obj
    langsDict.put(lang, entityTypeGroupDisplayName)

    assert(langsDict("ES").str == "Movilidad eléctrica")
    assert(entityTypeGroupDisplayNameJson("ES").str == "Movilidad eléctrica")

    // Rendered output compared after unescaping
    assert(StringEscapeUtils.unescapeJson(write(entityTypeGroupDisplayNameJson, -1)) == expectedJSONString) // THIS WORKS
    assert(StringEscapeUtils.unescapeJson(entityTypeGroupDisplayNameJson.render(-1)) == expectedJSONString) // THIS WORKS
    assert(StringEscapeUtils.unescapeJson(entityTypeGroupDisplayNameJson.toString()) == expectedJSONString) // THIS WORKS

    // Rendered output compared as-is
    assert(write(entityTypeGroupDisplayNameJson, -1) == expectedJSONString) // THIS FAILS
    assert(entityTypeGroupDisplayNameJson.render(-1) == expectedJSONString) // THIS FAILS
    assert(entityTypeGroupDisplayNameJson.toString() == expectedJSONString) // THIS FAILS
  }
}

import upickle.default._
import org.apache.tika.parser.txt.CharsetDetector
import org.apache.commons.text.StringEscapeUtils

object TestJsonEncodingDetected {
  def main(args: Array[String]): Unit = {
    val lang: String = "ES"
    val entityTypeGroupDisplayName: String = "Movilidad eléctrica"
    val entityTypeGroupDisplayNameJson: ujson.Js = ujson.read("""{"PT": "Mobilidade Eléctrica"}""")
    val expectedJSONString: String = """{"PT":"Mobilidade Eléctrica","ES":"Movilidad eléctrica"}"""
    val detector: CharsetDetector = new CharsetDetector()

    println(s"\nentityTypeGroupDisplayName input: ${entityTypeGroupDisplayName}\tencoding: ${detector.setText(entityTypeGroupDisplayName.getBytes()).detect().getName}")
    println(s"\nentityTypeGroupDisplayNameJson before adding new element: ${entityTypeGroupDisplayNameJson.render(-1)} \tencoding: ${detector.setText(entityTypeGroupDisplayNameJson.render(-1).getBytes()).detect().getName}")
    assert(entityTypeGroupDisplayName == "Movilidad eléctrica")

    // Add the new "ES" entry to the existing JSON object
    val langsDict = entityTypeGroupDisplayNameJson.obj
    langsDict.put(lang, entityTypeGroupDisplayName)

    println(s"\nlangsDict('ES').str: ${langsDict("ES").str}\tencoding: ${detector.setText(langsDict("ES").str.getBytes()).detect().getName}")
    assert(langsDict("ES").str == "Movilidad eléctrica")
    println(s"\nentityTypeGroupDisplayNameJson('ES'): ${entityTypeGroupDisplayNameJson("ES").str} \tencoding: ${detector.setText(entityTypeGroupDisplayNameJson("ES").str.getBytes()).detect().getName}")
    assert(entityTypeGroupDisplayNameJson("ES").str == "Movilidad eléctrica")

    // Print each rendering, its detected encoding, and its unescaped form
    println(s"\nentityTypeGroupDisplayNameJson [write method] after adding new element: ${write(entityTypeGroupDisplayNameJson, -1)} \tencoding: ${detector.setText(write(entityTypeGroupDisplayNameJson, -1).getBytes()).detect().getName} \tafter StringEscapeUtils.unescapeJson: ${StringEscapeUtils.unescapeJson(write(entityTypeGroupDisplayNameJson, -1))}")
    println(s"\nentityTypeGroupDisplayNameJson [render method] after adding new element: ${entityTypeGroupDisplayNameJson.render(-1)} \tencoding: ${detector.setText(entityTypeGroupDisplayNameJson.render(-1).getBytes()).detect().getName} \tafter StringEscapeUtils.unescapeJson: ${StringEscapeUtils.unescapeJson(entityTypeGroupDisplayNameJson.render(-1))}")
    println(s"\nentityTypeGroupDisplayNameJson [toString method] after adding new element: ${entityTypeGroupDisplayNameJson.toString()} \tencoding: ${detector.setText(entityTypeGroupDisplayNameJson.toString().getBytes()).detect().getName} \tafter StringEscapeUtils.unescapeJson: ${StringEscapeUtils.unescapeJson(entityTypeGroupDisplayNameJson.toString())}")

    assert(StringEscapeUtils.unescapeJson(write(entityTypeGroupDisplayNameJson, -1)) == expectedJSONString) // THIS WORKS
    assert(StringEscapeUtils.unescapeJson(entityTypeGroupDisplayNameJson.render(-1)) == expectedJSONString) // THIS WORKS
    assert(StringEscapeUtils.unescapeJson(entityTypeGroupDisplayNameJson.toString()) == expectedJSONString) // THIS WORKS
    assert(write(entityTypeGroupDisplayNameJson, -1) == expectedJSONString) // THIS FAILS
    assert(entityTypeGroupDisplayNameJson.render(-1) == expectedJSONString) // THIS FAILS
    assert(entityTypeGroupDisplayNameJson.toString() == expectedJSONString) // THIS FAILS
  }
}

Maven artifacts for Apache Tika (for the detector) and Apache Commons Text:
<!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.17</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.commons/commons-text -->
<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-text</artifactId>
  <version>1.3</version>
</dependency>
Original Author: ricardogaspar2
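One caveat about the workaround above: unescapeJson applied to a whole rendered document also turns \" into a bare quote and \\ into a single backslash, so the result may stop being valid JSON as soon as a string value contains a quote or a backslash. A minimal stdlib-only sketch that decodes only the \uXXXX escapes for non-ASCII characters and leaves every other escape alone is below. The names UnicodeUnescape and unescapeNonAscii are hypothetical helpers for illustration; they are not part of uPickle or Commons Text.

```scala
object UnicodeUnescape {
  /** Decodes backslash-u escapes whose code unit is above 127 and
    * copies every other character or escape sequence through unchanged.
    * Hypothetical helper, not a uPickle or Commons Text API.
    */
  def unescapeNonAscii(json: String): String = {
    val sb = new StringBuilder(json.length)
    var i = 0
    while (i < json.length) {
      val c = json.charAt(i)
      if (c == '\\' && i + 1 < json.length) {
        val next = json.charAt(i + 1)
        if (next == 'u' && i + 5 < json.length) {
          val hex = json.substring(i + 2, i + 6)
          val isHex = hex.forall(ch => Character.digit(ch, 16) >= 0)
          val code = if (isHex) Integer.parseInt(hex, 16) else -1
          if (code > 127) {
            // Non-ASCII code unit: emit the real character
            sb.append(code.toChar)
            i += 6
          } else {
            // ASCII or malformed escape: keep it escaped so the JSON stays valid
            sb.append(c)
            i += 1
          }
        } else {
          // Any other two-character escape (quote, backslash, n, t, ...) is kept verbatim
          sb.append(c).append(next)
          i += 2
        }
      } else {
        sb.append(c)
        i += 1
      }
    }
    sb.toString
  }
}
```

With this, UnicodeUnescape.unescapeNonAscii(json.render(-1)) could be compared against the expected string in the same way as the unescapeJson calls above, while escaped quotes inside values stay escaped.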
Escaped UTF-8 characters are valid JSON. The escaping is currently hardcoded in https://github.com/lihaoyi/upickle/blob/master/ujson/src/ujson/Renderer.scala#L124; feel free to send a PR to soft-code it. Original Author: lihaoyi
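To illustrate what "soft-coding" the escaping could look like, here is a rough sketch of a JSON string escaper with a flag that switches the non-ASCII escaping on or off. It is not the code in Renderer.scala and not a uPickle API; JsonStringEscape, escape and the escapeUnicode parameter are made-up names, and the sketch only assumes that the renderer makes a per-character decision of this kind.

```scala
object JsonStringEscape {
  /** Renders `s` as a quoted JSON string.
    * When `escapeUnicode` is true, characters above '~' are written as
    * backslash-u escapes; when false, they are written as-is.
    * Hypothetical sketch, not the uPickle implementation.
    */
  def escape(s: String, escapeUnicode: Boolean): String = {
    val sb = new StringBuilder("\"")
    s.foreach {
      case '"'  => sb.append("\\\"")
      case '\\' => sb.append("\\\\")
      case '\n' => sb.append("\\n")
      case '\r' => sb.append("\\r")
      case '\t' => sb.append("\\t")
      case '\b' => sb.append("\\b")
      case '\f' => sb.append("\\f")
      case c if c < ' ' || (escapeUnicode && c > '~') =>
        // Control characters are always escaped; non-ASCII only on request
        sb.append('\\').append('u').append(f"${c.toInt}%04x")
      case c => sb.append(c)
    }
    sb.append('"').toString
  }
}
```

Both outputs are valid JSON for the same value: escape("eléctrica", escapeUnicode = true) gives the escaped form discussed in this thread, and escapeUnicode = false keeps the é as a literal character.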
I see. But this is actually bad behaviour, and it is not documented that render/write/toString use a different encoding from the input string (nor that the returned string is UTF-8 escaped). These methods could take another argument, a charset, so that they could return a string in the specified encoding. For now, the least you could do is document it and explain how to work around this issue (feel free to use my workaround). That is, if you want more people to use your library, of course... Original Author: ricardogaspar2
Scala version: 2.12.4 uPickle version: 0.6.5
While using uPickle I found that the contents I was writing/updating to my uJson objects had encoding problems. My input is encoded in UTF-8, and my output string (when I render/write/toString my uJson) is in a different encoding (ISO-8859-1).
My code sample (listed below) takes this JSON:
{"PT": "Mobilidade Eléctrica"}
and tries to add a new entry:
"ES": "Movilidad eléctrica"
The desired output is:
{"PT":"Mobilidade Eléctrica","ES":"Movilidad eléctrica"}
My code sample
In this code sample the last three asserts fail. If you println them you'll see the output is:
Code sample with charset detector
I made another version to debug the problem further and detect the encoding using the Apache Tika CharsetDetector:
NOTE: for this latter version you need to include the JAR file from Apache Tika Parsers. I leave here the Maven repo I used:
ID: 225 Original Author: ricardogaspar2
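On the ISO-8859-1 observation in this report: once non-ASCII characters are written as \uXXXX escapes, the rendered string contains only ASCII characters, so its bytes are identical under UTF-8 and ISO-8859-1 and a byte-based detector has nothing left to tell the two apart. Whether Tika's CharsetDetector then reports ISO-8859-1 for pure-ASCII input is an assumption here, not something verified. A small sketch of the byte-level point, using only the JDK (AsciiBytesDemo and sameBytes are illustrative names):

```scala
import java.nio.charset.StandardCharsets

object AsciiBytesDemo {
  // The escaped form as it would appear in rendered output (built by
  // concatenation so the source file itself contains no unicode escape)
  val escaped: String = "Movilidad el" + "\\" + "u00e9ctrica"

  // The same text with a literal non-ASCII character
  val raw: String = "Movilidad eléctrica"

  /** True when the string encodes to the same bytes in UTF-8 and ISO-8859-1. */
  def sameBytes(s: String): Boolean =
    java.util.Arrays.equals(
      s.getBytes(StandardCharsets.UTF_8),
      s.getBytes(StandardCharsets.ISO_8859_1)
    )
}
```

Here sameBytes(escaped) is true and sameBytes(raw) is false. Note also that getBytes() without an argument, as used in the detector sample, encodes with the platform default charset, so passing an explicit charset makes the detector's input reproducible across machines.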