String encoding (UTF-8) has problems #223

Open amnaredo opened 2 years ago

amnaredo commented 2 years ago

Scala version: 2.12.4
uPickle version: 0.6.5

While using uPickle, I found that the content I was writing to my uJson objects had encoding problems. My input is encoded in UTF-8, but my output string (when I render/write/toString my uJson value) appears to be in a different encoding (ISO-8859-1).

My code sample (listed below) takes this JSON:

{
"PT": "Mobilidade Eléctrica"
}

and tries to add a new entry: "ES": "Movilidad eléctrica". The desired output is:

{
"PT": "Mobilidade Eléctrica",
"ES": "Movilidad eléctrica"
}

My code sample

In this code sample the last three asserts fail. If you println them you'll see the output is:

{
    "PT": "Mobilidade El\u00e9ctrica",
    "ES": "Movilidad el\u00e9ctrica"
}

import upickle.default._
object Main {

  def main(args: Array[String]): Unit = {

    val lang: String = "ES"
    val entityTypeGroupDisplayName: String = "Movilidad eléctrica"

    val entityTypeGroupDisplayNameJson: ujson.Js = ujson.read("""{"PT": "Mobilidade Eléctrica"}""")

    val expectedJSONString:String = """{"PT":"Mobilidade Eléctrica","ES":"Movilidad eléctrica"}"""

    assert(entityTypeGroupDisplayName == "Movilidad eléctrica")

    val langsDict = entityTypeGroupDisplayNameJson.obj
    langsDict.put(lang, entityTypeGroupDisplayName)

    assert(langsDict("ES").str == "Movilidad eléctrica")
    assert(entityTypeGroupDisplayNameJson("ES").str =="Movilidad eléctrica")

    assert(write(entityTypeGroupDisplayNameJson, -1) == expectedJSONString) // THIS FAILS
    assert(entityTypeGroupDisplayNameJson.render(-1) == expectedJSONString)  // THIS FAILS
    assert(entityTypeGroupDisplayNameJson.toString() == expectedJSONString)  // THIS FAILS

  }
}

Code sample with charset detector

I wrote another version to debug the problem further, detecting the encoding with Apache Tika's CharsetDetector:


import upickle.default._
import org.apache.tika.parser.txt.CharsetDetector

object TestJsonEncodingDetected {

  def main(args: Array[String]): Unit = {

    val lang: String = "ES"
    val entityTypeGroupDisplayName: String = "Movilidad eléctrica"

    val entityTypeGroupDisplayNameJson: ujson.Js = ujson.read("""{"PT": "Mobilidade Eléctrica"}""")

    val expectedJSONString:String = """{"PT":"Mobilidade Eléctrica","ES":"Movilidad eléctrica"}"""

    val detector: CharsetDetector = new CharsetDetector()
    println(s"\nentityTypeGroupDisplayName input: ${entityTypeGroupDisplayName}\tencoding: ${detector.setText(entityTypeGroupDisplayName.getBytes()).detect().getName}")
    println(s"\nentityTypeGroupDisplayNameJson before adding new element: ${entityTypeGroupDisplayNameJson.render(2)} \tencoding: ${detector.setText(entityTypeGroupDisplayNameJson.render(2).getBytes()).detect().getName}")
    assert(entityTypeGroupDisplayName == "Movilidad eléctrica")

    val langsDict = entityTypeGroupDisplayNameJson.obj
    langsDict.put(lang, entityTypeGroupDisplayName)

    println(s"\nlangsDict('ES').str: ${langsDict("ES").str}\tencoding: ${detector.setText(langsDict("ES").str.getBytes()).detect().getName}")
    assert(langsDict("ES").str == "Movilidad eléctrica")
    println(s"\nentityTypeGroupDisplayNameJson('ES'): ${entityTypeGroupDisplayNameJson("ES").str} \tencoding: ${detector.setText(entityTypeGroupDisplayNameJson("ES").str.getBytes()).detect().getName}")
    assert(entityTypeGroupDisplayNameJson("ES").str =="Movilidad eléctrica")

    println(s"\nentityTypeGroupDisplayNameJson [write method] after adding new element: ${write(entityTypeGroupDisplayNameJson, 2)} \tencoding: ${detector.setText(write(entityTypeGroupDisplayNameJson, 2).getBytes()).detect().getName}")
    println(s"\nentityTypeGroupDisplayNameJson [render method] after adding new element: ${entityTypeGroupDisplayNameJson.render(2)} \tencoding: ${detector.setText(entityTypeGroupDisplayNameJson.render(2).getBytes()).detect().getName}")
    println(s"\nentityTypeGroupDisplayNameJson [toString method] after adding new element: ${entityTypeGroupDisplayNameJson.toString()} \tencoding: ${detector.setText(entityTypeGroupDisplayNameJson.toString().getBytes()).detect().getName}")
    assert(write(entityTypeGroupDisplayNameJson, -1) == expectedJSONString)  // THIS FAILS
    assert(entityTypeGroupDisplayNameJson.render(-1) == expectedJSONString)  // THIS FAILS 
    assert(entityTypeGroupDisplayNameJson.toString() == expectedJSONString)  // THIS FAILS

  }
}

NOTE: this latter version requires the Apache Tika Parsers jar. Here is the Maven dependency I used:

        <!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>1.17</version>
        </dependency>
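
One possible reading of the detector results, offered as an assumption rather than a confirmed diagnosis: a JVM String has no byte encoding of its own, and getBytes() with no argument encodes it with the platform default charset. If the rendered output contains only ASCII characters (because é was written as an escape sequence), its bytes are identical in UTF-8 and ISO-8859-1, so a byte-level detector has nothing to tell the two apart. The sketch below uses only the JDK; the object and method names are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.util.Arrays

// Hypothetical helper for illustration; not part of uPickle or Tika.
object AsciiCheck {

  // True when every character of s can be encoded as US-ASCII.
  def isPureAscii(s: String): Boolean =
    StandardCharsets.US_ASCII.newEncoder().canEncode(s)

  // True when s produces the same bytes in UTF-8 and in ISO-8859-1,
  // which means the bytes alone cannot reveal which charset was used.
  def sameBytesInBothCharsets(s: String): Boolean =
    Arrays.equals(
      s.getBytes(StandardCharsets.UTF_8),
      s.getBytes(StandardCharsets.ISO_8859_1)
    )
}
```

Under this reading, the escaped output passes both checks, while the original input with a literal é fails both. Passing an explicit charset to getBytes, as done here, also removes the dependency on the platform default.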

ID: 225 Original Author: ricardogaspar2

amnaredo commented 2 years ago

My workaround was to use org.apache.commons.text.StringEscapeUtils (org.apache.commons.lang3.StringEscapeUtils is the deprecated version) to convert the JSON string (the output of the render/write/toString methods) to a string without Unicode escape sequences.

Either method, unescapeJava or unescapeJson, worked for my example.

Version with asserts only

import upickle.default._
import org.apache.commons.text.StringEscapeUtils

object TestJsonEncoding {

  def main(args: Array[String]): Unit = {

    val lang: String = "ES"
    val entityTypeGroupDisplayName: String = "Movilidad eléctrica"

    val entityTypeGroupDisplayNameJson: ujson.Js = ujson.read("""{"PT": "Mobilidade Eléctrica"}""")

    val expectedJSONString:String = """{"PT":"Mobilidade Eléctrica","ES":"Movilidad eléctrica"}"""

    assert(entityTypeGroupDisplayName == "Movilidad eléctrica")

    val langsDict = entityTypeGroupDisplayNameJson.obj
    langsDict.put(lang, entityTypeGroupDisplayName)

    assert(langsDict("ES").str == "Movilidad eléctrica")
    assert(entityTypeGroupDisplayNameJson("ES").str =="Movilidad eléctrica")

    assert(StringEscapeUtils.unescapeJson(write(entityTypeGroupDisplayNameJson, -1)) == expectedJSONString)  // THIS WORKS
    assert(StringEscapeUtils.unescapeJson(entityTypeGroupDisplayNameJson.render(-1)) == expectedJSONString)  // THIS WORKS
    assert(StringEscapeUtils.unescapeJson(entityTypeGroupDisplayNameJson.toString()) == expectedJSONString)  // THIS WORKS

    assert(write(entityTypeGroupDisplayNameJson, -1) == expectedJSONString)  // THIS FAILS
    assert(entityTypeGroupDisplayNameJson.render(-1) == expectedJSONString)  // THIS FAILS
    assert(entityTypeGroupDisplayNameJson.toString() == expectedJSONString)  // THIS FAILS

  }
}
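
One caveat with applying unescapeJson to a whole rendered document: it also unescapes sequences such as \" and \\, so a value containing a quote or a backslash would no longer be valid JSON afterwards. A narrower variant, sketched below with the standard library only, decodes just the \uXXXX escapes that stand for non-ASCII characters and leaves every other escape untouched. The object and method names are hypothetical, and this is a sketch under those assumptions, not something uPickle provides.

```scala
// Hypothetical helper for illustration; not part of uPickle or Commons Text.
object JsonUnescape {

  // Replaces unicode escapes (backslash, 'u', four hex digits) that encode
  // non-ASCII characters with the characters themselves. All other escapes,
  // including escaped quotes and escaped backslashes, are copied unchanged.
  def unescapeNonAscii(json: String): String = {
    val out = new StringBuilder
    var i = 0
    while (i < json.length) {
      val c = json.charAt(i)
      if (c == '\\' && i + 1 < json.length) {
        val isUnicodeEscape =
          json.charAt(i + 1) == 'u' &&
            i + 6 <= json.length &&
            json.substring(i + 2, i + 6).forall(ch => Character.digit(ch, 16) >= 0)
        if (isUnicodeEscape) {
          val code = Integer.parseInt(json.substring(i + 2, i + 6), 16)
          // Keep escapes for ASCII characters (control characters, quotes)
          // as they are, so the result stays valid JSON.
          if (code >= 0x80) out.append(code.toChar)
          else out.append(json.substring(i, i + 6))
          i += 6
        } else {
          // Any other escape: copy the backslash and the next character as
          // a pair, so an escaped backslash never starts a unicode escape.
          out.append(c).append(json.charAt(i + 1))
          i += 2
        }
      } else {
        out.append(c)
        i += 1
      }
    }
    out.toString
  }
}
```

In the example above, JsonUnescape.unescapeNonAscii(entityTypeGroupDisplayNameJson.render(-1)) would take the place of the StringEscapeUtils.unescapeJson call.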

Version with Apache Tika charset detector, asserts and printlns


import upickle.default._
import org.apache.tika.parser.txt.CharsetDetector
import org.apache.commons.text.StringEscapeUtils

object TestJsonEncodingDetected {

  def main(args: Array[String]): Unit = {

    val lang: String = "ES"
    val entityTypeGroupDisplayName: String = "Movilidad eléctrica"

    val entityTypeGroupDisplayNameJson: ujson.Js = ujson.read("""{"PT": "Mobilidade Eléctrica"}""")

    val expectedJSONString:String = """{"PT":"Mobilidade Eléctrica","ES":"Movilidad eléctrica"}"""

    val detector: CharsetDetector = new CharsetDetector()
    println(s"\nentityTypeGroupDisplayName input: ${entityTypeGroupDisplayName}\tencoding: ${detector.setText(entityTypeGroupDisplayName.getBytes()).detect().getName}")
    println(s"\nentityTypeGroupDisplayNameJson before adding new element: ${entityTypeGroupDisplayNameJson.render(-1)} \tencoding: ${detector.setText(entityTypeGroupDisplayNameJson.render(-1).getBytes()).detect().getName}")
    assert(entityTypeGroupDisplayName == "Movilidad eléctrica")

    val langsDict = entityTypeGroupDisplayNameJson.obj
    langsDict.put(lang, entityTypeGroupDisplayName)

    println(s"\nlangsDict('ES').str: ${langsDict("ES").str}\tencoding: ${detector.setText(langsDict("ES").str.getBytes()).detect().getName}")
    assert(langsDict("ES").str == "Movilidad eléctrica")
    println(s"\nentityTypeGroupDisplayNameJson('ES'): ${entityTypeGroupDisplayNameJson("ES").str} \tencoding: ${detector.setText(entityTypeGroupDisplayNameJson("ES").str.getBytes()).detect().getName}")
    assert(entityTypeGroupDisplayNameJson("ES").str =="Movilidad eléctrica")

    println(s"\nentityTypeGroupDisplayNameJson [write method] after adding new element: ${write(entityTypeGroupDisplayNameJson, -1)} \tencoding: ${detector.setText(write(entityTypeGroupDisplayNameJson, -1).getBytes()).detect().getName} \tafter StringEscapeUtils.unescapeJson: ${StringEscapeUtils.unescapeJson(write(entityTypeGroupDisplayNameJson, -1))}")
    println(s"\nentityTypeGroupDisplayNameJson [render method] after adding new element: ${entityTypeGroupDisplayNameJson.render(-1)} \tencoding: ${detector.setText(entityTypeGroupDisplayNameJson.render(-1).getBytes()).detect().getName} \tafter StringEscapeUtils.unescapeJson: ${StringEscapeUtils.unescapeJson(entityTypeGroupDisplayNameJson.render(-1))}")
    println(s"\nentityTypeGroupDisplayNameJson [toString method] after adding new element: ${entityTypeGroupDisplayNameJson.toString()} \tencoding: ${detector.setText(entityTypeGroupDisplayNameJson.toString().getBytes()).detect().getName} \tafter StringEscapeUtils.unescapeJson: ${StringEscapeUtils.unescapeJson(entityTypeGroupDisplayNameJson.toString())}")

    assert(StringEscapeUtils.unescapeJson(write(entityTypeGroupDisplayNameJson, -1)) == expectedJSONString)  // THIS WORKS
    assert(StringEscapeUtils.unescapeJson(entityTypeGroupDisplayNameJson.render(-1)) == expectedJSONString)  // THIS WORKS
    assert(StringEscapeUtils.unescapeJson(entityTypeGroupDisplayNameJson.toString()) == expectedJSONString)  // THIS WORKS

    assert(write(entityTypeGroupDisplayNameJson, -1) == expectedJSONString)  // THIS FAILS
    assert(entityTypeGroupDisplayNameJson.render(-1) == expectedJSONString)  // THIS FAILS
    assert(entityTypeGroupDisplayNameJson.toString() == expectedJSONString)  // THIS FAILS

  }
}

Maven artifacts for Apache Tika (for the detector) and Apache Commons Text:

        <!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>1.17</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.commons/commons-text -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-text</artifactId>
            <version>1.3</version>
        </dependency>

Original Author: ricardogaspar2

amnaredo commented 2 years ago

Escaped UTF-8 characters are valid JSON. The escaping is currently hardcoded in https://github.com/lihaoyi/upickle/blob/master/ujson/src/ujson/Renderer.scala#L124; feel free to send a PR to soft-code it.

Original Author: lihaoyi
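
To illustrate the point about escaping: in JSON, a \uXXXX escape and the literal character it names decode to the same string, so the rendered output carries the same data as the expected one and differs only in spelling. The sketch below imitates that kind of escaping with the standard library only. It is an assumption-based illustration with hypothetical names, not the code in Renderer.scala, and it deliberately ignores quotes, backslashes and control characters.

```scala
// Hypothetical helper for illustration; not uPickle's renderer.
object EscapeSketch {

  // Writes every character above the ASCII range as a backslash, 'u' and
  // four lowercase hex digits, and copies ASCII characters unchanged.
  def escapeNonAscii(s: String): String = {
    val out = new StringBuilder
    s.foreach { c =>
      if (c > 127) out.append("\\u").append(f"${c.toInt}%04x")
      else out.append(c)
    }
    out.toString
  }
}
```

Applied to the value from the issue, this turns the é in "Movilidad eléctrica" into the six ASCII characters seen in the rendered output, which is why a plain string comparison against the expected JSON fails even though both spellings parse to the same value.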

amnaredo commented 2 years ago

I see. But this is actually bad behaviour, and it is not documented that render/write/toString use another encoding for the input string, nor that the returned string is UTF-8 escaped. These methods could take an additional charset argument so that they return a string in the specified encoding. For now, the least you could do is document it and explain how to work around the issue (feel free to use my workaround). That is, if you want more people to use your library, of course...

Original Author: ricardogaspar2