DaveGamble / cJSON

Ultralightweight JSON parser in ANSI C
MIT License

Use \uxxxx(s) to print non-BMP characters #783

Open unbadfish opened 11 months ago

unbadfish commented 11 months ago

Support characters outside the BMP (Basic Multilingual Plane) by printing them as a pair of \uXXXX escapes (a UTF-16 surrogate pair), e.g. \ud801\udc1d.

Background:

RFC 8259 (December 2017, https://datatracker.ietf.org/doc/html/rfc8259) says:

To escape an extended character that is not in the Basic Multilingual Plane (BMP), the character is represented as a 12-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". (in the top part of Page 9)
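As a worked example of the encoding quoted above, here is a minimal sketch of splitting a supplementary code point into its surrogate pair. The function name `to_surrogate_pair` is hypothetical and is not part of cJSON; the arithmetic is the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves).

```c
#include <stdint.h>

/* Hypothetical helper, not a cJSON function.
 * Splits a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair.
 * Returns 1 on success, 0 if the code point is outside that range. */
static int to_surrogate_pair(uint32_t code_point, uint16_t *high, uint16_t *low)
{
    uint32_t offset;

    if (code_point < 0x10000 || code_point > 0x10FFFF)
    {
        return 0;
    }
    offset = code_point - 0x10000;                  /* 20-bit value */
    *high = (uint16_t)(0xD800 + (offset >> 10));    /* top 10 bits */
    *low = (uint16_t)(0xDC00 + (offset & 0x3FF));   /* bottom 10 bits */
    return 1;
}
```

For the G clef (U+1D11E) this gives 0xD834 and 0xDD1E, matching the "\uD834\uDD1E" in the RFC text.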

Here are some sentences I copied from the web:

In the Unicode standard, a plane is a continuous group of 65,536 (2^16) code points. There are 17 planes, identified by the numbers 0 to 16, which correspond to the possible values 00–10 (hexadecimal) of the first two positions in the six-position hexadecimal format (U+hhhhhh). Plane 0 is the Basic Multilingual Plane (BMP), which contains most commonly used characters. The higher planes 1 through 16 are called "supplementary planes". As of Unicode version 15.0, five of the planes have assigned code points (characters), and seven are named.
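Since each plane holds 2^16 code points, the plane number is simply the code point shifted right by 16 bits. A small sketch (the names `unicode_plane` and `is_bmp` are illustrative only, not taken from cJSON):

```c
#include <stdint.h>

/* Illustrative helpers, not part of cJSON. */

/* Plane number (0..16) of a code point, or -1 if it is above U+10FFFF. */
static int unicode_plane(uint32_t code_point)
{
    if (code_point > 0x10FFFF)
    {
        return -1;
    }
    return (int)(code_point >> 16);
}

/* Non-zero when the code point lies in plane 0, the BMP. */
static int is_bmp(uint32_t code_point)
{
    return unicode_plane(code_point) == 0;
}
```

With these, 陈 (U+9648) is in plane 0, 𐐝 (U+1041D) in plane 1, and 𤭢 (U+24B62) in plane 2.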

Current behavior:

Regardless of how many bytes a UTF-8 character occupies (2, 3, or 4), cJSON copies it from memory to the output buffer unchanged.

input:

{
  "1234bytes": "\u0043 h \u0079 Ҁ \u04a2 Ӯ \u9648 厚 \u5c27 𐐝 \ud852\udf62",
  "appear": "C h y Ҁ Ң Ӯ 陈 厚 尧 𐐝 𤭢",
  "hex": "\u0043 \u0068 \u0079 \u0480 \u04a2 \u04ee \u9648 \u539a \u5c27 \ud801\udc1d \ud852\udf62"
}

current output (1.7.16):

{
    "1234bytes":    "C h y Ҁ Ң Ӯ 陈 厚 尧 𐐝 𤭢",
    "appear":   "C h y Ҁ Ң Ӯ 陈 厚 尧 𐐝 𤭢",
    "hex":  "C h y Ҁ Ң Ӯ 陈 厚 尧 𐐝 𤭢"
}

It is worth stating that:

  1. All the strings, as required by the standard, are stored in memory as UTF-8;
  2. The files are opened in "rb"/"wb" mode to avoid any character conversion;
  3. Each character in C h y is 1 byte in UTF-8; each in Ҁ Ң Ӯ is 2 bytes; each in 陈 厚 尧 is 3 bytes; each in 𐐝 𤭢 is 4 bytes.
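The byte counts in the last point can be read off the lead byte of each UTF-8 sequence. A minimal sketch, assuming a hypothetical helper named `utf8_sequence_length` (not a cJSON function):

```c
/* Hypothetical helper, not part of cJSON.
 * Returns the length in bytes (1..4) of the UTF-8 sequence that starts with
 * `lead`, or 0 if `lead` is a continuation byte or an invalid lead byte. */
static int utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)
    {
        return 1; /* 0xxxxxxx: ASCII */
    }
    if ((lead & 0xE0) == 0xC0)
    {
        return 2; /* 110xxxxx */
    }
    if ((lead & 0xF0) == 0xE0)
    {
        return 3; /* 1110xxxx */
    }
    if ((lead & 0xF8) == 0xF0)
    {
        return 4; /* 11110xxx: always a non-BMP character */
    }
    return 0;
}
```

Only the 4-byte case can encode a character outside the BMP, so that is the only case the requested escaping would need to treat specially.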

Expected behavior:

If a character is not in the Basic Multilingual Plane (BMP), it should be represented as a 12-character sequence encoding the UTF-16 surrogate pair, as RFC 8259 describes.

Expected output:

{
    "1234bytes":    "C h y Ҁ Ң Ӯ 陈 厚 尧 \ud801\udc1d \ud852\udf62",
    "appear":   "C h y Ҁ Ң Ӯ 陈 厚 尧 \ud801\udc1d \ud852\udf62",
    "hex":  "C h y Ҁ Ң Ӯ 陈 厚 尧 \ud801\udc1d \ud852\udf62"
}
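The expected behavior could be sketched roughly as follows. This is only an illustration under stated assumptions, not the cJSON implementation: `escape_non_bmp` is a hypothetical standalone function, it assumes NUL-terminated UTF-8 input, and it deliberately ignores the other escaping a JSON printer must do (quotes, backslashes, control characters).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of cJSON.
 * Copies `in` to `out`, replacing every 4-byte UTF-8 sequence with its
 * 12-character \uXXXX\uXXXX escape. All other bytes are copied unchanged.
 * Returns the output length, or -1 if `out` is too small or a 4-byte
 * sequence is malformed. */
static int escape_non_bmp(const char *in, char *out, size_t out_size)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t used = 0;

    if (out_size == 0)
    {
        return -1;
    }
    while (*p != '\0')
    {
        if ((*p & 0xF8) == 0xF0)
        {
            unsigned long cp;
            char esc[13];

            /* The three following bytes must be continuation bytes (10xxxxxx).
             * Short-circuit evaluation stops at a premature NUL. */
            if ((p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80 || (p[3] & 0xC0) != 0x80)
            {
                return -1;
            }
            cp = ((unsigned long)(p[0] & 0x07) << 18) | ((unsigned long)(p[1] & 0x3F) << 12)
               | ((unsigned long)(p[2] & 0x3F) << 6) | (unsigned long)(p[3] & 0x3F);
            if (cp < 0x10000 || cp > 0x10FFFF || out_size - used < 13)
            {
                return -1;
            }
            cp -= 0x10000; /* 20 bits, split into two 10-bit halves */
            snprintf(esc, sizeof(esc), "\\u%04lx\\u%04lx",
                     0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
            memcpy(out + used, esc, 12);
            used += 12;
            p += 4;
        }
        else
        {
            if (out_size - used < 2)
            {
                return -1;
            }
            out[used++] = (char)*p++;
        }
    }
    out[used] = '\0';
    return (int)used;
}
```

Given the UTF-8 bytes of 𐐝 (F0 90 90 9D), this produces \ud801\udc1d, while 1- to 3-byte characters such as 陈 pass through untouched.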

Test info

The "current output (1.7.16)" was produced by this simple C program:

#include <stdio.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include "../cJSON.h"

int main(void)
{
    setlocale(LC_ALL, "");
    FILE *fin = NULL, *fout = NULL;
    long flen;
    /* If the _s functions are not available, use instead:
       fin = fopen("test/unicodein.json", "rb");
       fout = fopen("test/unicodeout.json", "wb");
       and check both pointers for NULL. */
    if (fopen_s(&fin, "test/unicodein.json", "rb") != 0 ||
        fopen_s(&fout, "test/unicodeout.json", "wb") != 0)
    {
        printf("open fail.\n");
        return -1;
    }
    /* Read the whole input file into a NUL-terminated buffer. */
    fseek(fin, 0L, SEEK_END);
    flen = ftell(fin);
    rewind(fin);
    if (flen < 0)
    {
        printf("read fail.\n");
        return -1;
    }
    char *container = (char *)calloc((size_t)flen + 1, sizeof(char));
    if (container == NULL ||
        fread(container, sizeof(char), (size_t)flen, fin) != (size_t)flen)
    {
        printf("read fail.\n");
        return -1;
    }
    fclose(fin);
    cJSON *cjson_test = cJSON_Parse(container);
    free(container);
    if (cjson_test == NULL)
    {
        printf("parse fail.\n");
        return -1;
    }
    printf("parse ok.\n");
    char *outstr = cJSON_Print(cjson_test);
    if (outstr != NULL)
    {
        /* Write strlen(outstr) items of one byte each. */
        fwrite(outstr, sizeof(char), strlen(outstr), fout);
        // puts(outstr);
        cJSON_free(outstr);
    }
    fclose(fout);
    cJSON_Delete(cjson_test);
    return 0;
}

I believe this code can run on nearly every platform with no changes, or with only a small change to the file-opening _s functions.

My toolchain: Microsoft Visual Studio 2022 v17.7.3 MSVC v143 Win10 SDK 10.0.19041.0

unbadfish commented 11 months ago

I have drawn a flowchart that shows how my commits work:

[Image: flowchart_print]

I hope it meets the standard correctly.


The following test JSON file now prints non-BMP characters correctly. Input:

{
  "1234bytes": "\u0043 h \u0079 Ҁ \u04a2 Ӯ \u9648 厚 \u5c27 𐐝 \ud852\udf62",
  "appear": "C h y Ҁ Ң Ӯ 陈 厚 尧 𐐝 𤭢",
  "hex": "\u0043 \u0068 \u0079 \u0480 \u04a2 \u04ee \u9648 \u539a \u5c27 \ud801\udc1d \ud852\udf62",
  "feel": "\";\\),😊😁!",
  "movement": "\u000b",
  "name": "陈厚尧",
  "school": "Beijing Institute of Technology(BIT)",
  "id": 1120222936,
  "found": 20230903,
  "fix": 20230906,
  "fomat": "yyyymmdd"
}

output:

{
    "1234bytes":    "C h y Ҁ Ң Ӯ 陈 厚 尧 \ud801\udc1d \ud852\udf62",
    "appear":   "C h y Ҁ Ң Ӯ 陈 厚 尧 \ud801\udc1d \ud852\udf62",
    "hex":  "C h y Ҁ Ң Ӯ 陈 厚 尧 \ud801\udc1d \ud852\udf62",
    "feel": "\";\\),\ud83d\ude0a\ud83d\ude01!",
    "movement": "\u000b",
    "name": "陈厚尧",
    "school":   "Beijing Institute of Technology(BIT)",
    "id":   1120222936,
    "found":    20230903,
    "fix":  20230906,
    "fomat":    "yyyymmdd"
}
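One way to check that the escaped output still denotes the same characters as the input is to decode each surrogate pair back into a code point. A minimal sketch; `from_surrogate_pair` is a hypothetical name used here for illustration and is not a cJSON function.

```c
#include <stdint.h>

/* Hypothetical helper, not part of cJSON.
 * Combines a UTF-16 surrogate pair into a code point.
 * Returns 0 if `high` or `low` is not a valid high/low surrogate. */
static uint32_t from_surrogate_pair(uint16_t high, uint16_t low)
{
    if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF)
    {
        return 0;
    }
    return 0x10000 + (((uint32_t)(high - 0xD800) << 10) | (uint32_t)(low - 0xDC00));
}
```

For the "feel" value above, \ud83d\ude0a decodes to U+1F60A (😊) and \ud83d\ude01 to U+1F601 (😁), the same characters as in the input.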

This test file contains my public personal information to identify who I am.