DaveGamble / cJSON

Ultralightweight JSON parser in ANSI C
MIT License
10.73k stars 3.21k forks source link

Clean up the cJSON struct #63

Open FSMaxB opened 7 years ago

FSMaxB commented 7 years ago

Currently theres three ways the cJSON struct can be improved:

  1. Get rid of valueint
  2. Use a union for valuestring and valuedouble
  3. Using good names for the elements

If you don't think this is a good idea, please tell me why.

Let me explain:

Getting rid of valueint

Using a union of valuestring and valuedouble

As an instance of the cJSON struct can only be either a string type or a number type, not both at the same time. Therefore they can be stored at the same place in memory.

New version of the cJSON struct

typedef struct cJSON
{
    /* next/prev allow you to walk array/object chains. Alternatively, use GetArraySize/GetArrayItem/GetObjectItem */
    struct cJSON *next;
    struct cJSON *previous;
    /* An array or object item will have a child pointer pointing to a chain of the items in the array/object. */
    struct cJSON *child;
    /* The item's name string, if this item is the child of, or is in the list of subitems of an object. */
    char *name;
    union {
        /* The item's string, if type==cJSON_String */
        char *string;
        /* The item's number, if type==cJSON_Number */
        double number;
    }
    /* The type of the item, as above. */
    int type;
} cJSON;
DaveGamble commented 7 years ago

With fairly minimal rework, one could eliminate previous too. It's barely ever read in the library, and when it is, a single temp variable could eliminate it.

FSMaxB commented 7 years ago

Yes, now that I think about it the library only ever uses it as single linked list.

iMobs commented 7 years ago

You might also want to play around with the order of the variables for better struct packing. Depending on the word/pointer size of the system, this could reduce the size of the object.

FSMaxB commented 7 years ago

There are potentially 3 different sizes in the new struct. The size of a pointer, sizeof(int) and sizeof(double). The order I used in the suggested version of the new struct already optimizes that for x86_64 with still being the best version for 32 bit x86. But I haven't considered all the other possible size combinations on other processors.

This would change when removing the previous pointer though.

FSMaxB commented 7 years ago

With previous removed, the following variant has no padding on all architectures I can think of:

typedef struct cJSON
{
    union {
        char *string;
        double number;
    }
    char *name;
    struct cJSON *next;
    struct cJSON *child;
    int type;
} cJSON;
ffontaine commented 7 years ago

I would suggest to give a name to the union such as data or u indeed anonymous structures and unions are a C11 feature.

FSMaxB commented 7 years ago

I consider this C11 compatible (if even) pseudo code XD. That shows you that I never put this code through a compiler. But correct syntax doesn't matter for the purpose of this discussion.

ffontaine commented 7 years ago

Sorry, I think my comment was incomplete and didn't give you enough background. I wanted to point out that I would like to keep cjson compatible with C99 and to avoid any C11 features such as anonymous structures. Indeed some old gcc like gcc 4.3 (which is used to compile cjson for blackfin target on buildroot) will not be able to compile cjson anymore.

FSMaxB commented 7 years ago

Actually, cJSON is keeping compatible with C89 and avoids any C99 and C11 features (that haven't been available in C89). Don't worry, this GitHub issue is not cJSONs source code!

Dreaster commented 7 years ago

How about changing the type to be an enum instead so that the the code will be clearer?

FSMaxB commented 7 years ago

That sounds like a really good idea, but I thought that it is undefined behavior to use an enum with another than the predefined values. Let me check if that is actually true.

Because type is supposed to be used as bit flags, so you can check if something is array or object for example (e.g. if (json->type & (cJSON_Array | cJSON_Object)).

Dreaster commented 7 years ago

You are very correct. I have completly missed that the type was used as bit flags. I thought that it truly was one of the cJSON_types defines. In that case. A different name perhaps inroder to clairify. "used_types" or something like that.

FSMaxB commented 7 years ago

Renaming it to typeinfo instead of type might actually be a good idea, because it not only stores the type, but also cJSON_IsReference and cJSON_IsConst.

Another possible solution: Make it a bitfield, so cJSON_IsReference and cJSON_IsConst are not part of the type.

Also cJSON_False, cJSON_True ... can be declared ad const int instead of using a preprocessor macro.

So my current suggestion based on using a bitfield:

typedef struct cJSON
{
    union {
        char *string;
        double number;
    } data; /* the union is now named `data` */
    char *name;
    struct cJSON *next;
    struct cJSON *child;
    unsigned int type : 12; /* 12 bits allow for 5 more types to be added later */
    int is_reference : 1;
    int is_const : 1;
    /* reserve the remaining two bits, not sure if this should be done or not */
    int unused1 : 1;
    int unused2 : 1;
 } cJSON;

Notes:

FSMaxB commented 7 years ago

Using a bitfield fixes ->type == cJSON_XXXX if a node is a reference or contains a constant string. So it is not necessary to use & 0xFF anymore when comparing.

FSMaxB commented 7 years ago

Defining the types as const instead of #define doesn't work because it is not supported in switch case.

FSMaxB commented 7 years ago

Well, let's change all of the bit fields from int to unsigned int. For this reason!

FSMaxB commented 7 years ago

In order to be fully JSON conformant, cJSON needs to support '\0' in strings. This requires storing the length for name and string. Also it should be possible to use an enum for the type.

I also think that cJSON_False and cJSON_True should not be separate types. There should be a boolean type instead and an added value in the data union.

cJSON_Raw can be removed and replaced by a flag in the bitfield. This allows specifying the type of raw JSON. But this also means that an unknown type needs to be added. Although this increases the potential of misuse (because you cannot rely on just the type anymore to determine what union member to access).

New draft of the new cJSON struct (also let's rename it to cJSON_item):

enum cJSON_Type
{ 
    CJSON_INVALID = 0,
    CJSON_BOOLEAN = 1,
    CJSON_NULL = 2,
    CJSON_NUMBER = 4,
    CJSON_STRING = 8,
    CJSON_ARRAY = 16,
    CJSON_OBJECT = 32,
    CJSON_UNKNOWN = 64
};

typedef struct cJSON_string
{
    unsigned char *data;
    size_t length;
}

typedef struct cJSON_item
{
    union 
    {
        cJSON_string string;
        double number;
        cJSON_bool boolean;
    } data; /* the union is now named `data` */
    cJSON_string name;
    struct cJSON_item *next;
    struct cJSON_item *child;
    enum cJSON_type type : 12; /* 12 bits allow adding 3 types later, not sure about the 4th because enum is signed int and it behaves weird in bit fields, maybe use unsigned int here instead of enum */
    unsigned int read_only_name : 1;
    unsigned int read_only_string : 1;
    unsigned int read_only_struct : 1; /* the cJSON struct itself should not be deallocated, useful for cJSON_PrintPreallocated */
    unsigned int raw : 1;
 } cJSON_item;

Although cJSON_string has a length property, data should always be length + 1 bytes where the last one is '\0'. Just as a precaution if somebody feeds this to a function that expects zero terminated strings.

Because unions are dangerous, macros and functions should exist for accessing everything, hiding the implementation details.

FSMaxB commented 6 years ago

I thought of a better idea than using a union. Since cJSON structs are not stored in arrays, a dynamicly sized struct can be used. There can be a base struct called cJSON with type information and a next pointer for the linked list. Other types would then have a cJSON as their first member and add additional data after that.

This would have mostly two benefits:

Example:

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

typedef int cJSON_bool;

typedef enum cJSON_Type {
    CJSON_INVALID = 0,
    CJSON_BOOLEAN = 1,
    CJSON_NULL = 2,
    CJSON_NUMBER = 3,
    CJSON_STRING = 4,
    CJSON_ARRAY = 5,
    CJSON_OBJECT = 6
} cJSON_Type;

typedef enum cJSON_NumberType {
    CJSON_NUMBER_DOUBLE = 0,
    CJSON_NUMBER_INT = 1,
    CJSON_NUMBER_LONG = 2
} cJSON_NumberType;

typedef struct cJSON_string {
    const char *string;
    size_t length;
} cJSON_string;

typedef struct cJSON {
    struct cJSON *next;
    cJSON_string key;
    cJSON_Type type : 4; /* leaves room for 9 more types */
    unsigned int is_reference : 1;
    unsigned int key_is_reference : 1;
    unsigned int data_is_reference : 1;
    unsigned int is_raw : 1;
    cJSON_NumberType number_type : 2;
    unsigned int reserved : 6;
} cJSON;

typedef cJSON cJSON_Null;

typedef struct cJSON_Boolean {
    cJSON base;
    cJSON_bool value : 8;
} cJSON_Boolean;

typedef struct cJSON_Array {
    cJSON base;
    cJSON *child;
} cJSON_Array;

typedef cJSON_Array cJSON_Object;

typedef struct cJSON_String {
    cJSON base;
    cJSON_string value;
} cJSON_String;

typedef cJSON_String cJSON_Raw;

typedef struct cJSON_Number {
    cJSON base;
    union {
        double double_value;
        int integer;
        long int long_integer;
    };
} cJSON_Number;

int main(void) {
    printf("sizeof(cJSON) = %zu\n", sizeof(cJSON));
    printf("sizeof(cJSON_Boolean) = %zu\n", sizeof(cJSON_Boolean));
    printf("sizeof(cJSON_Array) = %zu\n", sizeof(cJSON_Array));
    printf("sizeof(cJSON_Object) = %zu\n", sizeof(cJSON_Object));
    printf("sizeof(cJSON_String) = %zu\n", sizeof(cJSON_String));
    printf("sizeof(cJSON_Raw) = %zu\n", sizeof(cJSON_Raw));
    printf("sizeof(cJSON_Number) = %zu\n", sizeof(cJSON_Number));
    printf("sizeof(cJSON_Null) = %zu\n", sizeof(cJSON_Null));

    return EXIT_SUCCESS;
}

On x86_64 the sizes are:

sizeof(cJSON) = 32
sizeof(cJSON_Boolean) = 40
sizeof(cJSON_Array) = 40
sizeof(cJSON_Object) = 40
sizeof(cJSON_String) = 48
sizeof(cJSON_Raw) = 48
sizeof(cJSON_Number) = 40
sizeof(cJSON_Null) = 32

The names are not final.

I think this change would be especially beneficial for environments where heap space is limited.

FSMaxB commented 6 years ago

Also note that I added the possibility of having int or long int numbers, since there have been many requests to support scenarios where double is not available.

me-axentia commented 4 years ago

What happened to this? Its been more than 2 years and I can see it is planned for 2.00 but will it ever happen? Reducing the size of cJSON would be very nice for embedded systems, like the ones I'm developing. I would like to be able to handle quite large datastructures with as low memory footprint as possible. Just making a union of the data, removing the previous pointer and making a bitfiled of the type would go a long way. I know it would break the ABI, but is that really a big problem?

Alanscut commented 4 years ago

indeed, It seems that there are often cases where embedded systems fail due to insufficient memory, reducing the lib size really makes sense for embeded systems. But as Dave said https://github.com/DaveGamble/cJSON/issues/16#issuecomment-228606471, it increases the complexity of the lib, while increasing the user's debugging costs.

Nevertheless, I will try my best to continue Milestone 2.0.0, it will take some time. If you have a good solution, welcome to submit a PR.

me-axentia commented 4 years ago

Ok, I see. I have not started working with cJSON yet, but I plan to do it within a few months. As you say these changes may not be welcomed by everyone as they involve some trade-offs. I don't know if the changes I would like to see are suitable for everyone. I will probably start of using cJSON as it is but may be required to make my own fork optimized for embedded systems in general and my case in particular.

oneonce commented 4 years ago

char *string; “string” is the keyword/type of C++, recommend use str replace, or others

maximkulkin commented 4 years ago

@oneonce No, it's not. C++ has no keyword "string" and the standard string type is in "std" namespace.

RudyFisher7 commented 8 months ago

Holy cow, this issue has been waiting a while. Is cJSON even being maintained at this point? Or should I go elsewhere? (There are also a ton of really old open issues. Not very confidence inspiring, no offense....)

iMobs commented 8 months ago

Holy cow, this issue has been waiting a while. Is cJSON even being maintained at this point? Or should I go elsewhere? (There are also a ton of really old open issues. Not very confidence inspiring, no offense....)

@RudyFisher7 What are you talking about, there was a release in December. This is an open source project run by volunteers, don't judge the health of it based on if there are old discussions of a feature that never got resolved. It's a bad metric.

RudyFisher7 commented 8 months ago

No need to get defensive. If you actually look at the issues, more than half of them ARE bug reports actually, quite a few dating a few years back and still marked as open. (Names starting with _ are not found, inconsistent error reporting, etc.) If they WERE all simply enhancements I would of course agree with you. Also, the latest version is a 1.7.x version, just FYI. I guess I typically find, even community driven, repositories resolving bugs with new releases, but they remain open here, so raised a flag for me.

FSMaxB commented 8 months ago

I did give up maintaining it because of burnout a long time ago, but I've seen @Alanscut do some maintenance work since then.

RudyFisher7 commented 7 months ago

That's understandable.

mbratch commented 7 months ago

Actually, I like having valueint. Some embedded platforms don't do a good job of accurately representing the their largest int with a double. Just recently I had to patch our copy of cJSON to render the ints directly because rendering to double then assigning to int internally was giving inaccurate int values. Possibly the fault of the compiler, but the do ect int parse fixed it.