Need some help figuring out how to deserialize CJSON

Restream / reindexer

Embeddable, in-memory, document-oriented database with a high-level Query builder interface.

https://reindexer.io

Apache License 2.0

Need some help figuring out how to deserialize CJSON #72

Open DecapitatedKneecap opened 3 years ago

slowcheetahzzz commented 3 years ago

Hello! We can try to help you to understand how CJSON works.

You can find a general description of CJSON here: https://github.com/Restream/reindexer/tree/master/cpp_src/core/cjson (readme.md).

This is what a CJSON tag looks like: https://github.com/Restream/reindexer/blob/master/cpp_src/core/cjson/ctag.h

This is how you are supposed to decode it: https://github.com/Restream/reindexer/blob/master/cpp_src/core/cjson/cjsondecoder.cc

This is how you should encode it: https://github.com/Restream/reindexer/blob/master/cpp_src/core/cjson/baseencoder.cc

This is some magic with 'runtime' updates (might look really scary at first): https://github.com/Restream/reindexer/blob/master/cpp_src/core/cjson/cjsonmodifier.cc

What you need to know is that there are 2 types of CJSON: tuple and 'transportable' CJSON. The first one works like a scheme for an item: it contains a brief description of every field (name tag, type number and field number). If a field is indexed, the tuple contains only this description, and the encoded field index lets you get the field's real value quickly from the real Item. If a field is not indexed, its field number is -1 and its value is encoded right after the tag, so the values of non-indexed fields are stored in the CJSON itself. The second type is 'transportable' CJSON: we need it to transfer query results from one client to another (e.g. over a network connection or through CGO serialization). This type encodes every field's value (not just a reference to it by field index), so it consumes more memory.

Hope it will help you somehow.

It's hard to answer your specific question (not enough information) but you definitely don't need base64 to work with CJSON. We'll be happy to help you with this - you can contact me on Telegram here @slow_cheetah.

Have a good day!

Best wishes, Reindexer team.

slowcheetahzzz commented 3 years ago

Screenshot from 2021-08-06 14-00-37

This looks like an ordinary CJSON of some item - it's perfectly normal.

slowcheetahzzz commented 3 years ago

This is how it is implemented in Golang: https://github.com/Restream/reindexer/tree/master/cjson

slowcheetahzzz commented 3 years ago

ctag and carraytag always have the same size. If tag.field is -1, the field's value is encoded right after the tag; otherwise the next tag (the tag of the next field) follows immediately. The CJSON structure is also recursive (same as JSON). It's that simple, it just looks scary. So you first read a ctag, then in some cases you read the field's value (or just go on to the next tag), and you do this recursively until TAG_END is read. That's all.

Here is the briefest example of what has been described above:

void skipCjsonTag(ctag tag, Serializer &rdser) {
    const bool embeddedField = (tag.Field() < 0);
    switch (tag.Type()) {
        case TAG_ARRAY: {
            if (embeddedField) {
                carraytag atag = rdser.GetUInt32();
                for (int i = 0; i < atag.Count(); i++) {
                    ctag t = atag.Tag() != TAG_OBJECT ? atag.Tag() : rdser.GetVarUint();
                    skipCjsonTag(t, rdser);
                }
            } else {
                rdser.GetVarUint();
            }
        } break;

        case TAG_OBJECT:
            for (ctag otag = rdser.GetVarUint(); otag.Type() != TAG_END; otag = rdser.GetVarUint()) {
                skipCjsonTag(otag, rdser);
            }
            break;
        default:
            if (embeddedField) rdser.GetRawVariant(KeyValueType(tag.Type()));
    }
}

This is an actual piece of code used in Reindexer when some tag (plus its value) needs to be skipped. You don't need TagsMatcher and PayloadType here, just simple recursive code.

slowcheetahzzz commented 3 years ago

You want to parse the binary CJSON format as if it were a text string, and that makes no sense. I can tell you what, for example, 0006 means to the decoder: it is TAG_OBJECT (0x6), and so forth. What you need to do is dig deeper into CJSON and try to debug it. Create an Item, initialize it from readable JSON and then retrieve its CJSON.

        reindexer::Item item = rx.NewItem(nsName);
        err = item.FromJSON(jsonString);
        if (err.ok()) {
            auto cjson = item.GetCJSON(); // here is your cjson: a view over the raw bytes, not an Error
        }

That's how you can play with it: feed it every kind of JSON you can think of and look at the CJSON you get back. You can't treat it like a string that always has some unique patterns. ctag is an int that encodes 3 fields: name, type and field. This combination cannot be unique, simply because the name value can be any int (a big one), and the same goes for field; only the type value is limited (TAG_VARINT, TAG_DOUBLE, TAG_STRING, TAG_BOOL, TAG_NULL, TAG_ARRAY, TAG_OBJECT, TAG_END). The problem for text decoding here is that the three fields are unpacked from a ctag like this:

    int Type() const { return tag_ & ((1 << typeBits) - 1); }
    int Name() const { return (tag_ >> typeBits) & ((1 << nameBits) - 1); }
    int Field() const { return (tag_ >> (typeBits + nameBits)) - 1; }

And the result is just a single integer, so good luck decoding it as a string object; this is definitely not an area where I can help you. All you need is to get the CJSON as a byte array in C# and start doing what skipCjsonTag does: read it tag by tag. You read a varuint, initialize a ctag from it (making an equivalent in C# is a piece of cake), get its type field and here we are: the type can be anything from TAG_OBJECT to TAG_END. Or you might go the insane way: read the CJSON as a UTF8 string and parse this binary mash-up, trying to find some patterns there. This will fail anyway.
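To make the bit layout concrete, here is a minimal standalone sketch of a ctag-like class. It is not the Reindexer class itself: CTagSketch is a hypothetical name, and both the bit widths (3 bits for type, 12 for name) and the numeric tag values are assumptions; only TAG_OBJECT == 0x6 and the three accessors come from this thread. The constructor is simply the inverse of those accessors.

```cpp
#include <cassert>
#include <cstdint>

// Tag types in the order listed above. The numeric values are an assumption;
// only TAG_OBJECT == 6 is confirmed in this thread.
enum TagType : int {
    TAG_VARINT = 0, TAG_DOUBLE = 1, TAG_STRING = 2, TAG_BOOL = 3,
    TAG_NULL = 4, TAG_ARRAY = 5, TAG_OBJECT = 6, TAG_END = 7
};

// Hypothetical stand-in for ctag, for illustration only.
class CTagSketch {
public:
    static constexpr int typeBits = 3;   // assumed: enough for 8 tag types
    static constexpr int nameBits = 12;  // assumed width of the name index

    // Wrap a raw integer that was read from the buffer as a varuint.
    explicit CTagSketch(uint32_t raw) : tag_(raw) {}

    // Pack the three fields; field is stored as field + 1 so that -1 becomes 0.
    CTagSketch(int type, int name, int field = -1)
        : tag_(uint32_t(type) | (uint32_t(name) << typeBits) |
               (uint32_t(field + 1) << (typeBits + nameBits))) {
        assert(type >= 0 && type < (1 << typeBits));
        assert(name >= 0 && name < (1 << nameBits));
        assert(field >= -1);
    }

    int Type() const { return static_cast<int>(tag_ & ((1u << typeBits) - 1)); }
    int Name() const { return static_cast<int>((tag_ >> typeBits) & ((1u << nameBits) - 1)); }
    int Field() const { return static_cast<int>(tag_ >> (typeBits + nameBits)) - 1; }
    uint32_t Raw() const { return tag_; }

private:
    uint32_t tag_;
};
```

Under these assumptions, a raw value of 0x6 unpacks to type TAG_OBJECT, name 0 and field -1, which matches the 0006 example mentioned earlier. The same three shifts and masks port directly to C#.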

slowcheetahzzz commented 3 years ago

If you let us know what the goal of your secret mission is, then we'll probably be able to give you better advice.

slowcheetahzzz commented 3 years ago

Ok, clear.

As for varuint, we have this in WrSerializer:

    template <typename T, typename std::enable_if<sizeof(T) == 8 && std::is_integral<T>::value>::type * = nullptr>
    void PutVarUint(T v) {
        grow(10);
        len_ += uint64_pack(v, buf_ + len_);
    }

    template <typename T, typename std::enable_if<sizeof(T) <= 4 && std::is_integral<T>::value>::type * = nullptr>
    void PutVarUint(T v) {
        grow(10);
        len_ += uint32_pack(v, buf_ + len_);
    }

    template <typename T, typename std::enable_if<std::is_enum<T>::value>::type * = nullptr>
    void PutVarUint(T v) {
        assert(v >= 0 && v < 128);
        if (len_ + 1 >= cap_) grow(1);
        buf_[len_++] = v;
    }

I'm not sure how familiar you are with C++ template magic and SFINAE, but varuint is indeed a variable-length format: the number of bytes written varies with the value being encoded. Functions like uint64_pack, uint32_pack, etc. return the size of the encoded value in bytes. Take a look at them, it should help.
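As an illustration of what a variable-length unsigned integer looks like on the wire, here is a small standalone sketch. It assumes that uint32_pack and uint64_pack use the common protobuf-style base-128 varint layout (7 payload bits per byte, least significant group first, high bit set on every byte except the last); check the pack functions in the sources before relying on it. The putVarUint and getVarUint helpers below are hypothetical free functions, not the serializer methods.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends v as a base-128 varint (assumed layout, see the note above).
inline void putVarUint(std::vector<uint8_t> &out, uint64_t v) {
    while (v >= 0x80) {
        // Low 7 bits plus a continuation bit.
        out.push_back(static_cast<uint8_t>((v & 0x7F) | 0x80));
        v >>= 7;
    }
    out.push_back(static_cast<uint8_t>(v));  // last byte, continuation bit clear
}

// Reads one varint starting at pos and advances pos past it.
inline uint64_t getVarUint(const std::vector<uint8_t> &buf, size_t &pos) {
    uint64_t result = 0;
    for (int shift = 0; shift < 64; shift += 7) {
        assert(pos < buf.size());  // truncated input
        const uint8_t byte = buf[pos++];
        result |= static_cast<uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) return result;
    }
    assert(false && "varint longer than 10 bytes");
    return result;
}
```

With this layout, values below 128 take a single byte, and 300 is written as the two bytes 0xAC 0x02.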

slowcheetahzzz commented 3 years ago

As for encoding CJSON non-index fields (values + tags) take a look at CJsonBuilder class and methods like these:

CJsonBuilder &CJsonBuilder::Put(int tagName, int64_t arg) {
    if (type_ == ObjType::TypeArray) {
        itemType_ = TAG_VARINT;
    } else {
        putTag(tagName, TAG_VARINT);
    }
    ser_->PutVarint(arg);
    ++count_;
    return *this;
}

inline void CJsonBuilder::putTag(int tagName, int tagType) { ser_->PutVarUint(static_cast<int>(ctag(tagType, tagName))); }

It can help to understand how bytes are encoded.

slowcheetahzzz commented 3 years ago

As for C++ IDE for Windows you might try to use CLion - it works perfectly well with CMake projects + there is an opportunity to use it for free for the first 30 days (might be enough to accomplish your task).

slowcheetahzzz commented 3 years ago

I'll try to explain CJSON the easiest possible way here. Imagine you have this Item: {"id":1, "name":"Teddy", "rating":9}, where id and name are indexed fields and rating isn't.

We start with a ctag indicating that the CJSON has just started:

ser_->PutVarUint(static_cast<int>(ctag(TAG_OBJECT, tagName, -1)));

The type field is TAG_OBJECT, tagName is the integer value of the field name in TagsMatcher, and the field value is -1 simply because this is not an index: it is just a marker tag indicating the start of the object.

Then we go to the next field, name, which is an index:

ser_->PutVarUint(static_cast<int>(ctag(TAG_STRING, kNameTagName, kNameField)));

Right here we do not serialize name's value: it lives in the index and the CJSON is just a tuple. When sending these results to some friend abroad (the transportable form), there will be -1 instead of kNameField, and we'll also have to add this:

ser_->PutVString(teddyNameValueString);

or simply like this:

ser_->PutVarUint(static_cast<int>(ctag(TAG_STRING, kNameTagName, -1)));
ser_->PutVString(teddyNameValueString);

In case of the last field rating (which is not an index) we do it like this:

ser_->PutVarUint(static_cast<int>(ctag(TAG_VARINT, kRatingTagName, -1)));
ser_->PutVarint(teddyRatingValue);

And the final action is an indicator that we've finished this Item:

ser_->PutVarUint(static_cast<int>(ctag(TAG_END)));

I hope I answered all of your questions and now you understand how to distinguish a field in CJSON buffer.
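The walkthrough above can be tried end to end with a small standalone sketch that writes the transportable form of the Teddy item into a byte buffer and reads it back tag by tag. Everything here is illustrative rather than the Reindexer implementation: the tag names 1, 2 and 3, the bit widths, the numeric tag values, the base-128 varint layout, zigzag encoding for PutVarint and a length-prefixed layout for PutVString are all assumptions. The varint helpers are repeated so the sketch compiles on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Assumed numeric values and bit widths; only TAG_OBJECT == 0x6 is stated in this thread.
enum { TAG_VARINT = 0, TAG_STRING = 2, TAG_OBJECT = 6, TAG_END = 7 };
constexpr int kTypeBits = 3, kNameBits = 12;

// Base-128 varint writer and reader (assumed wire format of the varuint calls).
void putVarUint(Bytes &out, uint64_t v) {
    for (; v >= 0x80; v >>= 7) out.push_back(static_cast<uint8_t>((v & 0x7F) | 0x80));
    out.push_back(static_cast<uint8_t>(v));
}
uint64_t getVarUint(const Bytes &buf, size_t &pos) {
    uint64_t result = 0;
    for (int shift = 0; shift < 64; shift += 7) {
        assert(pos < buf.size());
        const uint8_t byte = buf[pos++];
        result |= static_cast<uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) break;
    }
    return result;
}

// Packs a tag as the inverse of the Type()/Name()/Field() accessors.
void putTag(Bytes &out, int type, int name = 0, int field = -1) {
    putVarUint(out, uint32_t(type) | (uint32_t(name) << kTypeBits) |
                        (uint32_t(field + 1) << (kTypeBits + kNameBits)));
}
int tagType(uint32_t tag) { return static_cast<int>(tag & ((1u << kTypeBits) - 1)); }
int tagName(uint32_t tag) { return static_cast<int>((tag >> kTypeBits) & ((1u << kNameBits) - 1)); }

// Transportable form of {"id":1, "name":"Teddy", "rating":9}: every field is -1,
// so each value follows its tag. Names 1, 2, 3 are hypothetical TagsMatcher indexes.
Bytes encodeTeddy() {
    Bytes out;
    const std::string name = "Teddy";
    putTag(out, TAG_OBJECT);
    putTag(out, TAG_VARINT, 1);
    putVarUint(out, uint64_t(1) << 1);  // zigzag(1), assumed PutVarint encoding
    putTag(out, TAG_STRING, 2);
    putVarUint(out, name.size());       // assumed PutVString layout: length, then bytes
    out.insert(out.end(), name.begin(), name.end());
    putTag(out, TAG_VARINT, 3);
    putVarUint(out, uint64_t(9) << 1);  // zigzag(9)
    putTag(out, TAG_END);
    return out;
}

struct Decoded {
    int64_t id = 0;
    std::string name;
    int64_t rating = 0;
};

// Reads tag by tag until TAG_END, the same loop shape as skipCjsonTag.
Decoded decodeTeddy(const Bytes &buf) {
    Decoded item;
    size_t pos = 0;
    const uint32_t root = static_cast<uint32_t>(getVarUint(buf, pos));
    assert(tagType(root) == TAG_OBJECT);
    for (;;) {
        const uint32_t tag = static_cast<uint32_t>(getVarUint(buf, pos));
        if (tagType(tag) == TAG_END) break;
        if (tagType(tag) == TAG_VARINT) {
            const uint64_t z = getVarUint(buf, pos);
            const int64_t v = static_cast<int64_t>(z >> 1) ^ -static_cast<int64_t>(z & 1);  // undo zigzag
            if (tagName(tag) == 1) item.id = v; else item.rating = v;
        } else if (tagType(tag) == TAG_STRING) {
            const size_t len = static_cast<size_t>(getVarUint(buf, pos));
            assert(pos + len <= buf.size());
            item.name.assign(buf.begin() + pos, buf.begin() + pos + len);
            pos += len;
        }
    }
    return item;
}
```

Comparing the bytes this produces with a real buffer returned by item.GetCJSON() for the same JSON is a quick way to find out which of the assumptions hold.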

slowcheetahzzz commented 3 years ago

To make it work in CLion you need to open the folder reindexer/cpp_src - it will find CMakeLists.txt and prepare the project. To understand how to decode something, you first need to understand how to encode it - at least, this is how we do it in Reindexer, and I gave you all the hints for that. MongoDB has BSON (Binary JSON), we have CJSON (also a binary JSON); it is our own implementation, but there are probably analogs. Sorry, I won't explain every bit of B41F to you - the information I have provided is more than enough to understand it. There are Decoder examples in both Golang and C++ - you don't need to reinvent the wheel to make an analog in C#, just rewrite it properly. Good luck with that!