francoispqt / gojay

high performance JSON encoder/decoder with stream API for Golang
MIT License

Is there a way to get the raw bytes with the decoder + unmarshal interface #119

Open marcsantiago opened 5 years ago

marcsantiago commented 5 years ago

I have a complex data structure for which I have defined all the methods needed to satisfy the UnmarshalerJSONObject interface. One object within this structure is currently handled by the standard library's UnmarshalJSON interface, where I mutate/normalize the bytes before unmarshaling. I cannot use UnmarshalerJSONObject for this object because UnmarshalJSONObject(dec *gojay.Decoder, k string) does not expose the bytes.

I need a way to see whether the value for k is a string or a []string; if it is a string, I need to rewrite the bytes so that it becomes a []string.

Example of the UnmarshalJSON I use which does not work well with UnmarshalerJSONObject

func (a *Foo) UnmarshalJSON(b []byte) error {
    var rawData map[string]interface{}
    err := json.Unmarshal(b, &rawData)
    if err != nil {
        return nil
    }

    writeJSON := func(buf *bytes.Buffer, key string, value []interface{}) {
        buf.WriteString(`"` + key + `":[`)
        for i, v := range value {
            buf.WriteString(`"` + v.(string) + `"`)
            if i+1 < len(value) {
                buf.WriteByte(',')
            }
        }
        buf.WriteString(`]`)
    }

    // allocate the buffer upfront
    buf := bytes.NewBuffer(make([]byte, 0, len(b)))
    buf.WriteByte('{')

    i, keysN := 1, len(rawData)
    for key, value := range rawData {
        switch rawData[key].(type) {
        case []interface{}:
            writeJSON(buf, key, value.([]interface{}))
        case string:
            // handle the case where the SDK sends comma-separated values
            parts := strings.Split(value.(string), ",")
            if len(parts) == 1 && len(parts[0]) == 0 {
                parts = []string{}
            }

            // create an interface slice for the method, for the most part this will always be a slice of 1
            slice := make([]interface{}, len(parts))
            for i := 0; i < len(parts); i++ {
                slice[i] = parts[i]
            }
            writeJSON(buf, key, slice)
        }
        if i < keysN {
            buf.WriteByte(',')
            i++
        }
    }
    buf.WriteByte('}')

    // avoid infinite recursion, create a type alias
    type temp Foo
    var tempFoo temp
    err = json.Unmarshal(buf.Bytes(), &tempFoo)
    if err != nil {
        return nil
    }
    // mutate a
    *a = Foo(tempFoo)
    return nil
}

^ oh as an FYI on this example, nil error returns are on purpose. If this object fails to unmarshal, it shouldn't break all the other objects in the complex structure this belongs to.

Within UnmarshalJSONObject I use dec.Array, because all the fields of Foo are of type []string.

However, the value of the data can either be a single string, or a comma separated string, or an array. My custom unmarshaler handles all those permutations and ensures everything is of type []string to avoid a structure where the value is of type interface{}.

within the context of (s *Foo) UnmarshalJSONObject(dec *gojay.Decoder, k string) each value is defined to spec

switch k {
    case "bar":
        var aSlice = Strings{}
        err := dec.Array(&aSlice)
        if err == nil && len(aSlice) > 0 {
            s.Bar = []string(aSlice)
        }
        return err
....
}

However, dec.Array(&aSlice) does not allow for the data being a string. I tried calling dec.String() first and falling back to dec.Array() if err != nil, but calling String() moves the reader forward and skips the "bad" data, so the subsequent dec.Array() call fails. Calling dec.Array() on a string value also fails with an uncatchable error, invalidUnmarshalErrorMsg, which is not bubbled up to err := dec.Array(&aSlice), so one can't simply call dec.String() afterwards. And because I haven't found a way to work with the bytes or to call UnmarshalJSON within UnmarshalJSONObject, I can't get the performance boost from calling

decoder := gojay.BorrowDecoder(reader)
defer decoder.Release()
err = decoder.DecodeObject(&v)

Because the data will be invalid for object Foo as a result of not being able to handle string types.

That also means I don't gain a real performance boost when calling

decoder := gojay.BorrowDecoder(reader)
defer decoder.Release()
err = decoder.Decode(&v)

which uses the std lib, because the code always hits the case

case *interface{}:
    err = dec.decodeInterface(vt)

which uses the std lib underneath

if err = json.Unmarshal(object, i); err != nil {
  return err
}

Any thoughts?

BorisKozo commented 5 years ago

I was just going to post the same issue when I saw yours. I am trying to migrate my code from JSON Iterator to GoJay, except that in my case all of the objects are sent in the way you describe. My objects often have a "type" field and a "content" field. I unmarshal only the "type" field, decide which further processing is required, and then send the bytes of the "content" field to the code that handles this type of message. I don't understand how to do this in GoJay without sending the entire original JSON data to each function.

francoispqt commented 5 years ago

Hey, sorry for the latency!

So in the end, what you want to do is to unmarshal a value that could be either a json array of strings or a comma separated string into a go []string.

With the current state of gojay, I only see one solution: first unmarshal that value into a gojay.EmbeddedJSON, then check whether the first character is [ or " and do the unmarshaling accordingly.

Example:

func (f *Foo) UnmarshalJSONObject(dec *gojay.Decoder, k string) error {
    switch k {
    case "yourkey":
        // capture the raw JSON value so that its first byte can be inspected
        eb := make(gojay.EmbeddedJSON, 0, 128)
        if err := dec.EmbeddedJSON(&eb); err != nil {
            return err
        }
        if len(eb) == 0 {
            return nil
        }
        switch eb[0] {
        case '"':
            // decode string, then split it
            var s string
            if err := gojay.Unmarshal(eb, &s); err != nil {
                return err
            }
            f.V = strings.Split(s, ",")
        case '[':
            // decode array
            var aSlice = Strings{}
            err := gojay.Unmarshal(eb, &aSlice)
            if err == nil && len(aSlice) > 0 {
                f.V = []string(aSlice)
            }
            return err
        }
    }
    return nil
}

We could also add some methods to the decoder to tell what's the next data in the buffer. Something like:

switch dec.NextToken() {
    case gojay.TokenArray:
    case gojay.TokenString:
}
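To make the idea concrete, here is a rough sketch of how such a peek could classify the next value by its first non-whitespace byte without consuming any input. This is not gojay code: TokenKind, the constant names, and nextTokenKind are all hypothetical, and the function works on a plain byte slice rather than on the decoder's internal buffer.

```go
package main

import "fmt"

// TokenKind is a hypothetical enum; gojay does not export these names.
type TokenKind int

const (
	TokenInvalid TokenKind = iota // empty or whitespace-only input
	TokenString
	TokenArray
	TokenObject
	TokenOther // anything else (number, true, false, null); not validated
)

// nextTokenKind reports the kind of the first JSON value in b.
// It only looks at bytes and never advances a cursor.
func nextTokenKind(b []byte) TokenKind {
	for _, c := range b {
		switch c {
		case ' ', '\t', '\n', '\r':
			continue // skip JSON whitespace
		case '"':
			return TokenString
		case '[':
			return TokenArray
		case '{':
			return TokenObject
		default:
			return TokenOther
		}
	}
	return TokenInvalid
}

func main() {
	fmt.Println(nextTokenKind([]byte(` "a,b"`)) == TokenString)
	fmt.Println(nextTokenKind([]byte(`["a","b"]`)) == TokenArray)
}
```

Because the check only reads, the caller can still pick dec.Array or dec.String afterwards on untouched input.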

Let me know what you think

BorisKozo commented 5 years ago

What I actually want is not to unmarshal a field but leave it as bytes (or other internal type). Then send only this field to another decoder separately.

Here is my current code (there is another issue there that I unmarshal twice, I am going to fix it)

    header := DataLayer.MessageHeader{}
    var data map[string]jsoniter.RawMessage
    err := Json.JsonPaser.Unmarshal(message, &data)
    if err != nil {
        Log.Error(err)
        return nil, nil
    }

    err = Json.JsonPaser.Unmarshal(message, &header)
    if err != nil {
        Log.Error(err)
        return nil, nil
    }

    return &header, data["content"]

The data is something like :

{
  "messageId": 1,
  "sender": "sender name",
  "type": "order",
  "content": {
    //order details
  }
}

I can unmarshal just the first 3 fields and according to the type I can send the content bytes to the order handling function that will unmarshal it separately.

This is how I do it... Not saying this is the best way but I cannot easily switch to GoJay because I cannot do this with GoJay.
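For reference, the pattern described above can be done in a single pass with the standard library by embedding the header in a wrapper struct and capturing "content" as raw bytes. This is only a sketch with encoding/json, not gojay or jsoniter. The MessageHeader fields are guessed from the sample message, and envelope and splitMessage are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MessageHeader mirrors the first three fields of the sample message.
// The field names and types are assumptions based on that sample.
type MessageHeader struct {
	MessageID int    `json:"messageId"`
	Sender    string `json:"sender"`
	Type      string `json:"type"`
}

// envelope decodes the header and keeps "content" as undecoded bytes.
type envelope struct {
	MessageHeader
	Content json.RawMessage `json:"content"`
}

// splitMessage parses the message once and returns the header together
// with the raw bytes of the content field.
func splitMessage(message []byte) (*MessageHeader, json.RawMessage, error) {
	var env envelope
	if err := json.Unmarshal(message, &env); err != nil {
		return nil, nil, err
	}
	return &env.MessageHeader, env.Content, nil
}

func main() {
	msg := []byte(`{"messageId":1,"sender":"sender name","type":"order","content":{"sku":"x"}}`)
	header, content, err := splitMessage(msg)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	// dispatch on header.Type, then hand content to the matching handler
	fmt.Println(header.Type, string(content))
}
```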

marcsantiago commented 5 years ago

For my situation, this may work very well! Going to give it a try today. I just needed a way to normalize the data types to [], which you saw I was doing with raw bytes before.


marcsantiago commented 5 years ago

@francoispqt given that the data can be very dynamic in terms of size, the first solution can be very hard to use

eb := make(gojay.EmbeddedJSON, 0, 128) // 128 best guess?
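One note on the capacity guess: in Go, the third argument to make is only an initial capacity, and a slice filled with append grows past it as needed. The sketch below shows that with a plain []byte. Whether gojay's decoder fills an EmbeddedJSON via append is an assumption here that should be checked against the gojay source; fill is a hypothetical helper.

```go
package main

import "fmt"

// fill appends n bytes to a slice created with the given initial capacity
// and returns the result. The slice is reallocated by append when needed.
func fill(initialCap, n int) []byte {
	buf := make([]byte, 0, initialCap)
	for i := 0; i < n; i++ {
		buf = append(buf, 'x')
	}
	return buf
}

func main() {
	buf := fill(128, 1000)
	// the length reflects the data written, not the initial capacity
	fmt.Println(len(buf), cap(buf) >= 1000)
}
```

Under that assumption, a low guess costs extra reallocations and a high guess costs unused memory, but neither truncates the data.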

If the string or array has an unknown size, it is very easy to under- or over-allocate. I really like the idea of adding

switch dec.NextToken() {
    case gojay.TokenArray:
    case gojay.TokenString:
}

It keeps things simple without adding much complexity; example below. What do you think?

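// NextToken, TokenArray and TokenString are the proposed API; they do not exist in gojay yet
switch dec.NextToken() {
    case gojay.TokenArray:
        var aSlice = Strings{}
        if err := dec.Array(&aSlice); err != nil {
            return err
        }
        s.Bar = []string(aSlice)
    case gojay.TokenString:
        // str avoids shadowing the receiver s
        var str string
        if err := dec.String(&str); err != nil {
            return err
        }
        s.Bar = strings.Split(str, ",")
}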

Then we could just reuse the underlying logic of dec.Array(&aSlice) for the array case.

marcsantiago commented 5 years ago

@francoispqt I was wondering whether any more thought was given to

switch dec.NextToken() {
    case gojay.TokenArray:
    case gojay.TokenString:
}