Closed benluddy closed 5 months ago
Hi @benluddy, thanks for opening this issue!
Yes, I agree the current interface of TagSet doesn't support this.
I'm open to extending TagSet without breaking backward compatibility. Adding decoding and encoding options would also work.
I'd need to look into this in order to have a preference. Do you have a preference between extending TagSet or adding decoding/encoding options?
I've looked at this problem more and no longer think tags are sufficient to reach drop-in compatibility with encoding/json when dealing with []byte.
Say that CBOR is configured to encode []byte with tag 22 and to honor that tag when decoding into interface{} or []byte.
This gives JSON-compatible results for:
- []byte-to-CBOR-to-[]byte (this path is compatible today)
- []byte-to-CBOR-to-interface{}
- []byte-to-CBOR-to-string
...but not for string-to-CBOR-to-[]byte. If a client uses JSON to serialize map[string]interface{}{"Bytes": "aGVsbG8gd29ybGQ="} and sends it to a server that decodes into a struct{Bytes []byte}, the server will see []byte("hello world") in the Bytes field. The output of a CBOR encoder dropped into the same client would be read by the server as []byte("aGVsbG8gd29ybGQ=").
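To make the mismatch concrete, here is a minimal sketch using only the standard library. The JSON path is real; the CBOR path is only modeled by copying the string's bytes with no base64 step, which is the outcome described above. The names server, viaJSON, and withoutConversion are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// server is a hypothetical destination type on the receiving side.
type server struct {
	Bytes []byte
}

// viaJSON sends the client's untyped value through encoding/json and decodes
// it into the server struct, which base64-decodes the string into Bytes.
func viaJSON(client map[string]interface{}) ([]byte, error) {
	wire, err := json.Marshal(client)
	if err != nil {
		return nil, err
	}
	var s server
	if err := json.Unmarshal(wire, &s); err != nil {
		return nil, err
	}
	return s.Bytes, nil
}

// withoutConversion models (it does not run a CBOR codec) what the server
// would see if the text string were copied into []byte with no base64 step.
func withoutConversion(client map[string]interface{}) []byte {
	return []byte(client["Bytes"].(string))
}

func main() {
	client := map[string]interface{}{"Bytes": "aGVsbG8gd29ybGQ="}

	got, err := viaJSON(client)
	if err != nil {
		panic(err)
	}
	fmt.Printf("JSON path:          %q\n", got)
	fmt.Printf("no-conversion path: %q\n", withoutConversion(client))
}
```

The two results differ, which is why a decode-side option would be needed in addition to the tag.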
So there would also need to be a decode option that assumes untagged CBOR strings contain base64-encoded data when decoding into a []byte. Strings with tag 22 would not need to be decoded, so the option would depend on tag 22 being one of the built-in tags.
Hi @fxamacker, there were enough details to consider here that I went ahead and implemented a POC (https://github.com/fxamacker/cbor/pull/476). It ended up fairly close to what I described in my last comment. Please take a look when you're able. I'd like to arrive at an approach you're happy with before implementing full test coverage in my branch. This is the gist of the approach in the POC:
A CBOR encoder that is aware of the text format it will interoperate with can configure any (or none) of the expected later encoding tags to be automatically applied whenever a Go []byte is encoded to a byte string (e.g. []byte("hello world") might encode as 22('hello world')). This is controlled by a new encode option, ByteSliceMode.
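For reference, this is roughly what 22('hello world') looks like on the wire. The sketch below hand-encodes it per RFC 8949 rather than calling the library, so it says nothing about how ByteSliceMode is implemented in the POC. tag22Hex is a hypothetical helper and only handles payloads short enough for the length to fit in the initial byte.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// tag22Hex is a hypothetical helper, not part of fxamacker/cbor. It wraps a
// short byte string in tag 22 (expected later encoding: base64) and returns
// the resulting CBOR as a hex string.
func tag22Hex(b []byte) (string, error) {
	if len(b) > 23 {
		return "", fmt.Errorf("sketch only handles payloads up to 23 bytes, got %d", len(b))
	}
	out := make([]byte, 0, len(b)+2)
	out = append(out, 0xc0|22)           // major type 6 (tag), tag number 22
	out = append(out, 0x40|byte(len(b))) // major type 2 (byte string), inline length
	out = append(out, b...)
	return hex.EncodeToString(out), nil
}

func main() {
	s, err := tag22Hex([]byte("hello world"))
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // d64b68656c6c6f20776f726c64
}
```

Note that the payload itself is still the raw bytes; the tag only records how a later text-format conversion should render them.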
The same CBOR encoder might also be asked to encode an interface{} that was itself the output of a text format decoder, like encoding/json. Any Go string might have originally been produced by applying a text encoding to a []byte (e.g. []byte("hello world") might encode to the JSON string "aGVsbG8gd29ybGQ=" and decode back to the Go string "aGVsbG8gd29ybGQ="). The CBOR encoder has no way to recognize that a Go string in its input represents, as in the example, the base64 encoding of []byte("hello world").
A corresponding decoder needs to be able to handle the CBOR produced in both of the above cases appropriately whether decoding into a []byte or a string. To interoperate with encoding/json across struct and interface{} values, the desired decoder behavior is:
CBOR | destination type | destination value | conversion |
---|---|---|---|
22('hello world') | string | "aGVsbG8gd29ybGQ=" | encode |
'hello world' | string | "hello world" | none |
22('hello world') | []byte | []byte("hello world") | none |
'aGVsbG8gd29ybGQ=' | []byte | []byte("hello world") | decode |
This is made configurable with two proposed decode options. First, TextConversions func(reflect.Type) TextConversionMode, which selects a text conversion (encode, decode, or none) based on destination type. Second, DefaultTextEncoding TextEncoding, which specifies a particular text encoding (base64url, base64, base16, or none) to assume for untagged byte strings when the text conversion mode is decode.
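The four rows of the table can be sketched as a small decision function. This is only an illustration of the desired behavior, not the POC's code: conversion, choose, and convert are hypothetical names, the destination type is reduced to a bool, and standard base64 is hardcoded where the proposal would take the encoding from the tag or from DefaultTextEncoding.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// conversion mirrors the "conversion" column of the table (hypothetical type).
type conversion int

const (
	none conversion = iota
	encode
	decode
)

// choose picks the conversion for a CBOR byte string, given whether it
// carried tag 22 and whether the destination is a string (true) or []byte (false).
func choose(tagged, destIsString bool) conversion {
	switch {
	case tagged && destIsString:
		return encode // tagged bytes into a string: produce base64 text
	case !tagged && !destIsString:
		return decode // untagged content into []byte: assume base64 text
	default:
		return none
	}
}

// convert applies the chosen conversion to the payload using standard base64.
func convert(tagged, destIsString bool, payload string) (string, error) {
	switch choose(tagged, destIsString) {
	case encode:
		return base64.StdEncoding.EncodeToString([]byte(payload)), nil
	case decode:
		raw, err := base64.StdEncoding.DecodeString(payload)
		return string(raw), err
	default:
		return payload, nil
	}
}

func main() {
	rows := []struct {
		tagged, destIsString bool
		payload              string
	}{
		{true, true, "hello world"},
		{false, true, "hello world"},
		{true, false, "hello world"},
		{false, false, "aGVsbG8gd29ybGQ="},
	}
	for _, r := range rows {
		out, err := convert(r.tagged, r.destIsString, r.payload)
		if err != nil {
			panic(err)
		}
		fmt.Printf("tagged=%v string=%v %q -> %q\n", r.tagged, r.destIsString, r.payload, out)
	}
}
```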
The test in the POC round-trips from []byte to interface{} and back using both CBOR (with the configuration described above) and encoding/json, verifying that the intermediate interface{} values are identical in both cases and that the final values in both cases are identical to the original value.
Thanks again for the detailed write up! I shared some thoughts in PR #476.
The draft PR and round-trip tests were really helpful! :+1:
Thanks Ben! Closed by #476.
Is your feature request related to a problem? Please describe.
This request comes from a similar use case as https://github.com/fxamacker/cbor/issues/446. Essentially, Go struct objects are being serialized using encoding/json, transmitted to another program that does not have access to the definitions of the Go struct types, and deserialized (again using encoding/json) into an empty interface. I am working to support CBOR as a compatible alternative to the existing JSON encoding.
Currently, there's an incompatibility when dealing with Go fields of type []byte. The behavior of encoding/json is: marshaling []byte produces a JSON string containing the base64 encoding of the slice contents. Unmarshaling this back into a []byte does the reverse, transparently decoding the base64 string into the original bytes. Unmarshaling into an empty interface produces a Go string containing the base64 encoding.
As expected, CBOR marshaling doesn't perform the base64 encoding or decoding, since CBOR provides distinct byte string and text string types. It also preserves that distinction when decoding into an empty interface value and produces a []byte.
https://go.dev/play/p/n8nnk-HnHGi
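The encoding/json half of that behavior can be reproduced with the standard library alone. This is a minimal sketch and not necessarily the same program as the playground link; payload and the helper functions are hypothetical names used only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// payload is a hypothetical struct with a single []byte field.
type payload struct {
	Bytes []byte
}

// marshalPayload serializes the bytes with encoding/json, which writes the
// field as a base64 string.
func marshalPayload(b []byte) string {
	out, err := json.Marshal(payload{Bytes: b})
	if err != nil {
		panic(err)
	}
	return string(out)
}

// decodeTyped unmarshals into the struct, so the base64 string is
// transparently decoded back into the original bytes.
func decodeTyped(doc string) string {
	var p payload
	if err := json.Unmarshal([]byte(doc), &p); err != nil {
		panic(err)
	}
	return string(p.Bytes)
}

// decodeUntyped unmarshals into an empty interface, so the field stays a Go
// string holding the base64 text.
func decodeUntyped(doc string) interface{} {
	var v interface{}
	if err := json.Unmarshal([]byte(doc), &v); err != nil {
		panic(err)
	}
	return v.(map[string]interface{})["Bytes"]
}

func main() {
	doc := marshalPayload([]byte("hello world"))
	fmt.Println(doc)              // {"Bytes":"aGVsbG8gd29ybGQ="}
	fmt.Println(decodeTyped(doc)) // hello world
	fmt.Printf("%T %v\n", decodeUntyped(doc), decodeUntyped(doc))
}
```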
Describe the solution you'd like
RFC 8949 (in https://www.rfc-editor.org/rfc/rfc8949.html#section-3.4.5.2) specifies several tags for "expected later encoding" that an encoder may attach to byte strings to communicate how the string should be converted to JSON. I would like to be able to optionally configure the CBOR encoder to automatically apply tag 22 when it serializes a Go []byte to a CBOR byte string, and optionally configure the decoder to honor the tag when decoding into an empty interface value.
Support for encoding expected later encoding tags could be controlled by an EncOption that sets a single (or no) expected later encoding for any encoded []byte. It might be interesting instead to infer tag 22 automatically when encoding struct fields of type []byte that have json field tags, but users would still reasonably expect an option to disable it, so I don't think there's a real upside there.
A new DecOption would control the behavior of decoding expected later encoding tags into empty interface values and into Go strings.
Describe alternatives you've considered
I'd like to be able to implement this using TagSet, but I don't think it's possible with the current interface.
Additional context