fxamacker / cbor

CBOR codec (RFC 8949) with CBOR tags, Go struct tags (toarray, keyasint, omitempty), float64/32/16, big.Int, and fuzz tested billions of execs.
MIT License

feature: inspect first/outer "kind" without full decode #440

Open extemporalgenome opened 12 months ago

extemporalgenome commented 12 months ago

Is your feature request related to a problem? Please describe.

A nice property of the json.RawMessage design is that it's fairly trivial to safely inspect the broad kind of JSON data with:

// A properly decoded json.RawMessage always
// starts with a non-space token byte.
switch theRawMessage[0] {
case '{': // object
case '[': // array
case '"': // string
case 'n': // null
case 'f': // false
case 't': // true
default:  // number
}

This can also be done by inspecting the leading bytes of a cbor.RawMessage, but there are many more possible leading bytes, and they're much less memorable. In practice, the application would need to implement a partial CBOR decoder to work around this package not providing kind detection as a cheap capability.
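To illustrate how much more is involved than in the JSON case, here is a minimal sketch of such a partial decoder. It classifies an item from its initial byte alone, following the major type and additional information layout in RFC 8949, section 3. The name leadingKind and the returned strings are hypothetical and not part of this package.

```go
package main

import "fmt"

// leadingKind is a hypothetical helper (not part of fxamacker/cbor). It
// classifies a CBOR data item from its initial byte only: the high 3 bits
// are the major type, the low 5 bits are the additional information.
func leadingKind(raw []byte) string {
	if len(raw) == 0 {
		return "invalid"
	}
	major, info := raw[0]>>5, raw[0]&0x1f
	if info >= 28 && info <= 30 {
		return "invalid" // reserved additional information values
	}
	switch major {
	case 0, 1, 6:
		if info == 31 {
			return "invalid" // no indefinite-length form for ints or tags
		}
		return [...]string{0: "unsigned int", 1: "negative int", 6: "tag"}[major]
	case 2:
		return "byte string"
	case 3:
		return "text string"
	case 4:
		return "array"
	case 5:
		return "map"
	}
	// Major type 7 mixes simple values and floats.
	switch info {
	case 20, 21:
		return "bool"
	case 22:
		return "null"
	case 23:
		return "undefined"
	case 25, 26, 27:
		return "float"
	case 31:
		return "invalid" // a "break" stop code cannot start a data item
	default:
		return "simple value"
	}
}

func main() {
	fmt.Println(leadingKind([]byte{0xa1, 0x01, 0x02}))    // the map {1: 2}
	fmt.Println(leadingKind([]byte{0xf9, 0x3c, 0x00}))    // float16 1.0
	fmt.Println(leadingKind([]byte{0x63, 'f', 'o', 'o'})) // the text "foo"
}
```

Even this sketch only looks at one byte and does not check that the rest of the item is well-formed.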

Decoding into any to just check the kind is often undesirable because:

  1. It's expensive, especially in terms of garbage.
  2. The contract is not stable, and it is hard to account for exhaustively with type assertions, since DecOptions can yield uint64 vs int64 variations, many possible map and slice combinations, etc. Use of reflect provides more stability, but is unwieldy.

Describe the solution you'd like

Introduce a cbor.Kind type, with values like cbor.KindInt. It's unclear if distinctions between int vs uint vs big int, or the different size variants, should be represented, though bit field style constants (e.g. cbor.KindNumber = cbor.KindInt | cbor.KindFloat | ..., cbor.KindInt = cbor.KindInt8 | ...), or helper methods (func (Kind) IsNumber() bool) could address this.
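A rough sketch of the bit field idea, to make the composite constants concrete. Every name below is hypothetical, and the particular split (uint, negative int, big int, three float widths) is just one assumed granularity; none of this exists in the package today.

```go
package main

import "fmt"

// Kind is a hypothetical bit-field classification; these names are
// illustrative only and are not part of fxamacker/cbor.
type Kind uint32

const (
	KindUint Kind = 1 << iota
	KindNegInt
	KindBigInt
	KindFloat16
	KindFloat32
	KindFloat64
	KindBytes
	KindText
	KindArray
	KindMap
	KindBool
	KindNull

	// Composite masks built from the single-bit kinds above.
	KindInt    = KindUint | KindNegInt | KindBigInt
	KindFloat  = KindFloat16 | KindFloat32 | KindFloat64
	KindNumber = KindInt | KindFloat
)

// Is reports whether k is non-zero and fully contained in mask.
func (k Kind) Is(mask Kind) bool { return k != 0 && k&mask == k }

// IsNumber reports whether k is any integer or float kind.
func (k Kind) IsNumber() bool { return k.Is(KindNumber) }

func main() {
	k := KindFloat32
	fmt.Println(k.IsNumber(), k.Is(KindInt), k.Is(KindFloat)) // true false true
}
```

With this shape, callers that only care about "is it a number" can ignore the finer distinctions, while callers that need them can still compare against the single-bit values.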

A function such as func DetectKind([]byte) (Kind, error) could be used to obtain a Kind value. If a const KindInvalid Kind = 0 is available, then such a function would not need to return an error.

A companion DetectTagKind function, which returns a (uint64, Kind) pair or similar, may also be useful.
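As a sketch of how the two functions might fit together, the following uses a coarse Kind equal to the CBOR major type plus one, so that the zero value can act as KindInvalid and no error return is needed. The signatures match the ones proposed above, but the implementation, the coarse constants, and the readHead helper are all assumptions for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Kind is a coarse, hypothetical classification: the CBOR major type plus
// one, which leaves the zero value free to mean KindInvalid.
type Kind uint8

const (
	KindInvalid Kind = iota
	KindUint
	KindNegInt
	KindBytes
	KindText
	KindArray
	KindMap
	KindTag
	KindSimpleOrFloat
)

// DetectKind looks only at the initial byte. It does not check that the
// item is well-formed.
func DetectKind(raw []byte) Kind {
	if len(raw) == 0 {
		return KindInvalid
	}
	return Kind(raw[0]>>5) + 1
}

// readHead decodes the argument of the head at the start of raw and returns
// it with the number of bytes consumed (0 if truncated or unsupported).
func readHead(raw []byte) (arg uint64, n int) {
	if len(raw) == 0 {
		return 0, 0
	}
	info := raw[0] & 0x1f
	switch {
	case info < 24:
		return uint64(info), 1
	case info == 24 && len(raw) >= 2:
		return uint64(raw[1]), 2
	case info == 25 && len(raw) >= 3:
		return uint64(binary.BigEndian.Uint16(raw[1:])), 3
	case info == 26 && len(raw) >= 5:
		return uint64(binary.BigEndian.Uint32(raw[1:])), 5
	case info == 27 && len(raw) >= 9:
		return binary.BigEndian.Uint64(raw[1:]), 9
	}
	return 0, 0
}

// DetectTagKind returns the outermost tag number and the kind of the tagged
// content, or KindInvalid if raw does not start with a complete tag head.
func DetectTagKind(raw []byte) (uint64, Kind) {
	if DetectKind(raw) != KindTag {
		return 0, KindInvalid
	}
	tag, n := readHead(raw)
	if n == 0 {
		return 0, KindInvalid
	}
	return tag, DetectKind(raw[n:])
}

func main() {
	// Tag 2 (unsigned bignum) wrapping a 1-byte byte string: c2 41 01.
	tag, kind := DetectTagKind([]byte{0xc2, 0x41, 0x01})
	fmt.Println(tag, kind == KindBytes) // 2 true
}
```

For nested tags this sketch reports KindTag as the content kind, leaving it to the caller to call DetectTagKind again on the remaining bytes.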

Describe alternatives you've considered

It seems there is a branch or effort to expose a streaming tokenizer. If so, that could provide equivalent functionality, where the above case would merely involve a peek at the next token, potentially followed by a normal decode or token consumption.

fxamacker commented 11 months ago

@extemporalgenome Thanks for opening this issue! This sounds useful and makes sense. I'll look into it this month.