Open jagill opened 1 month ago
I think this might actually be a feature request to perform string -> bool coercion which we don't currently? I don't think this is specific to MapArray, I'm also not entirely sure what the semantics should be.
In particular is "True"
also valid or is it a parse error? What about "1"
? The semantics are not obvious to me
First, I'm happy to classify this however is useful. However, when I read your response, I wonder if I didn't communicate my issue clearly.
I don't think this is specific to MapArray
This is actually specific to MapArray.
I am not talking about general string -> bool conversion. The issue is that in JSON, the only syntactically valid way to represent a Map<bool, T> is similar to {"true": T}
or {"false": T}
. The JSON-esque {true: T}
or {false: T}
is not syntactically valid; all JSON objects must have string keys. The only other place that the key type comes into play is StructArray, which requires its keys to be strings and thus doesn't suffer from this.
To be clear, I am not asking that a Map<string, bool> accept {"foo": "true"}
. In the value position, you can have arbitrary JSON values, so {"foo": true}
is valid and should be the only way to represent a Map<string, bool>.
If you check out my gist, you'll see that arrow_json already does the key-string conversion for ints. To be specific, the only valid way to represent a Map<i32, i32> is like {"1": 2}
({1: 2}
is not syntactically valid). arrow_json does the string parsing for the key position for (non-bool) primitives. My issue is that this is not currently done for bool keys. This means that it is impossible to represent a Map<bool, T> in a way that is both valid JSON and is parsable by arrow_json.
I can certainly see that you might phrase this as "arrow_json does not currently support Maps with bool keys; adding that support is a feature request." In that case, it would be helpful to specify which types are not supported by arrow_json in the documentation, so people know whether it's a bug or a feature request :).
In particular is "True" also valid or is it a parse error? What about "1" ? The semantics are not obvious to me
IMO, the semantics should be that when parsing Map(K, V), that the JSON string in the key position is parsed with <K as FromStr>::from_str
, although I don't have a strong opinion about which parser to use. This is already done for int/float keys.
Does that clarify things?
I understand what you're saying, however, arrow-json does not distinguish between the case of decoding a boolean from a key and decoding a boolean from a value, the decoders simply operate on the fields - https://github.com/apache/arrow-rs/blob/master/arrow-json/src/reader/map_array.rs#L131. I also personally think it would be rather odd for the decoders to treat these two cases differently.
I'm not really sure what the best solution here is, we certainly could support coercing strings to booleans, perhaps gated by a feature flag, but I'm not really sold on this. Perhaps you could expand upon the use-case for this? I can't help feeling a boolean keyed MapArray is a really inefficient way to represent what could simply be a StructArray with two nullable fields of true
and false
respectively.
As an aside you raise the point of integers parsing from strings, this is largely because many systems encode integers and floats as strings to get around precision ambiguities in JSON parsers - with some treating all numbers as double-precision floating point.
Thanks for the clarification, and apologies for a long-winded explanation of something you already understood.
As prior art, serde_json (the standard json deserializer) allows "true" in the key position but not the value position: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=dea421d1f8c2075ee9ed791155a516db serde_json has a whole internal lexer/parser which may be more than arrow_json wants. As an aside, serde has an internal IR that we could imagine using for arrow_json; that is, serde_json makes the IR then arrow_json builds arrays from that IR. I suspect that refactor is way out of scope for this, so I'm not suggesting that, per se, but it'd be an interesting direction to consider if it reduced code/maintenance.
As for why I care about Map<bool, T>, it's because it's a valid SQL type and I maintain a Rust client for Presto. Presto returns its results in JSON (with a schema) (more details at https://prestodb.io/docs/current/develop/client-protocol.html). When users want the results in Arrow RecordBatches, I use arrow_json to read the JSON into Arrow. Thus I don't control whether the data contains Map<bool, T> (the users do), nor do I control the return format (syntactically valid JSON). Currently Map<bool, T> is supported for non-arrow, but not for arrow, due to the limitation in this issue. I was hoping to bring feature parity to the arrow return format.
serde_json makes the IR then arrow_json builds arrays from that IR.
We moved away from serde_json both for performance and to be able to better handle arbitrary precision values. We used to do something similar to this.
I think given that Map<bool>
whilst valid is incredibly niche, serde_json only added support at your request last year, and using a StructArray with two fields will be orders of magnitude faster and more space efficient, I'd be inclined to hold off on this until we have a concrete user use-case to justify this
Thanks for the quick response!
Out of curiosity, do you have the benchmarks for the serde_json vs arrow_json performance? It would be interesting to me for my own client.
https://github.com/apache/arrow-rs/pull/3479 and #5318.
Describe the bug JSON objects must have string keys. So when parsing to maps with non-string key types, the string must be coerced into the type. For example, using a schema with a Map(Int32, Int32) type parses rows like
{"map_i32_i32": {"1": 2}}
into a RecordBatch with a single column with a Map(Int32, Int32) type.However, this does not happen for bool key maps. If you try to use a schema with a Map(Boolean, Int32) type, parsing
{"map_bool_i32": {"true": 1}}
raises an error:To Reproduce Gist https://gist.github.com/jagill/1dd71ff6413002a042089feca89c6eab has a minimal reproduction.
Expected behavior
map_bool_i32
should parse to a Map(Boolean, Int32) array.Additional context Similar bug report for serde_json, with a link to the fix PR: https://github.com/serde-rs/json/issues/1054