flavray / avro-rs

Avro client library implementation in Rust
MIT License
169 stars 95 forks source link

Map JSON Objects to Record instead of Map #130

Open nschoellhorn opened 4 years ago

nschoellhorn commented 4 years ago

I noticed that the implementation for ToAvro on the serde_json Value type maps values of type Object to the Avro value Map instead of the complex type Record. Is there a specific reason for that? As far as I can tell, the Record type for Avro schemas is much more common in the definition than maps. Could we change the implementation to map Record instead? I'd be happy to contribute that if there's nothing that blocks us from doing so.

poros commented 4 years ago

It's been a few years since the original implementation, but I think that the rationale was that Object in JSON is very generic and not schematized, so we thought that Map (with the key being a string, like in JSON) was a closer fit than Record. I think that making it a Record might be pretty tricky, with fields that could be missing from one instance to the other.

Perhaps @flavray remembers this detail more clearly, but I am not quite sure.

nschoellhorn commented 4 years ago

For this problem, can we maybe make use of the Avro schemas? If we let the user provide one when going from JSON value to Avro value, we could get from the schema if it's defined as a record?

dcreager commented 4 years ago

I think you could, but I agree with @poros that it's not really the right thing to do. Using an Avro map means that you've actively decided that you need to support a possibly different set of keys in each instance. If you have a specific structure, you'd use a record instead.

It's the same on the serde_json side. Using serde_json::Value indicates that you want to support a possibly different set of keys in each instance. If you have a specific structure, you'd create a custom Rust struct for it. JSON doesn't have a map/record distinction like Avro does, so when serializing to JSON serde has to create a JSON object for both use cases. But since Avro does have that distinction, we can use the Avro type that lines up with the intended use of the Rust type.

nschoellhorn commented 4 years ago

Hmm, yeah, I understand what you are saying. What would be the option to go from user input (which is completely dynamic) to an Avro Value? Like, user specifies a schema they want to follow and then input a document of some kind and I want to convert it to Avro and send it out?

Currently, I am taking the user input as JSON, which is how I came up with this problem, but maybe there is a better solution to handle this with this library?

dcreager commented 4 years ago

Ahh yes, now I see what you're saying! For that, Avro defines a JSON encoding to go along with the binary encoding. And some of the other language bindings have a helper tool that reads in a bunch of JSON values and converts them to a binary Avro file according to some schema. The exact same JSON input can give you completely different Avro output files depending on which schema you choose.

I don't think that's implemented in this library just yet, but I think you'd do it by adding something similar to avro_rs::Reader that "reads" from a serde_json::Value instead of from a std::io::Reader. So it would be kind of like your original suggestion, but you'd be able to translate the JSON Value into either an Avro map or record, depending on your use case, by choosing what kind of Avro-y Rust type you read into.

nschoellhorn commented 4 years ago

Interesting approach and thoughts, thank you for your input. I'm gonna take a look into that after work :) maybe I can come up with something that's worth a PR. I'll keep this issue updated.