Avro with interface fields?

FasterXML / jackson-dataformats-binary

Uber-project for standard Jackson binary format backends: avro, cbor, ion, protobuf, smile

Apache License 2.0


Avro with interface fields? #101

Closed tbanyai closed 6 years ago

tbanyai commented 6 years ago

Hello!

First of all, I am aware that the documentation says Avro does not work well with polymorphic type handling, but I decided to try my luck. I use Redis for storing serialized data. Jackson/JSON is currently used for that, but memory has become an issue, and since the code uses only a limited set of JSON annotations (JsonIgnore, JsonView and JsonTypeInfo), Avro seems to be the perfect choice.

A few classes need to be serialized/deserialized. They are almost POJOs, except that some fields have interface types annotated with JsonTypeInfo, similar to this:

@JsonTypeInfo( 
    use = JsonTypeInfo.Id.NAME,
    include = JsonTypeInfo.As.PROPERTY,
    property = "clazz")
interface MySubClassInterface{
    ...
}

public class SubClass implements MySubClassInterface{
    ...
}

public class MainClass{
    ...
    @JsonView(MainClass.Detailed.class)
    MySubClassInterface p1=new SubClass();
    ...
}

This works well with JSON, but Avro fails with: JsonFormatException: Issue converting to avro (MainClass@4ec6a292) : No field named 'clazz' (through reference chain: MainClass["p1"])

Is there a way to make this work somehow?

cowtowncoder commented 6 years ago

I am not 100% sure. I think it may be possible to make it work, but only if the type identifier field is defined in the Avro schema. So you would probably need to construct the schema yourself, instead of generating it. However... I wonder if it might be possible to enhance schema generation to actually expose the type discriminator. I don't know if that is possible currently (because the introspection code effectively hides this information). I think what I can do is to maybe modify the title here, and indicate that the issue is about trying to make the use case you point to work better.

Having said that, I have an alternative idea for your actual need here; will add it as a separate comment.

cowtowncoder commented 6 years ago

So: for a more efficient representation of cached objects, I would suggest trying out "serialize POJOs as Arrays", explained for example here:

https://medium.com/@cowtowncoder/serializing-java-pojo-as-json-array-with-jackson-2-more-compact-output-510a85c019d4

AND using Smile format (or CBOR). Combining the two should produce serialization that is quite close to Avro lengths (although not necessarily exactly as small; this depends on details), but allowing wider range of functionality to be used.

The only concern I have is as to how well polymorphic handling works with as-array output (ironically enough as I am offering it to be used...). But one thing worth noting is that shape can be defined separately for different types and fields, so you can often use compact notation just for parts of data model, where savings are most important.
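A minimal sketch of the as-array idea above, assuming a hypothetical value type `CachedUser` (the class and field names are illustrative and not from this thread). It uses a plain JSON `ObjectMapper` so the output is readable; the same annotated type should work unchanged with a mapper built on a binary factory such as `CBORFactory` or `SmileFactory` from the respective modules.

```java
import com.fasterxml.jackson.annotation.JsonFormat;
import com.fasterxml.jackson.annotation.JsonPropertyOrder;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical value type; names are illustrative only.
@JsonFormat(shape = JsonFormat.Shape.ARRAY)  // write values positionally, without field names
@JsonPropertyOrder({"id", "name"})           // pin the order, since position now carries the meaning
class CachedUser {
    public int id;
    public String name;

    CachedUser() { }                         // used by Jackson when deserializing

    CachedUser(int id, String name) {
        this.id = id;
        this.name = name;
    }
}

class AsArrayDemo {
    // Plain JSON mapper for readability. Passing a binary JsonFactory to the
    // ObjectMapper constructor changes the encoding, not the annotated type.
    static final ObjectMapper MAPPER = new ObjectMapper();

    static String write(CachedUser user) {
        try {
            return MAPPER.writeValueAsString(user);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static CachedUser read(String json) {
        try {
            return MAPPER.readValue(json, CachedUser.class);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String json = write(new CachedUser(7, "ann"));
        System.out.println(json);             // [7,"ann"]
        System.out.println(read(json).name);  // ann
    }
}
```

Because field names are dropped, readers and writers must agree on property order, which is why the order is pinned explicitly here.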

Also possible: use of the protobuf module. It requires a schema, but is a bit more flexible. Polymorphic handling still requires exposing the type-id field explicitly, however.

tbanyai commented 6 years ago

Thank you for your suggestions. Protobuf unfortunately gives me an error of Map/String/... can not be root object (always the type of the foremost field in the POJO), with version 2.8.8. I'll further investigate if it's my mistake or not.

The CBOR+serialization as-array sounds very interesting and feasible for my case, however the problem there is that it doesn't go inside class fields unless annotated. But then the XML for dumping results to file and JSON for servlet responses break. Is there a configuration option that globally turns on serializing as-arrays, but only for that ObjectMapper?

In fact, I am wondering whether it would be possible to override the default behaviour of the Avro or Protobuf serializer so that, when it encounters a field which is an interface (or has a JsonTypeInfo annotation, ...), the serializer looks up/generates the schema of the actual class of that field and recursively calls the serializer? (At that point encoding the class type would be easy I guess, and of course with a similar mechanism on the deserializer side.) The data I need to maintain are isolated sets of 1 main/root class (which the serializer is called on) having a structure of ~10-20 nested classes a few levels deep with ~2-5 interfaces here and there, so having a mapper per set along with a map storing the generated schemas is not a performance killer.

cowtowncoder commented 6 years ago

Serialize-as-array can be configured on a per-type basis, using the 2.8 feature "configOverrides()". Alternatively you could use mix-in annotations too. Or, if you can have a shared base class / interface that value types extend, you could specify @JsonFormat overrides via mix-in annotations (configOverrides() only works for the exact type, unfortunately, and not via supertypes).
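A sketch of both per-mapper options mentioned above, assuming two hypothetical annotation-free value types (`Point`, `Label`) and a hypothetical mix-in class (`LabelMixIn`). Only the mapper returned by `compactMapper()` writes arrays; any other mapper keeps the default object shape.

```java
import com.fasterxml.jackson.annotation.JsonFormat;
import com.fasterxml.jackson.annotation.JsonPropertyOrder;
import com.fasterxml.jackson.databind.MapperFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical value types without any Jackson annotations of their own.
class Point {
    public int x = 1;
    public int y = 2;
}

class Label {
    public String text = "hi";
    public int size = 3;
}

// Mix-in: carries annotations on behalf of Label, only where it is registered.
@JsonFormat(shape = JsonFormat.Shape.ARRAY)
@JsonPropertyOrder({"text", "size"})
abstract class LabelMixIn { }

class PerMapperShapeDemo {
    static ObjectMapper compactMapper() {
        ObjectMapper mapper = new ObjectMapper();
        // Deterministic order for types that do not declare one themselves.
        mapper.configure(MapperFeature.SORT_PROPERTIES_ALPHABETICALLY, true);
        // Option 1 (2.8+): config override, applies to exactly Point, not subtypes.
        mapper.configOverride(Point.class)
              .setFormat(JsonFormat.Value.forShape(JsonFormat.Shape.ARRAY));
        // Option 2: mix-in annotations, visible to this mapper only.
        mapper.addMixIn(Label.class, LabelMixIn.class);
        return mapper;
    }

    static String write(ObjectMapper mapper, Object value) {
        try {
            return mapper.writeValueAsString(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ObjectMapper compact = compactMapper();
        System.out.println(write(compact, new Point()));  // [1,2]
        System.out.println(write(compact, new Label()));  // ["hi",3]
        // A separate, unconfigured mapper still writes Point as a JSON object.
        System.out.println(write(new ObjectMapper(), new Point()));
    }
}
```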

In either case, a separate mapper is needed; but since the output format is different, this is required anyway. I think prototyping this might be the simplest way to go, and the most likely to yield something that improves things a lot.

As to Avro/Protobuf: neither changes any databinding behavior, so serializers/deserializers are not aware of differences in the underlying formats at all. Only the parser/generator is (XML is the only exception, where there are minor differences at a higher level, for SerializerProvider; but even there serializers/deserializers have no changes).

It is possible that handling of polymorphic types will (need to) be improved for Avro/Protobuf in the future, and 2.9 actually contains new hooks that allow this (JsonGenerator now has writeTypePrefix() / writeTypeSuffix()), giving more control to the streaming API. So this may offer a way forward. However... this would all be post-2.9, and would take time.

tbanyai commented 6 years ago

Ok, I finally ended up with CBOR + as-array serialization using mix-in annotations. This gives me about a 50% reduction in memory.

However, it would be really nice to have a global switch in the mapper to set as-array serialization for all classes. Do you think that is possible?

cowtowncoder commented 6 years ago

@tbanyai Glad to hear that approach works! (Another alternative is jackson-dataformat-smile, which can be slightly smaller yet, and it is easy to switch back and forth.) There is no way to force a default base JsonFormat.Value (unlike with default setter info or inclusion), but you can achieve this by sub-classing JacksonAnnotationIntrospector and overriding the method findFormat() to return a value with shape Shape.ARRAY (some care is needed not to override explicit annotations etc).
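A sketch of the findFormat() override described above, returning an as-array shape (`JsonFormat.Shape.ARRAY`) for classes that declare no format of their own. The class names and the `isOwnPojo` filter are assumptions made for illustration; which types to exclude depends on the data model.

```java
import com.fasterxml.jackson.annotation.JsonFormat;
import com.fasterxml.jackson.databind.MapperFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.introspect.Annotated;
import com.fasterxml.jackson.databind.introspect.AnnotatedClass;
import com.fasterxml.jackson.databind.introspect.JacksonAnnotationIntrospector;
import java.io.IOException;
import java.io.UncheckedIOException;

class AsArrayIntrospector extends JacksonAnnotationIntrospector {
    private static final long serialVersionUID = 1L;

    @Override
    public JsonFormat.Value findFormat(Annotated annotated) {
        JsonFormat.Value explicit = super.findFormat(annotated);
        if (explicit != null) {
            return explicit;  // never override a declared @JsonFormat
        }
        if (annotated instanceof AnnotatedClass && isOwnPojo(annotated.getRawType())) {
            return JsonFormat.Value.forShape(JsonFormat.Shape.ARRAY);
        }
        return null;          // members, JDK types, enums: default handling
    }

    // Assumed filter: leave JDK types, enums, arrays and interfaces untouched.
    private static boolean isOwnPojo(Class<?> type) {
        return !type.isPrimitive() && !type.isArray() && !type.isEnum()
                && !type.isInterface() && !type.getName().startsWith("java.");
    }
}

// Hypothetical annotation-free value types.
class Inner {
    public String tag = "a";
}

class Outer {
    public int id = 1;
    public Inner inner = new Inner();
}

class GlobalAsArrayDemo {
    static ObjectMapper compactMapper() {
        ObjectMapper mapper = new ObjectMapper();
        mapper.setAnnotationIntrospector(new AsArrayIntrospector());
        // Positional output needs a stable property order.
        mapper.configure(MapperFeature.SORT_PROPERTIES_ALPHABETICALLY, true);
        return mapper;
    }

    static String write(Object value) {
        try {
            return compactMapper().writeValueAsString(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Outer read(String json) {
        try {
            return compactMapper().readValue(json, Outer.class);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(write(new Outer()));  // [1,["a"]]
    }
}
```

The introspector is set on one mapper only, so mappers used for XML or JSON servlet responses are unaffected. In real code the mapper would be built once and reused rather than recreated per call.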

Or, possibly better if feasible, would be to use either a common base interface with the annotation or, if you have a certain set of annotations that all value types need, to create an "annotation bundle". This simply means creating a Java annotation type that carries the annotations you want plus @JacksonAnnotationsInside, and then using that single annotation: any annotations it contains are then expanded. While you still need to add that one annotation (... possibly in a shared base class or interface), it's more compact than adding multiple ones or even a full @JsonFormat.
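A sketch of such an annotation bundle, assuming a hypothetical bundle name `CompactValue` and a hypothetical value type `Reading`. Unlike the introspector approach, the bundle sits on the class itself, so it applies to every mapper that uses the default annotation introspector.

```java
import com.fasterxml.jackson.annotation.JacksonAnnotationsInside;
import com.fasterxml.jackson.annotation.JsonFormat;
import com.fasterxml.jackson.annotation.JsonPropertyOrder;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Bundle: Jackson expands the annotations placed on this annotation type.
@Target(ElementType.TYPE)
@Retention(RetentionPolicy.RUNTIME)
@JacksonAnnotationsInside
@JsonFormat(shape = JsonFormat.Shape.ARRAY)
@JsonPropertyOrder(alphabetic = true)
@interface CompactValue { }

// Hypothetical value type using the single bundle annotation.
@CompactValue
class Reading {
    public String sensor = "t1";
    public double value = 21.5;
}

class BundleDemo {
    static String write(Object value) {
        try {
            return new ObjectMapper().writeValueAsString(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(write(new Reading()));  // ["t1",21.5]
    }
}
```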

tbanyai commented 6 years ago

Overriding JacksonAnnotationIntrospector's findFormat is exactly what I was looking for (because JsonFormat is not used in the code). I need to do some polishing on forwarding format annotations when declared, returning null for Double, Integer, etc., but it is already working and produces serialized output less than half the size (compared to JSON). And it requires minimal maintenance.

With mix-in annotations I was doing some reflection when building each mapper (associated with the given root class it needs to handle). Of course that approach was missing some potential because it can't look behind the interfaces.

Thank you very much for the help, I'm closing this ticket.

P.S.: having realized that I only needed a 10-line class with a single override and an extra line configuring the mapper, IMHO many others would also find it useful if there were an out-of-the-box solution (something like "CompactCBOR", and of course accepting the cost of losing the self-describing nature) for producing binary output that is as small as possible.

cowtowncoder commented 6 years ago

@tbanyai Thank you for sharing your progress. I am happy that things worked out well: this is exactly the kind of usage I envisaged when I originally added the write-as-array feature.

Interesting idea about providing something like this in some kind of pre-packaged form. I will have to think more about this: it could also be a module or something, esp. since it could be usable with a couple of other formats too. I know it would be kind of nice to include it in jackson-databind for convenience ("not yet another small module"), but I am a bit concerned about consistency and orthogonality of the API: how to keep things as lean as possible (considering the functionality offered) and constrain domain/usage-specific aspects.

One other thing here could be writing an article/blog post on usage: I don't know if you do that regularly, but if you do (or are open to writing something) I think there would be many Java developers who would love to read it. I could also write something, but you have actual real usage experience here and can probably add all the necessary details and a somewhat more objective view of things. As package author I am bound to be more subjective, for better or worse.