FasterXML / jackson-dataformats-binary

Uber-project for standard Jackson binary format backends: avro, cbor, ion, protobuf, smile
Apache License 2.0

[AVRO] Generation of Avro schemas with logical types. #133

Closed jcustenborder closed 2 years ago

jcustenborder commented 6 years ago

This is the first part; we're not ready to merge. I want to get feedback on the approach I have so far. I have schema generation working. An Avro decimal must have its precision set, which means I have to know it during schema generation. Because of this I decided to go with an annotation-based approach, which seems to work pretty well so far. For example, this schema:

{
  "type" : "record",
  "name" : "DecimalType",
  "namespace" : "com.fasterxml.jackson.dataformat.avro.schema.TestLogicalTypes$",
  "fields" : [ {
    "name" : "value",
    "type" : {
      "type" : "bytes",
      "logicalType" : "decimal",
      "precision" : 5,
      "scale" : 0
    }
  } ]
}

is generated by this Java:

  static class RequiredDecimal {
    @JsonProperty(required = true)
    @AvroDecimal(precision = 3, scale = 3)
    public BigDecimal value;
  }

Using annotations gives us an odd benefit/problem with existing code: if you don't have an annotation, a BigDecimal will be serialized as a string. This is great for existing code, but a pain if you forget the annotation.

I went with a similar approach with all of the other logical types. In order to utilize them you need to use the corresponding annotation.
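
For context on what the bytes + decimal pairing above means on the wire: the Avro spec stores a decimal's unscaled value as a big-endian two's-complement byte array, with precision and scale living only in the schema. A minimal standalone sketch of that encoding (plain java.math, not this PR's code and no Jackson or Avro dependency):

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

public class DecimalBytes {
    // Per the Avro spec, the "bytes" decimal representation is the
    // two's-complement big-endian form of the unscaled value; the scale
    // is carried only by the schema, not by the payload.
    public static byte[] encode(BigDecimal value, int scale) {
        // UNNECESSARY makes rescaling throw rather than silently lose digits.
        return value.setScale(scale, RoundingMode.UNNECESSARY)
                .unscaledValue().toByteArray();
    }

    public static BigDecimal decode(byte[] bytes, int scale) {
        return new BigDecimal(new BigInteger(bytes), scale);
    }

    public static void main(String[] args) {
        byte[] encoded = encode(new BigDecimal("123.45"), 2);
        System.out.println(decode(encoded, 2)); // prints 123.45
    }
}
```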

jcustenborder commented 6 years ago

Decimals are looking good. I have made some great progress here. I now have schema generation support for BYTES and FIXED based schemas via AvroDecimal.

{
  "type" : "record",
  "name" : "FixedDecimalType",
  "namespace" : "com.fasterxml.jackson.dataformat.avro.schema.TestLogicalTypes$",
  "fields" : [ {
    "name" : "value",
    "type" : {
      "type" : "fixed",
      "name" : "foo",
      "namespace" : "com.fasterxml.jackson.dataformat.avro.schema",
      "size" : 8,
      "logicalType" : "decimal",
      "precision" : 5,
      "scale" : 0
    }
  } ]
}
{
  "type" : "record",
  "name" : "BytesDecimalType",
  "namespace" : "com.fasterxml.jackson.dataformat.avro.schema.TestLogicalTypes$",
  "fields" : [ {
    "name" : "value",
    "type" : {
      "type" : "bytes",
      "logicalType" : "decimal",
      "precision" : 5,
      "scale" : 0
    }
  } ]
}

Right now I am going down the path of using the conversions that are built into Avro for this. @cowtowncoder The big thing I could use your assistance for is direction on the deserialization. Should I be looking into extending AvroParser with methods like decodeDecimal(), etc to handle these?

cowtowncoder commented 6 years ago

Excellent progress.

One update from my side: I am now leaning towards creating a 2.10 branch as a sort of forward-looking minor version, with some of the concepts from 3.0 (basically simple builders for ObjectMapper, JsonFactory), to allow writing code that works with 2.10 and is easier to convert to 3.0. But I also want to limit the number of new features in 2.10 to as few as possible, to stabilize it as the only long-lived 2.x maintenance branch.

But. I think this feature could make lots of sense for 2.10; that would make more sense than adding it in a 2.9 patch.

I hope to release 2.9.6 sometime in May (next 2 weeks), and then create 2.10 branch.

jcustenborder commented 6 years ago

@cowtowncoder I have decimals finished and working properly. I added some testing so now I'm doing round trips as well as comparing the serialized output against the equivalent GenericRecord generated by Avro. I want to make sure that the output from Jackson is readable Avro.

I need a little help with the date types. Take timestamp-micros, the microsecond timestamp format for Avro. I'm not sure where to hook in on the Jackson side to get the actual Date object; it looks like Jackson is converting the field to a long before it gets to me. How can I get the actual date value? How do I handle the java.time.* classes? Should this require the JavaTimeModule? Everything I have done so far forces a schema to be defined on the Java type.
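
For what it's worth, the epoch-micros arithmetic itself is straightforward once a date value is in hand; the hard part described above is intercepting it before databind coerces it to a long. A standalone sketch of the conversions (plain java.time/java.util, not this PR's code):

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Date;

public class EpochMicros {
    // timestamp-micros: microseconds since the unix epoch, per the Avro spec.
    public static long toEpochMicros(Instant instant) {
        return ChronoUnit.MICROS.between(Instant.EPOCH, instant);
    }

    // java.util.Date only has millisecond precision, so scaling up is lossless.
    public static long toEpochMicros(Date date) {
        return date.getTime() * 1000L;
    }

    public static void main(String[] args) {
        Instant t = Instant.ofEpochSecond(1, 500_000); // 1 s + 0.5 ms
        System.out.println(toEpochMicros(t)); // prints 1000500
    }
}
```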

jcustenborder commented 6 years ago

@cowtowncoder We are nearly there. I need your help with one more spot and then I'm finished, assuming there are no changes required for merging. I'm not sure how to override how java.util.Date is serialized; it seems like the serializer is selected before the AnnotationIntrospector has a chance to pass off the serializer. To support timestamp-micros I need to convert this value to microseconds. I also want to grab the time part of the Date for time-micros and time-millis. I can't do this unless I am able to override their serializers and deserializers. Can you give me any pointers?

For completeness I have implemented all of the Avro logical types. I also created another module that will allow us to have java.time.* support.

jcustenborder commented 6 years ago

Nevermind on the AnnotationIntrospector. I made a mistake and inherited from the wrong class.

jcustenborder commented 6 years ago

@cowtowncoder I think I'm done. I have every logical type implemented with their Java 7 and Java 8 types. The build is failing for the Java 8 portion, which is in a different module. Let me know if you want it moved someplace else. Please let me know what changes you need from here. Thanks for your help; it's much appreciated. I'm a huge fan of Jackson.

cowtowncoder commented 6 years ago

@jcustenborder Excellent. I hope to get the 2.9 release done soon, and after that could consider merging this. The Java 8 issue is a bit tricky since we'll really only require Java 8 with Jackson 3.x. Although I could consider making the Avro module Java 8 for 2.10, I don't know if it'd be easy to do that just for one module while keeping the other binary format modules on Java 7.

I think what we could do is merge the Java 7 portions into the 2.10 branch, but the Java 8 ones only into master?

But I should probably have a look at the code changes first -- if they are safe for 2.9 (aside from the Java 8 part), we could just merge for 2.9 but announce the changes for 2.10; that is, the official addition of the features there.

cowtowncoder commented 6 years ago

Ah ok. I should have looked at the code a bit earlier, as I think I misunderstood the approach.

One thing I am not quite sure about is the addition of many new annotations. I was kind of assuming that these would not (always?) be needed, since:

  1. Avro schema has logical type information, and that should be exposed via introspection functionality so that encoder/decoder is aware of mapping
  2. POJOs also have type information, for databinding, to do coercions.

I could perhaps see the need for some cases if information is not otherwise sufficient (like override to decide between BYTES/FIXED encoding). And obviously things like precision are problematic for (2) (schema has information, Java type not).

If new annotations are needed, on the other hand, a smaller set -- perhaps just one annotation? -- would seem better, with the value being the Java type (BigDecimal, date/time types), and possibly another property in case the Avro shape (bytes, fixed, string etc., as an Enum) can be one of multiple alternatives. This could be problematic if specific attributes are needed (precision), but at the same time adding yet another annotation type for each logical type could become a maintenance problem.

jcustenborder commented 6 years ago

@cowtowncoder Sorry about the delay; I was on vacation. What about moving to a single annotation with an enum for the Avro type and logical type? Decimal is really the only weird one, because it can be either a fixed or bytes, hence the implementation. My goal was to put something in place where current code in the wild would still work as before while still supporting the new logical types; that's why I had so many annotations. I could reduce this all to a single annotation.

jcustenborder commented 6 years ago

@cowtowncoder I removed all of the annotations in favor of a single one and deprecated AvroFixedSize. This lets you use a single @AvroType annotation to create both logical types and standard types such as fixed. For example, you can create a fixed with @AvroType(typeName = "FixedFieldBytes", fixedSize = 4, schemaType = Schema.Type.FIXED), and a decimal with @AvroType(schemaType = Schema.Type.BYTES, logicalType = AvroType.LogicalType.DECIMAL, precision = 5). Let me know your thoughts on this one.

jcustenborder commented 6 years ago

@cowtowncoder Sorry to bug you. Just wanted to close the gap on this one. Is this closer to what you were looking for?

jcustenborder commented 6 years ago

@cowtowncoder Sorry to bug you again. Please let me know if you need more changes on this one or if you want me to break any more of it out. I'm also happy to assist on what's needed here for the 3.0 release.

jcustenborder commented 6 years ago

@cowtowncoder Sorry to bug you again. Do you need me to do anything else with this pull?

jcustenborder commented 6 years ago

@cowtowncoder Do you need me to do anything else with this pull?

lukejackson commented 6 years ago

big +1 from me on getting this feature reviewed and merged

lukejackson commented 6 years ago

@cowtowncoder is this PR held up because the branch it targets still needs to support Java 7? Should some of this PR target jackson-modules-java8, or is Java 8 support in core Jackson close enough that it can target this repo directly?

jcustenborder commented 6 years ago

My thought was that I could back out the types that carry time zone information. I was converting everything to UTC, which is a data-loss scenario. I could move to supporting only the date types that do not have offset information: LocalDate, LocalTime, LocalDateTime, and Instant. For the other types I would create a logical type and gauge the Avro community's support for them.
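
Those four types map onto the spec's logical types without any zone assumptions. A standalone sketch of two of the zone-free conversions (plain java.time, not this PR's code):

```java
import java.time.LocalDate;
import java.time.LocalTime;

public class LocalLogicalTypes {
    // Avro "date" logical type: days since the unix epoch.
    public static long toEpochDays(LocalDate date) {
        return date.toEpochDay();
    }

    // Avro "time-millis" logical type: milliseconds after midnight.
    public static int toTimeMillis(LocalTime time) {
        return (int) (time.toNanoOfDay() / 1_000_000L);
    }

    public static void main(String[] args) {
        System.out.println(toEpochDays(LocalDate.of(1970, 1, 2))); // prints 1
        System.out.println(toTimeMillis(LocalTime.of(0, 0, 1)));   // prints 1000
    }
}
```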

On Thu, Aug 16, 2018 at 7:17 AM Luke Jackson notifications@github.com wrote:

@lukejackson commented on this pull request.

In avro-java8/src/main/java/com/fasterxml/jackson/dataformat/avro/java8/AvroJavaTimeAnnotationIntrospector.java https://github.com/FasterXML/jackson-dataformats-binary/pull/133#discussion_r210572913 :

+import java.time.LocalDate;
+import java.time.LocalDateTime;
+import java.time.LocalTime;
+import java.time.OffsetDateTime;
+import java.time.ZonedDateTime;
+
+class AvroJavaTimeAnnotationIntrospector extends AvroAnnotationIntrospector {
+  static final AvroJavaTimeAnnotationIntrospector INSTANCE = new AvroJavaTimeAnnotationIntrospector();
+
+  @Override
+  public Object findSerializer(Annotated a) {
+    AvroType logicalType = _findAnnotation(a, AvroType.class);
+    if (null != logicalType) {
+      switch (logicalType.logicalType()) {
+        case TIMESTAMP_MILLISECOND:

I hadn't thought about adding new logical types to Avro.

I think the way forward depends on what you're looking to achieve. If it's a way to represent all the Java related time classes in Avro - i.e. the Java POJO is the canonical schema and the Avro representation is just derived from that - then I can understand you'd look for a way to encode them in Avro. There are some options:

  1. reusing another type which is close in meaning (e.g. timestamp-millis for LocalDateTime and have any consumers - which may not be this library - know that they should convert the timestamp into a ZonedDateTime with UTC time zone and convert to LocalDateTime)
  2. adding your own logical types. this would mean that any consumer - which again may not be this library - would also have to implement them.
  3. using the record type composed of types supported by Avro. this is what Jackson does when converting POJOs to JSON - they become JSON dictionaries. Avro records would be the equivalent here. e.g. LocalDateTime would be a record containing a date field and a time-millis/time-micros field.

If instead you are looking for a way to directly represent Avro records as Java classes - i.e. the Avro schema is the canonical schema and the Java POJOs are just derived from that - then you just need to support the Avro types and no more, and leave it to the user to define their schemas to model the data they want to encode (e.g. manually define a record representing local date and local time), or let them choose how to shoehorn the value into an existing Avro type (e.g. represent as a string, as a long, etc.)

For my use cases, the Avro schemas are the canonical schemas. As such I only use the Avro types. This is because I use Avro to communicate between applications written in multiple languages, such as Java, C++, Python, and I rely on Kafka tools that only support Avro types (e.g. Connect and its various sinks).

Where the Avro types don't support my use case I either use a different type (e.g. long for nano precision timestamps) or multiple fields (e.g. combination of date and time-millis for a local date time), and I'm happy to accept that these fields would appear as such in Java, C++, Python, my database tables, etc.

lukejackson commented 6 years ago

@jcustenborder it would be great if there was appetite within the Avro community to add new logical types.

timestamp-nanos would be particularly useful for the financial use cases my company faces. A duration type that is simply a number of nanoseconds would also be useful (the Avro duration is a strange hybrid of months, days and milliseconds). Both are typically represented as a 64-bit long of seconds and a 32-bit int of nanoseconds, i.e. a fixed of size 12 as the underlying Avro type.
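
The fixed-of-size-12 layout described above is not an official Avro logical type; as a hypothetical illustration of the proposed encoding:

```java
import java.nio.ByteBuffer;
import java.time.Instant;

public class NanoTimestamp {
    // Hypothetical "timestamp-nanos" layout: 64-bit epoch seconds followed
    // by a 32-bit nanos-of-second, packed into an Avro fixed of size 12.
    public static byte[] encode(Instant instant) {
        return ByteBuffer.allocate(12)
                .putLong(instant.getEpochSecond())
                .putInt(instant.getNano())
                .array();
    }

    public static Instant decode(byte[] fixed12) {
        ByteBuffer buf = ByteBuffer.wrap(fixed12);
        return Instant.ofEpochSecond(buf.getLong(), buf.getInt());
    }

    public static void main(String[] args) {
        Instant t = Instant.ofEpochSecond(42L, 123_456_789);
        System.out.println(decode(encode(t)).equals(t)); // prints true
    }
}
```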

Related to that is the following PR to add java.time logical types support for code generated Java specific records: https://github.com/apache/avro/pull/309

Looking forward to seeing java.time support in this and the Avro library.

cwuerz commented 5 years ago

Any news here? Avro 1.9.0 got released and logical types are supported in several mappers (e.g. for Scala: avro4s). I have looked into this PR and it looks quite promising. Is there a plan to eventually merge it?

cowtowncoder commented 5 years ago

I haven't had time to look into the Avro module unfortunately, and I know there are a few PRs and issues that might be relatively easy to tackle. I'm not sure yet how to tackle this one.

One note though: any PRs for new handling would now need to go against 2.10 as 2.9 is so far down patches that goal is to minimize any regressions.

jcustenborder commented 5 years ago

I'll rebase to 2.10

jcustenborder commented 5 years ago

@cowtowncoder I rebased this to the 2.10 branch and tested.

cowtowncoder commented 5 years ago

Addressing the earlier question on Java 8: Jackson 2.x core components need to run on JVM 6, compiled on JDK 7, as of Jackson 2.10. This baseline will not change, but it also need not apply to changes outside the core components: changes here affect just this repo and the binary dataformat modules. In fact, the only question is (I think) what baseline the Avro module needs.

If absolutely necessary, I think requiring JDK 8 for the Avro module might be acceptable to the community, esp. if the Apache Avro library is moving that way. But it is something that would need to be decided.

I think that there might also be Jackson 2.11 after 2.10: for that, upgrade to baseline for Avro module would be even easier to do.

But the main question is whether there is strict need to go Java 8 or not with 2.x. Jackson 3.x will definitely require Java 8 anyway.

cowtowncoder commented 5 years ago

Hi again @jcustenborder! I'll try to help here, along with other things, and see if this can make it into the 2.10 release (perhaps even 2.10.0.pr2). But I think what would help is if others who are active with this module could have a look and give feedback. Aside from @lukejackson this could include @baharclerode and @marcospassos. I'll need to rely on others a bit more -- although I may have the best idea of the intent of the module's original design, I need to balance my time across all modules, and I don't want to be a blocker. Plus the design evolves over time in new directions too, and that's where others can help discuss directions to take, trade-offs, and challenges.

So if anyone has additional comments, suggestions, please add here, and I will try to get back to this soon too (I will add this on https://github.com/FasterXML/jackson-future-ideas/wiki/Jackson-Work-in-Progress to-evaluate section).

jcustenborder commented 5 years ago

Thanks @cowtowncoder! I appreciate your support, and Jackson is an amazing project. Java 8 should not be required; it's only needed to handle java.time objects. If you are using java.util.Date it will work just fine.

@lukejackson, @baharclerode, @marcospassos, @cwuerz can you all provide feedback on this one?

cowtowncoder commented 5 years ago

EDIT: most of below is based on misunderstanding the goal here -- I was somehow assuming this was about detecting logical type from Avro Schemas, instead of doing the reverse for schema generation.


Ok. So I should have paid attention earlier here, as I am fundamentally confused -- why is this patch providing its own (de)serializers via AnnotationIntrospector? I assume it is necessary due to representation being something java8 module package would not easily be able to read from?

Problem for me is that this is pretty much against the design of jackson streaming API, databind.

So I'll have to think about this, and consider if there would be other ways to achieve design in which format layer, with schema information, would be able to represent content in a way that does work.

One immediate thought is that streaming api does have JsonToken.VALUE_EMBEDDED_OBJECT, exposing which does allow means to expose decoded values, and that standard deserializers can (and often do) make use of (and if not, is easy enough to extend). On writer-side there is just JsonGenerator.writeEmbeddedObject(), which could work for direct use but less so via databind, so that might be bigger challenge.

Anyway. I'll start thinking about this bit more, and hope to read it through with more thought. Avro module is already quite special case (but not as much of a mess as say XML format is :) ), but I want there to be some balancing act to try to avoid full-scale reimplementation of Java 8 date/time functionality.

jcustenborder commented 5 years ago

First off, I really appreciate you taking the time to look into this. This is a large patch, so reviewing takes time, something none of us tends to have.

Ok. So I should have paid attention earlier here, as I am fundamentally confused -- why is this patch providing its own (de)serializers via AnnotationIntrospector? I assume it is necessary due to representation being something java8 module package would not easily be able to read from?

I went down this path because some of the time formats support different resolutions. For example, the logical type Timestamp has timestamp-millis and timestamp-micros to switch between millisecond and microsecond resolution. I added the avro-java8 module to allow java.time to be utilized under Java 8; I based this decision on jackson-modules-java8.

Problem for me is that this is pretty much against the design of jackson streaming API, databind.

To quote a buddy of mine, that's not optimal.

So I'll have to think about this, and consider if there would be other ways to achieve design in which format layer, with schema information, would be able to represent content in a way that does work.

One immediate thought is that streaming api does have JsonToken.VALUE_EMBEDDED_OBJECT, exposing which does allow means to expose decoded values, and that standard deserializers can (and often do) make use of (and if not, is easy enough to extend). On writer-side there is just JsonGenerator.writeEmbeddedObject(), which could work for direct use but less so via databind, so that might be bigger challenge.

I'm open to whatever suggestions you have, but would need some direction here.

Anyway. I'll start thinking about this bit more, and hope to read it through with more thought. Avro module is already quite special case (but not as much of a mess as say XML format is :) ), but I want there to be some balancing act to try to avoid full-scale reimplementation of Java 8 date/time functionality.

My only concern here is the logical types Time and Timestamp, which allow both millisecond and microsecond resolutions. I would need some pointers on how to handle this.

cowtowncoder commented 5 years ago

@jcustenborder Yes, fair point: it's not enough to suggest that the approach seems less than optimal (if this was easy to do, it'd have been done, etc.); we need to figure out the best possible approach that gives the value we want. And it is also true that Jackson is missing the sort of intermediate layer that could help the streaming layer give/take logical type indicators, so it would not be only up to databind.

In fact, CBOR backend has very similar need to allow type hints -- and come to think of that YAML, too!

So... I'll have to think about this some more. One additional thing worth reiterating is that adding one or two methods to JsonParser / JsonGenerator is fine, if that helps. So just as an initial thought exercise... what if, in addition to the (polymorphic) Type Id and Object Id that are exposed by the streaming layer, there were a similar "logical type" to be exposed (as well as a way to check whether the backend provides such hints)?

But as to concrete steps: I would try to work on the reader side, expose scalar (date/time) values as VALUE_EMBEDDED_OBJECT, and move things piece by piece.

cowtowncoder commented 5 years ago

Oh. And I think I also managed to miss the starting point here... so scratch most of my suggestions.

So just to make sure before trying to suggest anything else: the goal here is to improve generation of Avro Schemas from (possibly annotated) POJOs? If so, the approach makes more sense, particularly as the schema generation callbacks are based on serializers.

jcustenborder commented 5 years ago

The goal for this PR was to provide support for Avro logical types: being able to map Java types to the following logical types when annotated properly.

Java Type     | Avro Logical Type                  | Notes
------------- | ---------------------------------- | -------------------
LocalDate     | date                               | Lossless conversion
LocalTime     | time-millis, time-micros           | Lossless conversion
LocalDateTime | timestamp-millis, timestamp-micros | Lossless conversion
Instant       | timestamp-millis, timestamp-micros | Lossless conversion
BigDecimal    | decimal                            | Lossless conversion

marcospassos commented 5 years ago

@jcustenborder There is a logical inconsistency around LocalDateTime: one cannot convert a LocalDateTime to an instant without time zone information. Treating a local date-time as an instant is a common mistake that can lead developers to bad practices. If the intention is to represent a date and time at a specific point in the timeline, then it would be correct to use a ZonedDateTime. So, I'd suggest mapping the logical types accordingly.

jcustenborder commented 5 years ago

@marcospassos I'm not sure if I follow. LocalDateTime and Date are equivalents and are just storage of time since epoch which is UTC. The Avro logical type 'timestamp-millis' is milliseconds since epoch. There is no timezone information stored because Avro does not have a specification for storing timezone information.

marcospassos commented 5 years ago

LocalDateTime and Date are equivalents and are just storage of time since epoch which is UTC.

They're not.

This class does not store or represent a time-zone. Instead, it is a description of the date, as used for birthdays, combined with the local time as seen on a wall clock. It cannot represent an instant on the time-line without additional information such as an offset or time-zone.

jcustenborder commented 5 years ago

Technically you are correct. EPOCH is UTC and this pull follows what Avro itself does. It uses UTC for the time zone. For example

marcospassos commented 5 years ago

That's precisely why it's wrong. Encoding a LocalDateTime in UTC is not serialization, but conversion. We should not make assumptions like that.

jcustenborder commented 5 years ago

Can't say that I agree with you there. LocalDateTime does not have timezone information. Given that the point of this PR is to add support for logical types, how would you store it? Forcing users to use ZonedDateTime doesn't necessarily fit either; following the logical type specification, there is no place to store the timezone offset. Jackson serializes LocalDateTime in JSON without storing the timezone.

marcospassos commented 5 years ago

I think you missed my point. LocalDateTime cannot be stored as a timestamp, so it should not be mapped like so.

jcustenborder commented 5 years ago

No I saw that. I just disagree.

cowtowncoder commented 5 years ago

@jcustenborder I would recommend you read more about the Java 8 "local" date/time types: they are NOT timestamps -- the "zoned" variant (ZonedDateTime) is (or OffsetDateTime). "Local" here is a bit of an odd naming convention, as it could easily be thought to refer to date/time local to a timezone, but this is not true. Rather, it is an "abstract" notion of a specific date, or specific time (or both), not bound to anything -- not to a timezone and not to UTC either. It must be bound to produce a timestamp. SO has this, for example:

https://stackoverflow.com/questions/32437550/whats-the-difference-between-instant-and-localdatetime
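
To make the distinction concrete: binding the same LocalDateTime to different offsets produces different instants, which is why it is not itself a timestamp. A minimal illustration (plain java.time):

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class LocalVsInstant {
    // Binding the same "abstract" wall-clock value to different offsets
    // yields different points on the timeline.
    public static long epochMillisAt(LocalDateTime ldt, ZoneOffset offset) {
        return ldt.toInstant(offset).toEpochMilli();
    }

    public static void main(String[] args) {
        LocalDateTime ldt = LocalDateTime.of(1970, 1, 1, 0, 0);
        System.out.println(epochMillisAt(ldt, ZoneOffset.UTC));        // prints 0
        System.out.println(epochMillisAt(ldt, ZoneOffset.ofHours(5))); // prints -18000000
    }
}
```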

But as to my earlier question: "The goal for this PR was to provide support for Avro Logical Types" is a bit general, as support can mean multiple things. What I was earlier assuming (incorrectly, I think) was that it'd

  1. Use logical type information provided in schema to interpret encoded data to allow Jackson to read such Avro data into POJOs.

but I think this would not be doing that but rather

  1. Use additional annotation information to generate Avro Schema with logical type metadata.

That is fine, but it leads to the follow-up question of exactly where this should be supported. I am ok with the format module adding such generation, although in some sense it might make sense to have a separate "Avro schema generation" module, as there is for JSON Schema. But at the same time I am not sure this separation works cleanly.

jcustenborder commented 5 years ago

@cowtowncoder I think the disconnect is that my comments are centered around binary compatibility with Avro data. My goal for this PR is to be able to read and write logical type data using Jackson that is compatible with Apache Avro.

We have had some of these discussions before. LocalDateTime is a timestamp without timezone information. The Avro library assumes that objects of LocalDateTime are UTC. The priority of this PR is to read and write data in a way that is compatible with Apache Avro.

marcospassos commented 5 years ago

LocalDateTime is a timestamp without timezone information.

There is no such thing as a "timestamp without timezone information." A timestamp is a point in the timeline, always in UTC. What Avro does is conversion, rather than serialization -- and it's wrong, IMHO.

jcustenborder commented 5 years ago

Conversion is fine given it's a known conversion. If we were talking about serialization in general, I would agree with you, but we are talking about conforming to a specific specification. Avro has a known specification for logical types, and both timestamp-millis and timestamp-micros state they are from UTC:

A timestamp-millis logical type annotates an Avro long, where the long stores the number of milliseconds from the unix epoch, 1 January 1970 00:00:00.000 UTC.

A timestamp-micros logical type annotates an Avro long, where the long stores the number of microseconds from the unix epoch, 1 January 1970 00:00:00.000000 UTC.
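
In code, the convention being debated amounts to binding the local value to UTC before taking epoch milliseconds. A standalone sketch of that (lossless as a round trip, but only under the UTC assumption; not this PR's code):

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class AvroLocalDateTimeConvention {
    // The conversion under debate: treat the LocalDateTime as if it were
    // in UTC, then take milliseconds since the unix epoch.
    public static long toTimestampMillis(LocalDateTime ldt) {
        return ldt.toInstant(ZoneOffset.UTC).toEpochMilli();
    }

    public static LocalDateTime fromTimestampMillis(long millis) {
        return LocalDateTime.ofEpochSecond(
                Math.floorDiv(millis, 1000L),
                (int) (Math.floorMod(millis, 1000L) * 1_000_000L),
                ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        LocalDateTime ldt = LocalDateTime.of(1970, 1, 1, 0, 0, 1);
        long millis = toTimestampMillis(ldt);
        System.out.println(millis); // prints 1000
        System.out.println(fromTimestampMillis(millis).equals(ldt)); // prints true
    }
}
```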

cowtowncoder commented 5 years ago

First of all, thank you for the explanation of where the mapping of LocalDateTime comes from. I do not know why the Apache project added such a misguided conversion, but I can see why, for compatibility reasons, it would make sense to follow that direction. Still, to me that is just wrong: it should only support the Zoned- and/or Offset- variants, and the Local- representation should either be textual, or structured to use numeric components for year/month/day hour/minute/second.

Does the Avro lib generate LocalDate / LocalTime / LocalDateTime in some form? And no Offset- variant at least? If not the latter, why? That simply makes no sense: it is perfectly fine to create an OffsetDateTime defaulting to UTC.

I'm a bit at a loss here, as I can see the issue but no good solution. I would not want to propagate what seems like misguided processing of date/time values.

cowtowncoder commented 5 years ago

On the goal here: ok, yes. I can see that with the serializer, deserializer, and use of annotations it would implement read/write support for Java 8 date/time types using binary representations, and not just generate schema information. The approach used makes sense, too, as Avro does not carry any such metadata at the binary level, so it needs to come from somewhere.

I can probably come around wrt Local date/time types as well, with some more time.

But with that, some follow-up questions:

  1. (open question, thinking out loud) Is there any way to split this into pieces that can be merged first? Like changes to readers that are not specific to the new logical datatypes (but will be needed)
  2. Would it make sense to directly register (de)serializers for types where there is nothing to configure? For example, for BigDecimal, simply have the Module register the serializer and deserializer to use, overriding what jackson-databind would otherwise use?
    • In fact, even if an annotation is required, the (de)serializer's createContextual() could check that information (with help of the AnnotationIntrospector, perhaps, or directly) to create a differently configured (de)serializer
    • Or is there so much necessary metadata for Schema generation that @AvroType is necessary?
  3. Do you think any of the binary representations are universal enough that default Jackson deserializers could use them -- as an example, UUIDDeserializer accepts 16-byte binary data as input, and UUIDSerializer writes such content if (and only if) JsonGenerator.canWriteBinaryNatively() returns true -- this is used by the Smile and CBOR backends
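
The 16-byte UUID representation referenced in (3) is the two 64-bit halves written big-endian. A standalone sketch of that layout (plain JDK, not Jackson's UUIDSerializer/UUIDDeserializer themselves):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidBytes {
    // Write the UUID as 16 bytes: most-significant long, then least-significant.
    public static byte[] toBytes(UUID uuid) {
        return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }

    public static UUID fromBytes(byte[] bytes16) {
        ByteBuffer buf = ByteBuffer.wrap(bytes16);
        return new UUID(buf.getLong(), buf.getLong());
    }

    public static void main(String[] args) {
        UUID u = new UUID(1L, 2L);
        System.out.println(fromBytes(toBytes(u)).equals(u)); // prints true
    }
}
```
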
cowtowncoder commented 5 years ago

Ohhhhh. Looks like the Avro backend is NOT overriding JsonGenerator.canWriteBinaryNatively(). Will file an issue and fix, and see if I can write a test to show that handling works... although this could have a downside, too, if someone is counting on UUIDs (having to) be Strings.

jcustenborder commented 5 years ago

Thank you for reviewing this! Much appreciated. I'll go back over your comments and get some updates to address your feedback. I really appreciate your help.

cowtowncoder commented 5 years ago

@jcustenborder At this point, I think I have a good understanding of what would be needed here, at high level. But discussion might be easiest to have over email (and summarize here as needed). I sent you an email note to see if address I had still works.

But for others benefit, here's what I think:

  1. I understand the goals, and I actually am ok with the date/time parts: it is very important to document/describe it right, and if so, I think the uses and mappings actually do make sense, looking purely from the Java direction (I won't go deeper into it here but am happy to elaborate if anyone is interested)
  2. There are time constraints in getting 2.10 out by end of September, which requires decision soon on what to include
  3. I have some plans to tackle some aspects (but not all) of what is worked here, in bit more general way -- some issues are similar with CBOR and Protobuf format modules for example
  4. Since the general changes (that I haven't done or fully fleshed out yet) and the specific proposed changes here likely overlap, some coordination would be good so that short-term specific ones do not block general changes (or vice versa)

So.... with that, I was thinking it would be great to, essentially:

  1. Not yet fully integrate this functionality in 2.10 (due to time constraints, and risks in rushing changes -- this is my fault, mostly)
  2. Provide enough of base changes to allow external implementation to be published, and be used with 2.10, to help guide 2.11 design
  3. Likely integrate rest of functionality (however divided across general functionality / Avro-specific handlers) in 2.11 (and 3.0).

The trick, really, is just to figure out the minimal set of changes needed in the Avro module for 2.10. I can also share some ideas I have for logical types, but I hope to first agree on the approach here.

Thank you everyone for working on getting this done: it is a pretty ambitious but important undertaking, and once working, it should open up new possibilities beyond what I had thought practical.

cowtowncoder commented 5 years ago

@jcustenborder Ok. After one more re-read, things are much clearer. The changes to the existing Avro backend make sense, the only possible caveat being the UUID handlers, but even those would be fine I think (could improve later). The changes to add a new module are something that I think should start as a separate module first, and then we could figure out how to package it -- probably added as a sub-module here in 2.11 or so.

Would it be possible to split this into two parts? I think the Avro-package changes are only related to internal API, not externally usable (...or if someone uses it, I think that'd be unsupported), so I would be ok with this going into one of the 2.10.x patches if it does not make it into 2.10.0 (we only have 1 more day until that). But I'd like to get that part in soon, if possible. I'll create the 2.11 branch relatively quickly after the 2.10.0 release and we can iterate on some of the other changes.

cowtowncoder commented 2 years ago

I think this is probably obsoleted by later work, closing.