silasdavis closed this pull request 9 years ago
This looks really awesome, thank you!
Let me poke around a bit to make sure I understand the changes then we'll get it merged in.
Okay, feel free to run any questions by me or to discuss changes to the PR.
I should also add for the record that I worked on these changes whilst working at SwiftKey, my employer, and SwiftKey is happy to license the contribution under the Apache 2.0 license.
Hey Chris - have you had a chance to review?
Not yet. I also think we should probably do a fairly major version bump for this, since it changes quite a bit.
Agreed re the version update - go to 2.6.0? And bump the Cascading version dependency from 2.5.5 to 2.6.3? I'm thinking we should save the 3.0 bump for a release that targets Cascading 3.x.
I've been using this in our production-ish system, and I've made some improvements to how Avro SpecificData (records, enums, fixed, etc.) are handled. The AvroSpecificRecordSerialization had some quirks (like the redundant WeakHashMap reference, unnecessary flushes, etc.). I've introduced a Serialization called AvroSpecificDataSerialization that is based on an abstract class from Hadoop which implements the base logic of Avro reading/writing and is used elsewhere. This class provides a broader set of serializations for intermediate Avro results and seems to perform better.
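For reference, a rough sketch of the shape this takes, assuming the Hadoop base class is org.apache.hadoop.io.serializer.avro.AvroSerialization (the actual class in the PR differs in its details):

```java
import org.apache.avro.Schema;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.avro.specific.SpecificRecord;
import org.apache.hadoop.io.serializer.avro.AvroSerialization;

// Sketch only: fills in Hadoop's abstract AvroSerialization hooks with
// Avro's specific-record machinery, leaving the base class to handle
// the Serializer/Deserializer plumbing.
public class AvroSpecificDataSerialization<T extends SpecificRecord>
    extends AvroSerialization<T> {

  @Override
  public boolean accept(Class<?> c) {
    return SpecificRecord.class.isAssignableFrom(c);
  }

  @Override
  public Schema getSchema(T t) {
    return t.getSchema();
  }

  @Override
  public DatumWriter<T> getWriter(Class<T> clazz) {
    return new SpecificDatumWriter<T>(clazz);
  }

  @Override
  public DatumReader<T> getReader(Class<T> clazz) {
    return new SpecificDatumReader<T>(clazz);
  }
}
```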
I've added some tests for it, and in the process moved the generated Avro classes and given some of them more standard names.
I'll leave these commits separate for now for review, but I can squash them before merging.
I've made another small but important update today that ensures AvroSpecificDataSerialization flushes on every write. The superclass from Hadoop that I was depending on only flushes on close, but Serializers are not meant to buffer (see our bug: https://issues.apache.org/jira/browse/HADOOP-11678).
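Concretely, the fix amounts to flushing the Avro encoder inside serialize() rather than only in close(). A minimal sketch of that shape (not the exact PR code):

```java
import java.io.IOException;
import java.io.OutputStream;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.hadoop.io.serializer.Serializer;

// Sketch: a Hadoop Serializer whose serialize() flushes the encoder on
// every record, since Serializer implementations must not buffer
// between writes (HADOOP-11678).
class FlushingAvroSerializer<T> implements Serializer<T> {
  private final DatumWriter<T> writer;
  private BinaryEncoder encoder;
  private OutputStream out;

  FlushingAvroSerializer(DatumWriter<T> writer) {
    this.writer = writer;
  }

  @Override
  public void open(OutputStream out) throws IOException {
    this.out = out;
    encoder = EncoderFactory.get().binaryEncoder(out, encoder);
  }

  @Override
  public void serialize(T t) throws IOException {
    writer.write(t, encoder);
    encoder.flush(); // flush on every write, not only on close()
  }

  @Override
  public void close() throws IOException {
    encoder.flush();
    out.close();
  }
}
```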
I've pulled the serialization logic down into AvroSpecificDataSerialization. I've also preferred Cascading's serializers for simple Java types such as long, by only accepting classes assignable to GenericContainer or Enum.
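That check is essentially this (illustrative, not the exact code):

```java
@Override
public boolean accept(Class<?> c) {
  // Only claim Avro container types and enums; anything else (long,
  // String, ...) falls through to Cascading's own serializers.
  return org.apache.avro.generic.GenericContainer.class.isAssignableFrom(c)
      || Enum.class.isAssignableFrom(c);
}
```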
Do you have any idea when we might be able to merge this?
Closing this pull request to replace it with one against the version-2.6 branch: https://github.com/ScaleUnlimited/cascading.avro/pull/37
So, I appreciate this is a rather large pull request. I've added various tests and tried not to break any existing APIs/contracts. I can look at splitting some of the changes out if it's really necessary, but some of them are linked, and quite a few aspects of the project felt like they could use a cleanup.
Version bumps:
The old versions of Avro in particular were causing problems. For my usage I also required newer versions of Cascading and Hadoop. If it is possible and necessary to maintain different Hadoop/Cascading versions, I suggest we do so on separate branches.
Changes:

- Registering the new serializations in `io.serializations` means we get intermediate serializability when doing this (see the sketch after this list).
- `byte[]`, `List`, `TupleEntry`, `Map`, etc. are handled by capturing the many-to-many nature of the type mappings in `TypeMappings`.
- Specific record classes are resolved via `SpecificData.get().getClass(schema)`; just downcast the converted `IndexedRecord`.
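For anyone wiring this up by hand, the registration looks something like the following (the package name for the serialization class is assumed here):

```java
import org.apache.hadoop.conf.Configuration;

public class RegisterSerializations {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Append to, rather than replace, the default io.serializations list
    // so the built-in Writable/Java serializations keep working.
    conf.set("io.serializations",
        conf.get("io.serializations") + ","
            + "cascading.avro.serialization.AvroSpecificDataSerialization");
  }
}
```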