Unified AvroSchemes with depth parameter, made conversion logic pluggable, supported unions

silasdavis commented 9 years ago

So I appreciate this is a rather large pull request. I've added various tests and I've tried to not break any existing APIs/contracts. I can look at splitting some of the changes out if it's really necessary, but some of them are linked and quite a few aspects of the project felt like they could use a cleanup.

Version bumps:

Hadoop 1.0.3 -> 2.4.0
Cascading 2.1.6 -> 2.6.1
Avro 1.7.4 -> 1.7.7

The old versions of Avro in particular were causing problems. For my usage I also required newer versions of Cascading and Hadoop. If it is possible and necessary to maintain different Hadoop/Cascading versions, I suggest we do so on separate branches.

Changes:

Unified the notions of AvroScheme and PackedAvroScheme with a depth parameter (how far should we unpack records into TupleEntries). This is mostly to support ShallowAvroScheme, which is a TupleEntry representing an outer Avro record that leaves any first-level child Avro record fields as Avro records. Adding AvroSpecificRecordSerialization to the JobConf io.serializations means we get intermediate serializability when doing this.
Made the conversion logic between Avro and 'Cascading' types pluggable. There are various ways we may want to represent logical Avro records in Cascading. I've split and tidied up this logic so that you can provide your own classes implementing the interface AvroConverter<TRecordFrom,TRecordTo>, or subclass one of the conversion classes (themselves sharing common traversal logic, pivoting on a Schema, in AvroConverterBase).
Supported fairly arbitrary union types. It is conceivable that with a weird-enough union type (such as one containing a string array and a string map) the conversion won't give you what you want, but I think in most cases it should be adequate, and it certainly improves on what was there.
Increased the flexibility of Avro conversion logic with respect to byte[], List, TupleEntry, Map, etc. by capturing the many-to-many nature of the type mappings in TypeMappings.
Standardised by representing Avro maps and arrays as Tuples, and records as TupleEntries in Cascading land. Provided helpers to convert to maps. This means intermediate serialisations (such as TempHfs and others) don't break (requiring extra stages in pipeline) when dealing with GroupBy and other aggregations (Cascading doesn't know how to serialise lone java Maps and Lists - but you can embed them in Avro records if you like).
Made CascadingToAvro serialize into SpecificRecords when they are available via SpecificData.get().getClass(schema). Just downcast the converted IndexedRecord.
Used functionality in newer Avro to remove reflection from AvroSpecificRecordSerialization
Updated some test logic
Added some tests
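
To illustrate the depth parameter and the pluggable converter idea from the list above, here is a minimal sketch in plain Java. It does not use the cascading.avro classes: Rec, Converter and DepthLimitedConverter are hypothetical stand-ins, and the exact depth semantics of the real AvroScheme are an assumption. With a depth of 1 the outer record is unpacked and nested records are left intact, which is roughly the "shallow" behaviour described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DepthUnpackSketch {
    /** Hypothetical stand-in for an Avro record: named fields holding values or nested records. */
    static final class Rec {
        final Map<String, Object> fields = new LinkedHashMap<>();

        Rec put(String name, Object value) {
            fields.put(name, value);
            return this;
        }
    }

    /** Hypothetical pluggable converter, loosely modelled on the AvroConverter<TRecordFrom,TRecordTo> idea. */
    interface Converter<F, T> {
        T convert(F from);
    }

    /** Unpacks records into maps down to maxDepth levels; deeper records are left as Rec. */
    static final class DepthLimitedConverter implements Converter<Rec, Map<String, Object>> {
        private final int maxDepth;

        DepthLimitedConverter(int maxDepth) {
            this.maxDepth = maxDepth;
        }

        @Override
        public Map<String, Object> convert(Rec from) {
            return unpack(from, 1);
        }

        private Map<String, Object> unpack(Rec rec, int depth) {
            Map<String, Object> out = new LinkedHashMap<>();
            for (Map.Entry<String, Object> e : rec.fields.entrySet()) {
                Object value = e.getValue();
                // Only descend into a nested record while we are above the depth limit.
                if (value instanceof Rec && depth < maxDepth) {
                    value = unpack((Rec) value, depth + 1);
                }
                out.put(e.getKey(), value);
            }
            return out;
        }
    }

    public static void main(String[] args) {
        Rec inner = new Rec().put("city", "London");
        Rec outer = new Rec().put("name", "Ada").put("address", inner);

        Map<String, Object> shallow = new DepthLimitedConverter(1).convert(outer);
        Map<String, Object> deep = new DepthLimitedConverter(2).convert(outer);

        System.out.println(shallow.get("address") instanceof Rec); // nested record kept as a record
        System.out.println(deep.get("address") instanceof Map);    // nested record unpacked
    }
}
```

Swapping in a different Converter implementation is how a caller would change the representation without touching the traversal code; in the real project that role is played by the AvroConverter implementations.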

ccsevers commented 9 years ago

This looks really awesome, thank you!

Let me poke around a bit to make sure I understand the changes then we'll get it merged in.

silasdavis commented 9 years ago

Okay feel free to run any questions by me or discuss changes to the PR

silasdavis commented 9 years ago

I should also add for the record that I worked on these changes whilst working at SwiftKey, my employer, and SwiftKey is happy to license the contribution under the Apache 2.0 license.

kkrugler commented 9 years ago

Hey Chris - have you had a chance to review?

ccsevers commented 9 years ago

Not yet. I think also we should probably do a fairly major version bump for this since it changes quite a bit.

From: Ken Krugler [mailto:notifications@github.com] Sent: Monday, February 16, 2015 7:29 AM To: ScaleUnlimited/cascading.avro Cc: Severs, Chris Subject: Re: [cascading.avro] Unified AvroSchemes with depth parameter, made conversion logic pluggable, supported unions (#36)

Hey Chris - have you had a chance to review?


kkrugler commented 9 years ago

Agreed re version update - go to 2.6.0? And bump Cascading version dependency from 2.5.5 to 2.6.3? I'm thinking we should save the 3.0 bump for when it's a release that targets Cascading 3.x

silasdavis commented 9 years ago

I've been using this in our production-ish system, and I've made some improvements to how Avro SpecificData (records, enums, fixed, etc.) are handled. The AvroSpecificRecordSerialization had some quirks (like the redundant WeakHashMap reference, unnecessary flushes, etc.). I've introduced a Serialization called AvroSpecificDataSerialization that is based on an abstract class from Hadoop that implements the base logic of Avro reading/writing and is used elsewhere. This class provides a broader set of serializations of intermediate Avro results and seems to have better performance.

I've added some tests for it, and in the process moved the generated Avro classes and given some of them more standard names.

I'll leave these commits separate for now for review, but I can squash them before merging.

silasdavis commented 9 years ago

I've made another small but important update today that ensures AvroSpecificDataSerialization flushes on every write. The superclass from Hadoop I was depending on only flushes on close, but Serializers are not meant to buffer (see our bug: https://issues.apache.org/jira/browse/HADOOP-11678).
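
A small plain-Java sketch of why flushing on every write matters. The Serializer interface here is a hypothetical stand-in shaped like a Hadoop-style serializer (open / serialize / close), not the real Hadoop or cascading.avro class. When the serializer wraps the sink in a buffer and does not flush, a caller inspecting the sink after a write sees nothing.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class FlushingSerializerSketch {
    /** Hypothetical interface shaped like a Hadoop-style Serializer. */
    interface Serializer<T> {
        void open(OutputStream out) throws IOException;
        void serialize(T value) throws IOException;
        void close() throws IOException;
    }

    /** Writes length-prefixed UTF-8 strings through an internal buffer. */
    static final class StringSerializer implements Serializer<String> {
        private final boolean flushOnWrite;
        private DataOutputStream out;

        StringSerializer(boolean flushOnWrite) {
            this.flushOnWrite = flushOnWrite;
        }

        @Override
        public void open(OutputStream sink) {
            out = new DataOutputStream(new BufferedOutputStream(sink));
        }

        @Override
        public void serialize(String value) throws IOException {
            byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
            out.writeInt(bytes.length);
            out.write(bytes);
            if (flushOnWrite) {
                out.flush(); // push the record to the sink before returning
            }
        }

        @Override
        public void close() throws IOException {
            out.close();
        }
    }

    /** Returns how many bytes have reached the sink right after a single serialize call. */
    static int bytesVisibleAfterOneWrite(boolean flushOnWrite) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            StringSerializer serializer = new StringSerializer(flushOnWrite);
            serializer.open(sink);
            serializer.serialize("abc");
            int visible = sink.size();
            serializer.close();
            return visible;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(bytesVisibleAfterOneWrite(false)); // record still sitting in the buffer
        System.out.println(bytesVisibleAfterOneWrite(true));  // 4-byte length prefix + 3 payload bytes
    }
}
```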

I've pulled the serialization logic down to AvroSpecificDataSerialization. I've also made it prefer Cascading's serializers for simple Java types such as long, by checking assignability to GenericContainer or Enum before accepting a class.
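
The assignability check could look roughly like the sketch below. It is plain Java with stand-in types: GenericContainerLike is a hypothetical placeholder for Avro's GenericContainer interface, and the accept method name is an assumption about how the serialization decides which classes it handles.

```java
class AcceptSketch {
    /** Hypothetical placeholder for Avro's GenericContainer interface. */
    interface GenericContainerLike {}

    /** Stand-in for a generated Avro record class. */
    static final class UserRecord implements GenericContainerLike {}

    /** Stand-in for a generated Avro enum. */
    enum Suit { HEARTS, SPADES }

    /** Accept only Avro-style containers and enums; leave everything else to other serializers. */
    static boolean accept(Class<?> c) {
        return GenericContainerLike.class.isAssignableFrom(c)
                || Enum.class.isAssignableFrom(c);
    }

    public static void main(String[] args) {
        System.out.println(accept(UserRecord.class)); // true
        System.out.println(accept(Suit.class));       // true
        System.out.println(accept(Long.class));       // false
    }
}
```

Rejecting Long and other simple types here means the framework falls through to whichever serializer is registered for them.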

Do you have any idea when we might be able to merge this?

silasdavis commented 9 years ago

Closing this pull request to replace it with one against the version-2.6 branch: https://github.com/ScaleUnlimited/cascading.avro/pull/37
