ScaleUnlimited / cascading.avro

Cascading Scheme for the Apache Avro data serialization format
Other
19 stars 25 forks source link

Unified AvroSchemes with depth parameter, made conversion logic pluggable, supported unions #36

Closed silasdavis closed 9 years ago

silasdavis commented 9 years ago

So I appreciate this is a rather large pull request. I've added various tests and I've tried to not break any existing APIs/contracts. I can look at splitting some of the changes out if it's really necessary, but some of them are linked and quite a few aspects of the project felt like they could use a cleanup.

Version bumps:

The old versions of Avro in particular were causing problems. For my usage I also required newer version of Cascading and Hadoop. It it is possibly and necessary to maintain different Hadoop/Cascading version I suggest we do so on separate branches.

Changes:

ccsevers commented 9 years ago

This looks really awesome, thank you!

Let me poke around a bit to make sure I understand the changes then we'll get it merged in.

silasdavis commented 9 years ago

Okay feel free to run any questions by me or discuss changes to the PR

silasdavis commented 9 years ago

I should also add for the record that I worked on these changes whilst working at SwiftKey, my employer, and SwiftKey is happy to license the contribution under the Apache 2.0 license.

kkrugler commented 9 years ago

Hey Chris - have you had a chance to review?

ccsevers commented 9 years ago

Not yet. I think also we should probably do a fairly major version bump for this since it changes quite a bit.

From: Ken Krugler [mailto:notifications@github.com] Sent: Monday, February 16, 2015 7:29 AM To: ScaleUnlimited/cascading.avro Cc: Severs, Chris Subject: Re: [cascading.avro] Unified AvroSchemes with depth parameter, made conversion logic pluggable, supported unions (#36)

Hey Chris - have you had a chance to review?

— Reply to this email directly or view it on GitHubhttps://github.com/ScaleUnlimited/cascading.avro/pull/36#issuecomment-74524917.

kkrugler commented 9 years ago

Agreed re version update - go to 2.6.0? And bump Cascading version dependency from 2.5.5 to 2.6.3? I'm thinking we should save the 3.0 bump for when it's a release that targets Cascading 3.x

silasdavis commented 9 years ago

I've been using this in our production-ish system, and I've made some improvements to how Avro SpecificData (records, enums, fixed, etc) are handled. The AvroSpecificRecordSerialization had some quirks (like the redundant WeakHashMap reference, unecessary flushes, etc). I've introduced a Serialization called AvroSpecificDataSerialization that is based on an abstract class from Hadoop that implements the base logic of Avro reading/writing and is used elsewhere. This class provides a broader set serializations of intermediate avro results and seems to have better performance.

I've added some tests for it, and in the process moved the generated avro classes and given some of them more standard names.

I'll leave these commits separate for now for review, but I can squash them before merging.

silasdavis commented 9 years ago

I've made another small but important update today that ensures AvroSpecificDataSerialization flushes on every write. The superclass from hadoop I was depending on only flushes on close, but Serializers are not meant to buffer (see our bug: https://issues.apache.org/jira/browse/HADOOP-11678).

I've pulled the serialization logic down to AvroSpecificDataSerialization. Also prefererred Cascading serializers for simple Java types such as long, by checking assignability to GenericContainer or Enum.

Do you have any idea when we might be able to merge this?

silasdavis commented 9 years ago

Cloasing this pull request to replace it with one against version-2.6 branch: https://github.com/ScaleUnlimited/cascading.avro/pull/37