Changes to allow the use of cascading.avro in Cascalog

kkrugler commented 11 years ago

I made a number of changes. The most notable is an overloaded constructor that accepts Avro field names, which may differ from the tuple field names. Cascalog prefixes tuple field names with ? and !, and those characters are not allowed in Avro field names. For example, the tuple field can be named "?name" while the Avro field is named "name".
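To illustrate the kind of name mapping described here, the following is a minimal sketch and not the code from the pull request. The class `CascalogFieldNames` and its methods are hypothetical names; the sketch only assumes that a usable Avro name can be obtained by dropping leading `?` and `!` characters from the tuple field name.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of cascading.avro: derives an Avro field name
// for each Cascalog tuple field by dropping leading '?' and '!' characters.
class CascalogFieldNames {

    // Returns the field name without its leading Cascalog prefix characters.
    static String toAvroName(String tupleField) {
        int start = 0;
        while (start < tupleField.length()
                && (tupleField.charAt(start) == '?' || tupleField.charAt(start) == '!')) {
            start++;
        }
        return tupleField.substring(start);
    }

    // Builds an ordered mapping from tuple field name to Avro field name.
    static Map<String, String> mapping(List<String> tupleFields) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String field : tupleFields) {
            result.put(field, toAvroName(field));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints the tuple-name to Avro-name pairs in insertion order.
        System.out.println(mapping(Arrays.asList("?name", "!age")));
    }
}
```

A mapping like this could be handed to a scheme constructor so that the tuple side keeps its Cascalog names while the Avro side only sees prefix-free names. It does not handle other characters that Avro rejects, such as hyphens.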

The README has details about other changes.

In short, this pull request provides:

new constructor to support tuple/avro field name mapping
updated avro to 1.6.3 (and made some adjustments as needed)
added change that ensured avro.codec is written out in the file
added automatic conversion of date
changes to POM to indicate new version/fork

I really just tried to stick with the coding style as much as possible, but feel this whole thing can be cleaned up a bit.

Pull as you please.

kkrugler commented 11 years ago

Thanks - I'll take a look and see if all/some of the mods you have made can be merged into the project.

kkrugler commented 11 years ago

Great. Let me know if you have any questions.

On Wed, May 9, 2012 at 9:52 AM, vmagotra <reply@reply.github.com> wrote:

Thanks - I'll take a look and see if all/some of the mods you have made can be merged into the project.

Reply to this email directly or view it on GitHub: https://github.com/bixolabs/cascading.avro/pull/7#issuecomment-5600507

kkrugler commented 11 years ago

Hi,

Updated to Avro 1.6.3. This had some interesting effects though. The newer Avro no longer supports nested Enums, so I needed to split out the test. I also ran into trouble with nullSchema and Map and Enum fields. So these are no longer "optional" if used.
<< Just to clarify on the above: do you mean that if you define a field in Avro that's a Map or an Enum, then it has to exist in the data being written (i.e. you can't rely on nullSchema to fill in a null value for you)?

I think the other changes are good to be rolled in...

kkrugler commented 11 years ago

Hi Mike - we just created a 2.0 branch, and merged most of the cascading-avro code (Sven's fork/modifications) with some of the cascading.avro code.

Once that gets merged into trunk, then I'll need to find time to look through your changes and figure out which ones to cherry-pick. E.g. I know cascading-avro had some support for Cascalog field renaming, but I haven't looked at how they implemented that.

-- Ken

kkrugler commented 11 years ago

Hi Mike,

If you're still interested in this can you have a look at 2.0-develop and see if it will do what you need it to? If not, can you make a pull request on that branch?

kkrugler commented 11 years ago

I'm definitely still interested in this and have been reliably using my fork for a little while now. I'm actually in the process of doing a wider upgrade in my own projects, and moving up to Cascading 2.0 in the process. This led me back to this thread, as I wanted to see where you guys were and whether you had made any progress upgrading to 2.x. I now see the 2.0-develop branch. I'll take a look and get back to you shortly.

kkrugler commented 11 years ago

I'm going to need to migrate some of my changes onto this branch, as there are things I need that I don't think it supports (correct me if I'm wrong). I will need:

Support for specifying the Avro output codec (null, deflate, etc.). In my patch I was passing this in as part of the job conf.
Support for renaming fields (i.e. Cascading tuple field names mapped to Avro field names). Cascalog uses field names like '?afield', which is not a valid Avro field name, so I needed a way to map ?a_field to an_avro_field. I had an overloaded constructor in my patch that handled this. I haven't seen how to do this yet on your branch (but could be missing it).
Support for Date class types (automatically converted to Long for the Avro file and back to Date on the Java side). This was the least important of my changes, but I found it convenient as I was working with a mixture of DB taps and Avro taps.

Those three changes are important for my use case. I'd be happy to provide a patch.
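The Date handling in the third item can be sketched roughly as follows. This is an illustration only, not the patch itself: the class `DateLongConversion` and its method names are hypothetical, and it assumes dates are stored as epoch milliseconds in an Avro long field.

```java
import java.util.Date;

// Hypothetical converter, not the patch's code: stores java.util.Date values
// as epoch milliseconds (an Avro "long") and restores them on read.
class DateLongConversion {

    // Write side: Date becomes a Long; every other value passes through unchanged.
    static Object toAvroValue(Object tupleValue) {
        if (tupleValue instanceof Date) {
            return ((Date) tupleValue).getTime();
        }
        return tupleValue;
    }

    // Read side: long becomes a Date, for fields the caller knows hold dates.
    static Date fromAvroValue(long epochMillis) {
        return new Date(epochMillis);
    }

    public static void main(String[] args) {
        Date original = new Date();
        Object stored = toAvroValue(original);
        Date restored = fromAvroValue((Long) stored);
        System.out.println(restored.equals(original));
    }
}
```

One design point worth noting: a plain long in the file carries no hint that it was a date, so the read side needs some other way (for example the declared Java field types) to know which longs to convert back.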

kkrugler commented 11 years ago

Hi Mike,

All those changes sound great. I can do the output codec if you don't want to worry about it but the others are probably best submitted as a patch. I think 2.0-develop will become master any day now and my guess is we'll have a separate develop branch where we can add new things.

One thing I just added which might be interesting for you is the ability to get the unpacked Avro record (similar to how SequenceFile support works) and also pass a packed Avro record to write out. I'm adding this to make it easier to use with the Scalding typed API but it might be useful for Cascalog too.

Regards, Chris

kkrugler commented 11 years ago

Thanks Chris. I'll take a look at the change you mentioned.

Another thing I remembered adding/needing is the ability to set fields as nullable. I hacked it so most/all of my fields were nullable, and can hack it for Cascalog based on naming conventions. But it would probably be best to come up with a more direct approach to specifying nullable fields in the Avro output/schema.

kkrugler commented 11 years ago

Hi Mike,

On Oct 26, 2012, at 9:22pm, Mike Stanley wrote:

Thanks Chris. I'll take a look at the change you mentioned.

Another thing I remembered adding/needing is the ability to set fields as nullable. I hacked it so most/all of my fields were nullable, and can hack it for Cascalog based on naming conventions. But it would probably be best to come up with a more direct approach to specifying nullable fields in the Avro output/schema.

cascading.avro supports unions, where the "other" field value is nullable.

Are you suggesting an option to automagically add that to all fields?

And would this be for reading, writing, or both?

Thanks,

-- Ken

Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr

kkrugler commented 11 years ago

Hi all,

I'm going to look at merging 2.0-dev into master this weekend, with whatever is in that branch.

Then I'll do a 2.1.0 release to Conjars.

After that we can add in Mike's changes (hopefully as a pull request). I guess those would be a 2.2 release, since it's new functionality vs. just bug fixes.

Makes sense?

Thanks,

-- Ken

kkrugler commented 11 years ago

Sounds good to me.

Chris

kkrugler commented 11 years ago

On my fork, I simply made all fields nullable, but I wouldn't recommend that as a general solution. It's the writing side where it matters, I think.

In Cascalog, nullable fields are named !field instead of ?field. It would be nice to infer nullability from that. I need to look at the changes in the dev branch more closely before making any recommendation with regard to this enhancement. It may be easier to do with the new implementation, or it may be more appropriate in some sort of lightweight Cascalog wrapper.
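As a rough illustration of inferring nullability from the naming convention, here is a minimal sketch. It is not code from either branch: the class `NullableFromPrefix` and its methods are hypothetical, and it simply emits the JSON for one Avro record field, using a `["null", type]` union when the Cascalog name starts with `!`.

```java
// Hypothetical sketch: builds the JSON for one Avro record field, making the
// field a ["null", type] union when the Cascalog name starts with '!'.
class NullableFromPrefix {

    // In Cascalog, a '!' prefix marks a variable that may be null.
    static boolean isNullable(String cascalogField) {
        return cascalogField.startsWith("!");
    }

    // Strips the Cascalog prefix and wraps the type in a union if nullable.
    static String avroFieldJson(String cascalogField, String avroType) {
        String name = cascalogField.replaceFirst("^[?!]+", "");
        String type = isNullable(cascalogField)
                ? "[\"null\", \"" + avroType + "\"]"
                : "\"" + avroType + "\"";
        return "{\"name\": \"" + name + "\", \"type\": " + type + "}";
    }

    public static void main(String[] args) {
        System.out.println(avroFieldJson("?name", "string"));
        System.out.println(avroFieldJson("!age", "int"));
    }
}
```

This only matters on the writing side, where the schema is generated from the tuple fields; a reader takes its nullability from the schema already stored in the file.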

I will let you know.

... Mike Please excuse typos (fat thumbing an iPad)

kkrugler commented 11 years ago

Sounds good to me too. I will come back around with patches, once I have a chance to take the 2.1.0 release for a spin.

kkrugler commented 11 years ago

Hi all,

a. I tagged master in GitHub as 1.0

b. I merged in the 2.1-develop branch

c. I set the version to be 2.1.0 in both the scheme and maven-plugin sub-project pom.xml files

d. I added a section to both pom.xml files:

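    <distributionManagement>
      <repository>
        <id>conjars</id>
        <name>Concurrent Conjars repository</name>
        <url>http://conjars.org/repo</url>
      </repository>
    </distributionManagement>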

If you then add an appropriate section to your ~/.m2/settings.xml file, you too can deploy to Conjars:

    <servers>
      <server>
        <id>conjars</id>
        <username>a registered username</username>
        <password>the password</password>
      </server>
    </servers>

e. I was able to deploy the scheme without any issues.

One oddity, though, is that since we're using cascading.avro as the groupId, this means it shows up in Conjars at http://conjars.org/repo/cascading/avro/

So it's in the Cascading namespace (for the Maven repo). I assume that's OK with Chris Wensel/Concurrent, but I should double-check.

f. I had an issue with deploying the maven-plugin

"mvn deploy" kind of worked here - it uploaded the jar/pom and associated files, but I got this error:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-deploy-plugin:2.5:deploy (default-deploy) on project avro-maven-plugin: Failed to deploy metadata: Could not transfer metadata org.apache.maven.artifact.repository.metadata.MetadataBridge@6c5bdfae from/to conjars (http://conjars.org/repo): Failed to transfer file: http://conjars.org/repo/cascading/avro/maven-metadata.xml. Return code is: 401 -> [Help 1]

I'm not sure why it's trying to write out a maven-metadata.xml file at the root of the cascading.avro package - probably something in the pom.xml would tell me, but I'm out of time today.

And I'm also not sure why this was rejected, but I assume it's a config setting for Conjars, where you can only create directories and then write files out to specific release dirs.

g. I tagged this version of the code as 2.1.0

h. I edited the pom.xml versions to be 2.2-SNAPSHOT, and pushed.

So we should be ready for further development.

Take a look, and if it seems good then we can post something to the mailing list.

Thanks!

-- Ken

kkrugler commented 11 years ago

This is from https://github.com/bixolabs/cascading.avro/pull/7

I'm hoping Mike Stanley can get back in sync with his modification.

dkincaid commented 9 years ago

Is there any hope of getting these changes in? Right now it seems that this is not usable at all from Cascalog.

mikestanley commented 9 years ago

I will take a look this week. No guarantees. It's been a long time since I needed anything further from this particular code, and it's literally just been running on autopilot. I'm probably years behind the latest code. That said, I'm guessing the changes are still pretty relevant. I will happily look to see if I can bring them forward as a pull request.

dkincaid commented 9 years ago

Thanks, Mike. I took a look at it today, but couldn't figure out where the change was needed myself.

dkincaid commented 8 years ago

Could someone at least point me to where in the code the changes to support Cascalog field names would need to be made?

kkrugler commented 8 years ago

Hi Dave. From what I can tell, the bulk of @mikestanley's changes are at https://github.com/mikestanley/cascading.avro/commit/330d1f02f73d145cdb6fc0fbf603fa3e04143cf1. There is a (small) change as well at https://github.com/mikestanley/cascading.avro/commit/47bc6c704922580b927e9cafe3be36537a20e6d7.

ScaleUnlimited / cascading.avro

Changes to allow the use of cascading.avro in Cascalog #11