Issue Indexing AVRO Files in Ruby

epicycle commented 11 years ago

I've modified the Hadoop mapper / reducer to work with CDH 4.1.2. This was mostly a matter of modifying the Ivy configuration and upgrading Avro to 1.7.3 with the hadoop2 classifier, but I also had to change MyAvroMultipleOutputs to use TaskAttemptContextImpl instead of TaskAttemptContext.

The log upload, mappers, and reducers seemed to work fine. I'm now stuck on the server side with an odd error in HyperSQL and Ruby.

Found 4 files to process
/staging/white-elephant/apache-tomcat-7.0.42/webapps/WhiteElephant/WEB-INF/app/usage_loader.rb:185 warning: ambiguous Java methods found, using submit(java.util.concurrent.Callable)
Failed loading file hdfs://cluster-company/data/hadoop/stats/usage-per-hour/cluster-company/2013/0828/part-r-00000.avro: data exception: string data, right truncation
org.hsqldb.jdbc.JDBCPreparedStatement.executeBatch(Unknown Source)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:597)
org.jruby.javasupport.JavaMethod.invokeDirectWithExceptionHandling(JavaMethod.java:440)
org.jruby.javasupport.JavaMethod.invokeDirect(JavaMethod.java:304)
org.jruby.java.invokers.InstanceMethodInvoker.call(InstanceMethodInvoker.java:52)
org.jruby.internal.runtime.methods.AliasMethod.call(AliasMethod.java:56)
org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:306)
org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:136)
org.jruby.ast.CallNoArgNode.interpret(CallNoArgNode.java:64)
org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)
org.jruby.ast.BlockNode.interpret(BlockNode.java:71)
org.jruby.ast.IfNode.interpret(IfNode.java:116)
org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)
org.jruby.ast.BlockNode.interpret(BlockNode.java:71)
org.jruby.ast.RescueNode.executeBody(RescueNode.java:224)
org.jruby.ast.RescueNode.interpret(RescueNode.java:119)
org.jruby.evaluator.ASTInterpreter.INTERPRET_METHOD(ASTInterpreter.java:75)
org.jruby.internal.runtime.methods.InterpretedMethod.call(InterpretedMethod.java:112)
org.jruby.internal.runtime.methods.DefaultMethod.call(DefaultMethod.java:154)
UsageFileLoadTask_1496665911.call(UsageFileLoadTask_1496665911.gen:13)
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
java.util.concurrent.FutureTask.run(FutureTask.java:138)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)

Cleaning up data for file hdfs://cluster-company/data/hadoop/stats/usage-per-hour/cluster-company/2013/0828/part-r-00000.avro with ID 3

The Avro files seem to be fine, but it's hard to tell as this is my first time using White Elephant. The Avro files for the parsed logs are certainly a lot larger, coming in at around 80 MB each, whereas the hourly files are 2-12 KB each.

Has anyone else run into this problem? Any ideas where to go from here? Could this be an LZO decompression issue?

Thanks for the help.

epicycle commented 11 years ago

I tried disabling compression on the job output in the base.properties and regenerating all of the data but that didn't help. Any help anyone can offer from here would be appreciated.

matthayes commented 11 years ago

Hmm I don't recognize this error. Do you have a patch or a repo where I can try testing out your changes to reproduce it?

epicycle commented 11 years ago

After digging into the error online, it turns out the schema was too constrained for our usage. Our usernames are longer, our file names are longer, etc. I increased the size of all of the varchars in the usage_database.rb file and voilà, it works! For now I made all of the varchars 50, and for the filename I used CHAR VARYING(5000) instead of varchar.
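For anyone hitting the same "string data, right truncation" error, here is a minimal Ruby sketch of how one might find which field is too long before the batch insert fails. It is not part of White Elephant: the `oversized_fields` helper, the field names, and the limits are all made up for illustration, and the real column definitions are whatever usage_database.rb declares.

```ruby
# Hypothetical helper, not from White Elephant: reports string values
# that are longer than the length declared for their column.
# Returns an array of [field, actual_length, limit] triples.
def oversized_fields(record, limits)
  record.each_with_object([]) do |(field, value), found|
    limit = limits[field]
    # Skip fields with no declared limit and values that are not strings.
    next if limit.nil? || !value.is_a?(String)
    found << [field, value.length, limit] if value.length > limit
  end
end

# Illustrative limits and record only; substitute the sizes from your schema.
limits = { "username" => 8, "filename" => 32 }
record = {
  "username" => "a_rather_long_service_account",
  "filename" => "job.xml",
  "hours"    => 3
}

oversized_fields(record, limits).each do |field, length, limit|
  puts "#{field}: #{length} characters exceeds the column limit of #{limit}"
end
```

Running a check like this over a sample of records from the failing Avro file should point at the columns that need to be widened, rather than having to guess from the stack trace.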

matthayes commented 11 years ago

Ah wonderful :)

cscetbon commented 10 years ago

Those types should be changed. I got the same error and it took time to find it :(

