hannesmiller commented 7 years ago

The following code:

  def JdbcSinkToOrc: Unit = {
    implicit val hadoopConfiguration = new Configuration()
    implicit val hadoopFileSystem = FileSystem.get(hadoopConfiguration)
    // Write to a OrcSink from a JDBCSource
    val query = "SELECT NAME, AGE, SALARY, CREATION_TIME FROM PERSON"
    val orcFilePath = new Path("hdfs://nameservice1/client/eel/person.orc")
    if (hadoopFileSystem.exists(orcFilePath)) hadoopFileSystem.delete(orcFilePath, true)
    JdbcSource(() => dataSource.getConnection, query).withFetchSize(10)
      .toFrame.to(OrcSink(orcFilePath))
  }

Produces the follow stack trace:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hive.serde2.io.HiveDecimalWritable.<init>(J)V
at org.apache.orc.impl.ColumnStatisticsImpl$DecimalStatisticsImpl.<init>(ColumnStatisticsImpl.java:789)
at org.apache.orc.impl.ColumnStatisticsImpl.create(ColumnStatisticsImpl.java:1384)
at org.apache.orc.impl.WriterImpl$TreeWriter.<init>(WriterImpl.java:511)
at org.apache.orc.impl.WriterImpl$DecimalTreeWriter.<init>(WriterImpl.java:1977)
at org.apache.orc.impl.WriterImpl.createTreeWriter(WriterImpl.java:2526)
at org.apache.orc.impl.WriterImpl.access$1600(WriterImpl.java:95)
at org.apache.orc.impl.WriterImpl$StructTreeWriter.<init>(WriterImpl.java:2074)
at org.apache.orc.impl.WriterImpl.createTreeWriter(WriterImpl.java:2529)
at org.apache.orc.impl.WriterImpl.<init>(WriterImpl.java:191)
at org.apache.orc.OrcFile.createWriter(OrcFile.java:671)
at io.eels.component.orc.OrcWriter.writer$lzycompute(OrcWriter.scala:53)
at io.eels.component.orc.OrcWriter.writer(OrcWriter.scala:53)
at io.eels.component.orc.OrcWriter.flush(OrcWriter.scala:83)
at io.eels.component.orc.OrcWriter.close(OrcWriter.scala:94)
at io.eels.component.orc.OrcSink$$anon$1.close(OrcSink.scala:35)
at io.eels.actions.SinkAction$.execute(SinkAction.scala:19)
at io.eels.Frame$class.to(Frame.scala:345)
at io.eels.SourceFrame.to(SourceFrame.scala:17)
at io.eels.testing.EelSourceToSinks$.JdbcSinkToOrc(EelSourceToSinks.scala:93)
at io.eels.testing.EelSourceToSinks$.delayedEndpoint$io$eels$testing$EelSourceToSinks$1(EelSourceToSinks.scala:56)
at io.eels.testing.EelSourceToSinks$delayedInit$body.apply(EelSourceToSinks.scala:23)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)

sksamuel commented 7 years ago

It must be because orc 1.3.0 is now used. It might use a newer hadoop.

On 30 Jan 2017 7:02 a.m., "hannesmiller" notifications@github.com wrote:

The following code:

def JdbcSinkToOrc: Unit = { implicit val hadoopConfiguration = new Configuration() implicit val hadoopFileSystem = FileSystem.get(hadoopConfiguration) // Write to a OrcSink from a JDBCSource val query = "SELECT NAME, AGE, SALARY, CREATION_TIME FROM PERSON" val orcFilePath = new Path("hdfs://gcstrd03.de.db.com/client/eel/person.orc") if (hadoopFileSystem.exists(orcFilePath)) hadoopFileSystem.delete(orcFilePath, true) JdbcSource(() => dataSource.getConnection, query).withFetchSize(10) .toFrame.to(OrcSink(orcFilePath)) }

Produces the follow stack trace:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hive.serde2.io.HiveDecimalWritable.(J)V at org.apache.orc.impl.ColumnStatisticsImpl$DecimalStatisticsImpl.(ColumnStatisticsImpl.java:789) at org.apache.orc.impl.ColumnStatisticsImpl.create(ColumnStatisticsImpl.java:1384) at org.apache.orc.impl.WriterImpl$TreeWriter.(WriterImpl.java:511) at org.apache.orc.impl.WriterImpl$DecimalTreeWriter.(WriterImpl.java:1977) at org.apache.orc.impl.WriterImpl.createTreeWriter(WriterImpl.java:2526) at org.apache.orc.impl.WriterImpl.access$1600(WriterImpl.java:95) at org.apache.orc.impl.WriterImpl$StructTreeWriter.(WriterImpl.java:2074) at org.apache.orc.impl.WriterImpl.createTreeWriter(WriterImpl.java:2529) at org.apache.orc.impl.WriterImpl.(WriterImpl.java:191) at org.apache.orc.OrcFile.createWriter(OrcFile.java:671) at io.eels.component.orc.OrcWriter.writer$lzycompute(OrcWriter.scala:53) at io.eels.component.orc.OrcWriter.writer(OrcWriter.scala:53) at io.eels.component.orc.OrcWriter.flush(OrcWriter.scala:83) at io.eels.component.orc.OrcWriter.close(OrcWriter.scala:94) at io.eels.component.orc.OrcSink$$anon$1.close(OrcSink.scala:35) at io.eels.actions.SinkAction$.execute(SinkAction.scala:19) at io.eels.Frame$class.to(Frame.scala:345) at io.eels.SourceFrame.to(SourceFrame.scala:17) at io.eels.testing.EelSourceToSinks$.JdbcSinkToOrc(EelSourceToSinks.scala:93) at io.eels.testing.EelSourceToSinks$.delayedEndpoint$io$eels$testing$EelSourceToSinks$1(EelSourceToSinks.scala:56) at io.eels.testing.EelSourceToSinks$delayedInit$body.apply(EelSourceToSinks.scala:23) at scala.Function0$class.apply$mcV$sp(Function0.scala:34) at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12) at scala.App$$anonfun$main$1.apply(App.scala:76) at scala.App$$anonfun$main$1.apply(App.scala:76) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35) at scala.App$class.main(App.scala:76)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/sksamuel/eel-sdk/issues/234, or mute the thread https://github.com/notifications/unsubscribe-auth/AAtZGqfStt6cRTR6EpC7lAyNkfKbW_KMks5rXYsRgaJpZM4LxGP- .

hannesmiller commented 7 years ago

Yeah it appears the implementation for Orc depends on a Hive Serde - the offending exception is looking for a HiveDecimalWritable constructor with a single String argument - check out org.apache.orc.impl.ColumnStatisticsImpl - for this scenario Parquet is better because there isn't an underlying dependency on Hive.

sksamuel commented 7 years ago

Do you think we should downgrade?

hannesmiller commented 7 years ago

Yes I think so as upgrading runs the risk of other source and sinks breaking.

omalley commented 7 years ago

Let me take a look at what is causing the problem. I assume it is a conflict with the version of Hive.

omalley commented 7 years ago

Ok, it was the change that Matt put into Hive's storage api to speed up decimals (HIVE-15335). It looks like I'll need to roll a new release of hive-storage-api and ORC.

sksamuel commented 7 years ago

Thanks @omalley , we'll upgrade once its released.

hannesmiller commented 7 years ago

In the interim I still think we should downgrade until the new release of hive-storage-api for ORC is available.

I get errors where there are non-decimal types as well, e.g.:

        Path orcFilePath = new Path("hdfs://nameservice1/client/eel_java/person.orc");
        Configuration hadoopConfiguration = new Configuration();
        FileSystem hadoopFileSystem = FileSystem.get(hadoopConfiguration);
        if (hadoopFileSystem.exists(orcFilePath)) hadoopFileSystem.delete(orcFilePath, true);

        // Build schema
        StructType schema = StructTypeBuilder.builder()
                .withField("Name", Types4j.StringType)
                .withField("Age", Types4j.IntSignedType)
                .build();

        // Create rows and write them
        OrcSink4j orcSink = new OrcSink4j(orcFilePath);
        RowBuilder.Builder rowBuilder = RowBuilder.builder(schema);
        List<Row> rows = Arrays.asList
                (
                        rowBuilder.add("Fred", 21).build(),
                        rowBuilder.reset().add("John", 28).build(),
                        rowBuilder.reset().add("Alice", 17).build()
                );
        orcSink.write(rows);

        // Read and display
        OrcSource4j orcSource = new OrcSource4j(orcFilePath);
        orcSource
                .toFrame()
                .toList().forEach(System.out::println);

Exception

Exception in thread "pool-5-thread-1" java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch.getDataColumnCount()I
    at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1096)
    at io.eels.component.orc.OrcBatchIterator$$anon$1.hasNext(OrcBatchIterator.scala:56)
    at io.eels.CloseableIterator$$anon$1.hasNext(CloseableIterator.scala:10)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at io.eels.CloseableIterator$$anon$1.foreach(CloseableIterator.scala:9)
    at io.eels.CloseableIterator$class.foreach(CloseableIterator.scala:18)
    at io.eels.component.orc.OrcPart$$anon$1.foreach(OrcSource.scala:48)
    at io.eels.SourceFrame$$anonfun$rows$1$$anonfun$apply$1.apply$mcV$sp(SourceFrame.scala:32)
    at io.eels.SourceFrame$$anonfun$rows$1$$anonfun$apply$1.apply(SourceFrame.scala:30)
    at io.eels.SourceFrame$$anonfun$rows$1$$anonfun$apply$1.apply(SourceFrame.scala:30)
    at scala.util.Try$.apply(Try.scala:192)
    at com.sksamuel.exts.concurrent.ExecutorImplicits$RichExecutorService$$anon$2.run(ExecutorImplicits.scala:28)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

looks like its using the wrong version of hive-exec , e.g. org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch

omalley commented 7 years ago

It looks like I'll need:

HIVE-15929 to make hive-storage-api 2.2.1
ORC-147 to make orc-core 1.3.3
We'll need to update your dependencies so that you use:
- org.apache.hive:hive-storage-api:2.2.1
- org.apache.orc:orc-core:1.3.3
- org.apache.hive:hive-exec:2.1.1 with classifier "core" and excluding:
  - org.apache.hive:hive-storage-api
  - org.apache.hive:hive-orc

That will give you a set of versions that work together. I'm sorry for the issue.

hannesmiller commented 7 years ago

No problem @omalley - thanks for the prompt response.

Sam, I think this is doable... I don't think the exclusions/inclusions of the Artefacts in SBT are going to impact the Hive Source and Sinks for other dialects right?

If we could get an M2 in central...I'll test it ASAP.

Cheers, Hannes

omalley commented 7 years ago

Apache releases take 3 days, so it will be early next week for the new releases.

hannesmiller commented 7 years ago

No problem

omalley commented 7 years ago

Ok, the storage-api and orc releases have been made and pushed to Maven central. Here's a patch that updates the dependencies. https://github.com/omalley/eel-sdk/tree/orc-upgrade

I didn't realize that you were using the older 1.2 version of Hive, so there are some compilation problems in your hive module that I don't have time to track down.

sksamuel commented 7 years ago

I think we should bump our version of Hive to something more modern @hannesmiller

omalley commented 7 years ago

The patch I included bumps it to Hive 2.1.1, which is the current. That is why your hive module has issues.

sksamuel commented 7 years ago

Thanks @omalley

51zero / eel-sdk

Something has regressed on the M1 release for Orc - NoSuchMethodError HiveDecimalWritable #234

I get errors where there are non-decimal types as well, e.g.:

Exception