51zero / eel-sdk

Big Data Toolkit for the JVM
Apache License 2.0

Support for Parquet arrays #177

Closed: tolomaus closed this issue 7 years ago

tolomaus commented 7 years ago

Hi,

I'm using eel to read a parquet file with the following schema (created in spark): case class WordVector(word: String, vector: Array[Double])

But it looks like arrays are not supported at the moment. Is this on the roadmap?
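For context, the file was written in Spark roughly like this (a simplified sketch; the actual paths and values differ):

import org.apache.spark.sql.SparkSession

case class WordVector(word: String, vector: Array[Double])

val spark = SparkSession.builder().appName("write-word-vectors").master("local[*]").getOrCreate()
import spark.implicits._

// produces a parquet file with the schema word: string, vector: array<double>
Seq(WordVector("hello", Array(0.1, 0.2, 0.3))).toDS().write.parquet("word_vectors.parquet")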

Many thanks

hannesmiller commented 7 years ago

Hi Tolomaus,

Are you using the ParquetSource or HiveSource to read your parquet files?

Sam what do you think?

tolomaus commented 7 years ago

Hi Hannes,

I'm using the ParquetSource.

The log message was generated by https://github.com/eel-sdk/eel/blob/master/eel-components/src/main/scala/io/eels/component/avro/AvroSchemaFns.scala#L53, which doesn't handle ARRAY even though it exists in the Avro enum "Type".
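For illustration, using the plain Avro API rather than eel's code, the enum does include ARRAY, and array schemas expose their element type:

import org.apache.avro.Schema

// matching on Avro's Schema.Type: the ARRAY case is the one AvroSchemaFns doesn't handle yet
def describe(schema: Schema): String = schema.getType match {
  case Schema.Type.STRING => "string"
  case Schema.Type.DOUBLE => "double"
  case Schema.Type.ARRAY  => s"array of ${describe(schema.getElementType)}"
  case other              => other.getName
}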

hannesmiller commented 7 years ago

Yes, that's what I was looking at. I have mentioned this to Sam; let's see what he has to say.

He briefly mentioned there might be another library that could be useful to you.

tolomaus commented 7 years ago

With the help of some debugging I have been able to get it to work with the following lines of code:

.map { row =>
  val word = row.values(0).asInstanceOf[String]
  val vector = row.values(1).asInstanceOf[util.ArrayList[Double]].toArray.map(_.asInstanceOf[Record].get(0).asInstanceOf[Double])
  (word, vector)
}

If there is an easier way to get to the same result I would be happy to know.

Thanks

hannesmiller commented 7 years ago

Wait for Sam to get back... As a personal preference, I would use pattern matching, something like below:

import scala.collection.JavaConversions._

.map { row =>
  row.values(0) match {
    case word: String =>
      (word, row.values(1) match {
        case v: util.List[_] =>
          v.map {
            case r: Record => r.get(0).asInstanceOf[Double]
          }
      })
  }
}

hannesmiller commented 7 years ago

Sorry about the indentation :-(

tolomaus commented 7 years ago

Thanks for the suggestions!

I have been able to bring it down to the following lines:

import scala.collection.JavaConverters._

.map { row =>
  val word = row.values(0) match { case s: String => s }
  val vector = row.values(1) match {
    case list: java.util.Collection[_] =>
      list.asScala.map(_.asInstanceOf[Record].get(0).asInstanceOf[Double])
  }
  (word, vector)
}

I'm looking at avro4s to convert the parquet data into case classes, but it works on a Record, while eel gives me a Row (which it generated itself from a Record internally).
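For reference, the avro4s usage I have in mind looks roughly like this (a sketch; it needs an Avro GenericRecord, which is why eel's Row doesn't fit directly):

import com.sksamuel.avro4s.RecordFormat
import org.apache.avro.generic.GenericRecord

case class WordVector(word: String, vector: Array[Double])

// avro4s bridges case classes and Avro GenericRecords, but not eel Rows
val format = RecordFormat[WordVector]
def toCaseClass(record: GenericRecord): WordVector = format.from(record)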

sksamuel commented 7 years ago

I think we just need to update AvroSchemaFns to support arrays/structs/records. Let's put that on the roadmap for 1.1 next week @hannesmiller

sksamuel commented 7 years ago

This is now in master. I've written support for arrays and records, and generated a 5MB Parquet file which I will use for issue #180.

sksamuel commented 7 years ago

Turns out my support wasn't complete: it didn't support arrays of non-primitives. So I'm adding that now, but I have to change the way schemas are defined first.
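To sketch the idea (hypothetical types, not eel's actual schema classes): the schema needs to become recursive, so that an array field can carry the type of its elements, including nested structs.

// hypothetical sketch only - not the real eel schema model
sealed trait DataType
case object StringType extends DataType
case object DoubleType extends DataType
case class ArrayType(elementType: DataType) extends DataType
case class Field(name: String, dataType: DataType)
case class StructType(fields: Seq[Field]) extends DataType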

tolomaus commented 7 years ago

Thanks, as soon as 1.1.0 is released I will give it a try

sksamuel commented 7 years ago

There's a 1.1.0 snapshot, FYI.


tolomaus commented 7 years ago

Thanks, I tried with

lazy val eels = RootProject(uri("https://github.com/sksamuel/eel-sdk.git"))

lazy val root = (project in file("."))
  .enablePlugins(PlayScala)
  .dependsOn(eels)

but although sbt update did its job correctly, the compiler can't seem to find the library. Could it be that your sbt config needs some special handling here?

sksamuel commented 7 years ago

I think you need to add the Maven snapshot repo.
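Something along these lines in build.sbt should do it (the exact artifact coordinates below are a guess):

// sketch: resolve 1.1.0-SNAPSHOT from the Sonatype snapshots repository
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"

libraryDependencies += "io.eels" %% "eel-core" % "1.1.0-SNAPSHOT"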


tolomaus commented 7 years ago

Cool, I got the snapshot loaded now.

I tried the following code, but it failed with a cast exception:

    // attempt 1: cast straight to a primitive array
    val vector = row.values(1).asInstanceOf[Array[Double]]
    // attempt 2: cast to a Java list and convert
    val vector = row.values(1).asInstanceOf[util.ArrayList[Double]].asScala.toArray

Here is the existing code that still works:

    val vector = row.values(1).asInstanceOf[util.ArrayList[Double]].toArray.map(_.asInstanceOf[Record].get(0).asInstanceOf[Double])

I had a look at the commits but couldn't derive how to get arrays supported. Any idea?

sksamuel commented 7 years ago

It should be converting them back into a normal array.

val vector = row.values(1).asInstanceOf[Array[Double]]

sksamuel commented 7 years ago

I wrote this to test the files that you reported as slow. There's still optimisation work I need to do, but this shows how arrays are used.

https://github.com/sksamuel/eel-sdk/blob/master/eel-components/src/test/scala/io/eels/component/parquet/ParquetReadTest.scala

tolomaus commented 7 years ago

Nice, I was able to pinpoint the issue: it looks like the toVector in your code doesn't actually unbox the double. Could you try with toArray instead? This should trigger the error.

tolomaus commented 7 years ago

Oops, sorry, I mixed up my wording; forget my previous comment.

If you add the following line on line 69 of https://github.com/sksamuel/eel-sdk/blob/master/eel-components/src/test/scala/io/eels/component/parquet/ParquetReadTest.scala:

.map(row => row.values(1).asInstanceOf[Array[Double]])

then it should trigger the cast error.
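The failure itself is just a JVM-level cast issue rather than anything eel-specific; a minimal illustration outside of eel:

// a Scala Vector of boxed doubles is not a primitive double[] on the JVM,
// so this compiles but throws a ClassCastException at runtime when uncommented
val value: Any = Vector(1.0, 2.0, 3.0)
// val vector = value.asInstanceOf[Array[Double]]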