Closed tolomaus closed 7 years ago
Hi Tolomaus,
Are you using the ParquetSource or HiveSource to read your parquet files?
Sam what do you think?
Hi Hannes,
I'm using the ParquetSource.
The log message was generated from https://github.com/eel-sdk/eel/blob/master/eel-components/src/main/scala/io/eels/component/avro/AvroSchemaFns.scala#L53 which doesn't include ARRAY even though it exists in the avro enum "Type"
Yes that's what I was looking at - I have mentioned this to Sam - see what he has to say.
He mentioned briefly there might be another library that might be useful to you.
With the help of some debugging I have been able to get it to work with the following lines of code:
.map { row =>
val word = row.values(0).asInstanceOf[String]
val vector = row.values(1).asInstanceOf[util.ArrayList[Double]].toArray.map(_.asInstanceOf[Record].get(0).asInstanceOf[Double])
(word, vector)
}
If there is an easier way to get to the same result I would be happy to know.
Thanks
Wait for Sam to get back...a personal preference I would prefer pattern matching like something below:
import scala.collection.JavaConversions._
.map { row =>
row.values(0) match {
case word:String =>
( word, match row.values(1) match {
case v:util.List[Double] =>
v.map { e =>
e match {
case r:Record => r(0).asInstanceOf[Double])
}
}
}
)
}
}
}
Sorry about the indentation :-(
Thanks for the suggestions!
I have been able to bring it down to the following lines:
.map { row =>
val word = row.values(0) match { case s: String => s }
val vector = row.values(1) match { case list: java.util.Collection[_] => list.asScala.map(_.asInstanceOf[Record].get(0).asInstanceOf[Double])}
(word, vector)
}
I'm looking at avro4s to convert the parquet data into case classes but it works on a Record, while eel gives me a Row (which it generated itself from a Record internally).
I think we just need to update AvroSchemaFns
to support arrays/structs/records. Let's put that on the roadmap for 1.1 next week @hannesmiller
This is now in master. I've written support for arrays and records, and then generated a 5mb parquet file which I will use for issue #180
Turns out my support wasn't completed, it didn't support arrays of non-primitives. So adding that now, but I have to change the way schemas are defined first.
Thanks, as soon as 1.1.0 is released I will give it a try
There's a 1.1.0 snapshot fyi
On 29 Nov 2016 11:50 a.m., "Niek Bartholomeus" notifications@github.com wrote:
Thanks, as soon as 1.1.0 is released I will give it a try
— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/sksamuel/eel-sdk/issues/177#issuecomment-263550442, or mute the thread https://github.com/notifications/unsubscribe-auth/AAtZGjfF5ulBMhPxOUAvsXT-knHLSVUTks5rDBFvgaJpZM4KnZ4a .
Thanks, I tried with
lazy val eels = RootProject(uri("https://github.com/sksamuel/eel-sdk.git"))
lazy val root = (project in file("."))
.enablePlugins(PlayScala)
.dependsOn(eels)
but although sbt update
did its job correctly the compiler can't seem to find the library. Could it be that your sbt config needs some special handling here?
I think you need to add maven snapshot repo.
On 29 Nov 2016 1:27 p.m., "Niek Bartholomeus" notifications@github.com wrote:
Thanks, I tried with
lazy val eels = RootProject(uri("https://github.com/sksamuel/eel-sdk.git")) lazy val root = (project in file(".")) .enablePlugins(PlayScala) .dependsOn(eels)
but although sbt update did its job correctly the compiler can't seem to find the library. Could it be that your sbt config needs some special handling here?
— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/sksamuel/eel-sdk/issues/177#issuecomment-263569621, or mute the thread https://github.com/notifications/unsubscribe-auth/AAtZGrnCjGfQk4KaOpK6vgWMh7WopCa6ks5rDChTgaJpZM4KnZ4a .
Cool I got the snapshot loaded now.
I tried following code but it failed with a cast exception:
val vector = row.values(1).asInstanceOf[Array[Double]]
val vector = row.values(1).asInstanceOf[util.ArrayList[Double]].asScala.toArray
Here is the existing code that still works:
val vector = row.values(1).asInstanceOf[util.ArrayList[Double]].toArray.map(_.asInstanceOf[Record].get(0).asInstanceOf[Double])
I had a look at the commits but couldn't derive how to get arrays supported. Any idea?
It should be converting them back into a normal array.
val vector = row.values(1).asInstanceOf[Array[Double]]
I wrote this to test the files that you reported as slow. There's still optimisation work I need to do, but this shows how arrays are used.
Nice, I was able to pinpoint the issue: it looks like the toVector
in your code doesn't actually unbox the double. Could you try with toArray
instead? This should trigger the error.
Oops sorry I mixed up my wordings, forget my previous comment.
If you add the following line on line 69 of https://github.com/sksamuel/eel-sdk/blob/master/eel-components/src/test/scala/io/eels/component/parquet/ParquetReadTest.scala:
.map(row => row.values(1).asInstanceOf[Array[Double]])
then it should trigger the cast error.
Hi,
I'm using eel to read a parquet file with the following schema (created in spark): case class WordVector(word: String, vector: Array[Double])
But it looks like arrays are not supported at the moment. Is this on the roadmap?
Many thanks