databricks / spark-avro

Avro Data Source for Apache Spark
http://databricks.com/
Apache License 2.0

Big performance issue when moving from 2.0.1 to 4.0.0 when loading column of type ArrayType #267

Open arthurdk opened 6 years ago

arthurdk commented 6 years ago

Hello,

We just upgraded our stack from Spark 1.6 to Spark 2.2, and with that we moved from com.databricks:spark-avro_2.10:2.0.1 to com.databricks:spark-avro_2.11:4.0.0.

We noticed a huge increase in the running time of one of our scripts. Here is the schema of the files we are loading from HDFS:

df.printSchema
root
|-- field1: string (nullable = true)
|-- field2: string (nullable = true)
|-- field3: array (nullable = true)
|    |-- element: struct (containsNull = false)
|    |    |-- _0: string (nullable = true)
|    |    |-- _1: integer (nullable = true)
|    |    |-- _2: long (nullable = true)

In Spark 1 our script runs in ~2 minutes vs ~40 minutes in Spark 2.

At first I suspected that our script and user-defined functions were the slow part. But then I reduced the script to simply read and write our file:

import com.databricks.spark.avro._
spark.read.avro("/path/to/file/in").write.avro("/path/to/file/out")

And we were still facing the same performance issue: in Spark 1 this runs in ~2 minutes and in Spark 2 this runs in ~40 minutes.
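
To put comparable numbers on the read/write round trip in both versions, a small timing wrapper can be used. This is only a minimal sketch: `timed` is a hypothetical helper (not part of Spark or spark-avro), and the paths in the commented usage are the same placeholders as above.

```scala
// Hypothetical helper, not part of spark-avro: runs a block once and
// returns its result together with the elapsed wall-clock time in seconds.
def timed[A](label: String)(body: => A): (A, Double) = {
  val start = System.nanoTime()
  val result = body
  val seconds = (System.nanoTime() - start) / 1e9
  println(f"$label took $seconds%.1f s")
  (result, seconds)
}

// Assumed usage in spark-shell with spark-avro on the classpath:
// import com.databricks.spark.avro._
// timed("avro round trip") {
//   spark.read.avro("/path/to/file/in").write.avro("/path/to/file/out")
// }
```

Running the same block under both stacks gives wall-clock figures for the plain read/write job, independent of any user-defined functions.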

To give you more information on the files we are loading: there are ~2,500,000 entries, and the number of struct elements in the array can be quite high:

val df = spark.read.avro("/path/to/file/in")
df.select(size(col("field3")).as("size")).select(avg(col("size")), min(col("size")), max(col("size"))).show
+-----------------+---------+---------+
|        avg(size)|min(size)|max(size)|
+-----------------+---------+---------+
|133.0953942943108|        1|   143220|
+-----------------+---------+---------+

Could you look into this? If you need any additional information feel free to ask!

gengliangwang commented 6 years ago

The read and write paths are indeed slower in the current release. In version 2.0.1:

read path:   Avro => Row
write path: Row => Avro

while in 4.0:

read path:   Avro => Row => InternalRow
write path:  InternalRow => Row => Avro

The conversion between Row and InternalRow is slow.

The upside is that computation is faster: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html

In the next release, this problem should be fixed by converting directly:

read path:   Avro => InternalRow
write path:  InternalRow => Avro

arthurdk commented 6 years ago

Many thanks for the explanation! Do you have an ETA for the next release?

gengliangwang commented 6 years ago

There is no ETA yet. I will comment on this issue once it is fixed.

ryanivanka commented 6 years ago

May I ask if this issue is already fixed?

With our test Avro file, we see a performance degradation of more than 10% compared to Spark 1.6.