Open arthurdk opened 6 years ago
The read and write paths are indeed slower in the current release. For version 2.0.1:
read path: Avro => Row
write path: Row => Avro
while in 4.0:
read path: Avro => Row => InternalRow
write path: InternalRow => Row => Avro
The conversion between Row and InternalRow is slow.
The upside is that computation is faster: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
In the next release, this problem should be fixed as:
read path: Avro => InternalRow
write path: InternalRow => Avro
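For reference, the user-facing API is the same across all of these versions; the extra Row ⇄ InternalRow conversions happen inside the data source. A minimal read/write sketch using the spark-avro implicits (the HDFS paths and the `spark` session are assumptions for illustration):

```scala
// Sketch, assuming an existing SparkSession named `spark` and hypothetical paths.
// The com.databricks.spark.avro._ import adds the .avro() shorthand to
// DataFrameReader and DataFrameWriter.
import com.databricks.spark.avro._

// Read path: Avro => Row => InternalRow in 4.0 (Avro => Row in 2.0.1)
val df = spark.read.avro("hdfs:///data/in.avro")

// Write path: InternalRow => Row => Avro in 4.0 (Row => Avro in 2.0.1)
df.write.avro("hdfs:///data/out.avro")
```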
Many thanks for the explanation! Do you have an ETA for the next release?
There is no ETA yet. I will comment on this issue once it is fixed.
May I ask if this issue is already fixed?
Our test Avro file shows more than a 10% performance degradation compared to Spark 1.6.
Hello,
We just upgraded our stack from Spark 1.6 to Spark 2.2, and with that we moved from com.databricks:spark-avro_2.10:2.0.1 to com.databricks:spark-avro_2.11:4.0.0. We noticed a huge increase in the running time of one of our scripts. Here is the schema of the files we are loading from HDFS:
In Spark 1 our script runs in ~2 minutes vs ~40 minutes in Spark 2.
At first I suspected our script & user-defined functions were the slow part. But then I updated the script to simply read & write our file:
And we were still facing the same performance issue: in Spark 1 this runs in ~2 minutes and in Spark 2 this runs in ~40 minutes.
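A bare read-and-write round trip like the one described could be timed roughly like this (a sketch, not the original script; the paths and the `spark` session are assumptions):

```scala
// Sketch: time a plain Avro read & write round trip with spark-avro,
// assuming an existing SparkSession named `spark` and hypothetical paths.
import com.databricks.spark.avro._

val start = System.nanoTime()
spark.read.avro("hdfs:///data/events.avro")
  .write.avro("hdfs:///data/events_copy.avro")
val elapsedSec = (System.nanoTime() - start) / 1e9
println(f"Round trip took $elapsedSec%.1f s")
```

Since the write triggers the full read, this measures the end-to-end read + write path that regressed between the two versions.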
To give you more info on the files we are loading: there are ~2,500,000 entries, and the number of struct elements in the array can be quite high:
Could you look into this? If you need any additional information feel free to ask!