dfdx / Spark.jl

Julia binding for Apache Spark

Fix for large RDD #43

Closed · aviks closed this 7 years ago

aviks commented 7 years ago

When an RDD is reduced, the reduce is first run on each partition, and then one row per partition is brought to the master and locally reduced. This last step requires the rows to be serialised from Scala to Julia, and that serialisation copies all the rows into a single Java byte array.
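For illustration, here is a minimal Scala sketch of that single-buffer pattern; the names `serializeAll` and `partitionResults` are hypothetical, not the actual Spark.jl code:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Illustrative only: every per-partition row is written into ONE buffer,
// so the combined payload is capped by the maximum Java array size.
def serializeAll(partitionResults: Seq[AnyRef]): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  partitionResults.foreach(oos.writeObject) // all rows share one buffer
  oos.close()
  bos.toByteArray // the underlying buffer cannot grow past Int.MaxValue bytes
}
```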

All Java arrays are indexed by 32-bit integers, so a Java byte array can hold at most 2GB. Hence, whenever the size of a single reduced RDD row multiplied by the number of partitions exceeds 2GB, the serialisation, and therefore the collect, fails.
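To make the bound concrete, a back-of-the-envelope check; the workload figures here are made up purely for illustration:

```scala
// Java/Scala arrays are Int-indexed, so a byte array tops out at
// Int.MaxValue = 2^31 - 1 = 2147483647 elements, just under 2 GiB.
val limit: Long = Int.MaxValue.toLong

// Hypothetical workload: one 64 MiB reduced row from each of 40 partitions.
val rowBytes      = 64L * 1024 * 1024
val numPartitions = 40L
val totalBytes    = rowBytes * numPartitions // 2684354560 bytes ≈ 2.5 GiB

assert(totalBytes > limit) // the single combined buffer cannot be allocated
```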

This PR changes the code to serialise the RDD rows one at a time, via a custom Iterator (sketched below). Since reduce only needs to operate on two rows at a time, this works well, and it brings a minor performance improvement too.
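A minimal sketch of the per-row approach, assuming the rows arrive as an iterator of JVM objects; the class name and types are illustrative, not the exact Spark.jl implementation:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Each row gets its own byte array, so no single buffer ever grows
// beyond the size of one serialised row.
class RowSerializingIterator(rows: Iterator[AnyRef]) extends Iterator[Array[Byte]] {
  override def hasNext: Boolean = rows.hasNext

  override def next(): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(rows.next()) // serialise exactly one row
    oos.close()
    bos.toByteArray // bounded by one row's size, not the sum over partitions
  }
}
```

The consumer of this iterator then only ever holds a couple of per-row buffers at a time, instead of one buffer containing every partition's result.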

This still leaves a fundamental limit of 2GB for each individual row in an RDD. Larger rows will likely break in many places, so that is something we may just have to live with.

dfdx commented 7 years ago

I guess there are many more things that will go wrong with 2GB rows :D Thanks for this improvement!