When an RDD is reduced, the reduce is first run on each partition, and then one row per partition is brought to the master and reduced locally. This last step requires the rows to be serialised from Scala to Julia, and the serialisation copies all the rows into a single Java ByteArray.
All Java arrays are indexed by 32-bit integers, so a Java ByteArray can hold at most 2GB. Hence, whenever the size of a single RDD row multiplied by the number of partitions exceeds 2GB, the serialisation, and therefore the collect, fails.
This PR changes the code to serialise the RDD rows one at a time via a custom Iterator. Since reduce only ever operates on two rows at a time, this works well, and it brings a minor performance improvement as well.
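The idea can be sketched as follows. This is not the PR's actual code; the class and method names are hypothetical, and a toy byte-wise sum stands in for the user's reduce function. The point is that the rows are pulled from an iterator and folded pairwise, so at most two serialised rows are held at once instead of one giant ByteArray.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.BinaryOperator;

public class RowStreamReduce {
    // Hypothetical reducer: instead of copying every serialised row into one
    // int-indexed byte array (capped at 2GB), pull rows from an iterator one
    // at a time and fold them pairwise.
    static byte[] reduce(Iterator<byte[]> rows, BinaryOperator<byte[]> op) {
        byte[] acc = rows.next();             // first serialised row
        while (rows.hasNext()) {
            acc = op.apply(acc, rows.next()); // only two rows in memory at once
        }
        return acc;
    }

    public static void main(String[] args) {
        // Toy rows: element-wise byte sum stands in for the user's reduce fn.
        List<byte[]> rows = Arrays.asList(
                new byte[]{1, 2}, new byte[]{3, 4}, new byte[]{5, 6});
        byte[] result = reduce(rows.iterator(), (a, b) -> {
            byte[] out = new byte[a.length];
            for (int i = 0; i < a.length; i++) out[i] = (byte) (a[i] + b[i]);
            return out;
        });
        System.out.println(Arrays.toString(result)); // [9, 12]
    }
}
```

Because each row is serialised and consumed independently, the total collected size is no longer bounded by the 2GB array limit; only an individual row still is.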
This still leaves a fundamental 2GB limit on each individual row in an RDD. Larger rows will likely break in many places, so that is something we may just have to live with.