Open kmader opened 8 years ago
@kmader thanks for the interest in this!
We thought about trying something similar a while back. Recently though, we have been leaning more toward removing the local version of the BoltArray
entirely and making the project completely about a distributed ndarray. The thought was that this might be best in terms of modularity -- then a higher level project could wrap the NumPy array, Bolt array, and maybe even the Dask array.
Would be interested to hear your thoughts on this.
@jwittenbach That seems like it might be a good direction to go in. I started on a localspark
implementation that simply faked the Spark backend so commands like chunk
and unchunk
work the same with or without pyspark (https://github.com/bolt-project/bolt/pull/101/files)
Another approach would be to have a generic ParallelBackend for BoltArrays that could be implemented in Spark, Flink, Multiprocessing, etc and allow the operations to all be done the same way.
For testing and developing functions it would be nice if Local and Spark Arrays had identical functions. This might be easiest by a faking SparkContext and RDD and using BoltArraySpark for everything.