Closed PengNi closed 6 years ago
WAY 1: maybe try using SequenceFile to save each dataset in an HDF5 file.
WAY 2: building a Network File System (NFS) is another option.
WAY 3: check sc.parallelize() in Spark's source code and run some tests.
Related work:
The best way we have found so far: first save the HDF5 files in HDFS, then generate an RDD of the HDF5 file names/paths, and finally use the "hdfs dfs -get" command on each worker to copy the HDF5 files from HDFS to its local file system.
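The per-worker fetch step above can be sketched as follows. This is an illustrative outline, not code from this project: names like `fetch_partition` are hypothetical, and it assumes the `hdfs` CLI is on each worker's PATH and that something like h5py is available locally to open the fetched files afterwards.

```python
import os
import subprocess

def hdfs_get_cmd(hdfs_path, local_dir):
    """Build the `hdfs dfs -get` command that copies one HDF5 file
    from HDFS to a worker's local file system."""
    local_path = os.path.join(local_dir, os.path.basename(hdfs_path))
    return ["hdfs", "dfs", "-get", hdfs_path, local_path], local_path

def fetch_partition(paths, local_dir="/tmp"):
    """Intended to run on each worker via mapPartitions: pull every
    HDF5 file in the partition to local disk, yield the local paths."""
    for p in paths:
        cmd, local_path = hdfs_get_cmd(p, local_dir)
        subprocess.check_call(cmd)  # shells out to the hdfs CLI
        yield local_path

# Driver side (requires an active SparkContext `sc`):
# paths = sc.parallelize(["/data/a.h5", "/data/b.h5"], numSlices=2)
# local_paths = paths.mapPartitions(fetch_partition).collect()
```

Each worker then opens its local copies with a normal HDF5 reader, which avoids teaching Spark (or HDF5) anything about the other's storage format.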
Solution 1: read the HDF5 file into the master node's memory, then use sc.parallelize to distribute it to the workers. However, this consumes a huge amount of memory on the master node.
Solution 2: use Spark to transform the HDF5 file into a text file and load that into HDFS. But this is too slow.
Solution 3: write an interface in Spark's source code for accessing HDF5 files in HDFS.
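Solution 1 above can be sketched like this. The names are hypothetical and the Spark/h5py calls are shown only as comments; the point is that the entire dataset must sit in driver memory before sc.parallelize distributes it, which is exactly why this approach strains the master node.

```python
def slice_rows(n_rows, n_slices):
    """Split row indices into near-equal (start, stop) ranges,
    one per intended Spark partition."""
    base, extra = divmod(n_rows, n_slices)
    bounds, start = [], 0
    for i in range(n_slices):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

# Driver side (requires pyspark and h5py; illustrative only):
# import h5py
# with h5py.File("data.h5", "r") as f:
#     rows = f["dataset"][:]            # entire dataset in driver RAM
# chunks = [rows[a:b] for a, b in slice_rows(len(rows), 8)]
# rdd = sc.parallelize(chunks, numSlices=8)
```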
How to solve it?
@litao-csu , @akakZhang, please keep this issue in mind.