PengNi / basemods_spark


reading h5 file to RDD in spark #1

Closed PengNi closed 6 years ago

PengNi commented 6 years ago

solution 1: read the h5 file into the master node's memory, then use sc.parallelize to distribute the data to each worker. But this solution uses a huge amount of memory on the master node.
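A minimal sketch of solution 1, assuming PySpark with h5py available on the master node (the file name "reads.h5" and the dataset name "dataset" are hypothetical):

```python
import h5py
from pyspark import SparkContext

sc = SparkContext(appName="h5-to-rdd")

# Load the whole dataset into the master node's memory --
# this is exactly the memory bottleneck described above.
with h5py.File("reads.h5", "r") as f:
    records = f["dataset"][:].tolist()

# Distribute the in-memory records to the workers.
rdd = sc.parallelize(records)
print(rdd.count())
```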

solution 2: use Spark to transform the h5 file into a text file, then put it into HDFS. But it is too slow.
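A minimal sketch of solution 2, again with hypothetical file, dataset, and HDFS path names, assuming a 2-D dataset whose rows become tab-separated text lines:

```python
import h5py
from pyspark import SparkContext

sc = SparkContext(appName="h5-to-text")

# Convert each row of the dataset to one tab-separated line of text.
with h5py.File("reads.h5", "r") as f:
    lines = ["\t".join(map(str, row)) for row in f["dataset"][:]]

# Write the text representation into HDFS.
sc.parallelize(lines).saveAsTextFile("hdfs:///data/reads_text")
```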

solution 3: write an interface in Spark's source code for accessing h5 files in HDFS.

How should we solve it?

@litao-csu , @akakZhang, please keep this issue in mind.

akakZhang commented 6 years ago

WAY 1: Maybe we should try using a SequenceFile to save each dataset in an h5 file.
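A minimal sketch of WAY 1, assuming PySpark; it stores each dataset of an h5 file as a (name, raw bytes) pair in a Hadoop SequenceFile (the input file name and the HDFS output path are hypothetical):

```python
import h5py
from pyspark import SparkContext

sc = SparkContext(appName="h5-to-sequencefile")

pairs = []
with h5py.File("reads.h5", "r") as f:
    def collect(name, obj):
        # Keep only datasets (skip groups); take the raw bytes of each one.
        if isinstance(obj, h5py.Dataset):
            pairs.append((name, bytearray(obj[:].tobytes())))
    f.visititems(collect)

# (str, bytearray) pairs map to (Text, BytesWritable) in the SequenceFile.
sc.parallelize(pairs).saveAsSequenceFile("hdfs:///data/reads_seq")
```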

PengNi commented 6 years ago

WAY 2: building a Network File System shared by all nodes, so every worker can read the h5 files directly, is another option.

PengNi commented 6 years ago

WAY 3: check sc.parallelize() in Spark's source code and do some tests.
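For example, a quick test of how sc.parallelize() splits data across partitions (a generic sketch, nothing repo-specific):

```python
from pyspark import SparkContext

sc = SparkContext(appName="parallelize-test")

# Ask for 8 partitions explicitly and inspect how the data is split.
rdd = sc.parallelize(range(1000), numSlices=8)
print(rdd.getNumPartitions())         # 8
print(rdd.glom().map(len).collect())  # number of elements per partition
```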

PengNi commented 6 years ago

This is the question on Stack Overflow.

PengNi commented 6 years ago

Related Works:

  1. valiantljk: h5spark
  2. LLNL: Spark-HDF5
  3. NASA: SciSpark

PengNi commented 6 years ago

The best way we are using right now: first save the HDF5 files in HDFS, then generate an RDD of the HDF5 file names/paths, and finally use the "hdfs dfs -get" command on each worker to fetch the HDF5 files from HDFS to the local file system.
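A minimal sketch of this approach, assuming PySpark and that every worker has the hdfs client on its PATH (all paths below are hypothetical):

```python
import subprocess
from pyspark import SparkContext

sc = SparkContext(appName="hdf5-fetch")

hdf5_paths = [
    "hdfs:///data/movie1.bax.h5",
    "hdfs:///data/movie2.bax.h5",
]

def fetch_and_process(hdfs_path):
    # Pull the HDF5 file from HDFS to the worker's local file system;
    # the real per-file work (e.g. opening it with h5py) would follow here.
    local_path = "/tmp/" + hdfs_path.rsplit("/", 1)[-1]
    subprocess.check_call(["hdfs", "dfs", "-get", hdfs_path, local_path])
    return local_path

# Distribute only the file names, one task per file;
# each worker fetches and processes its own copy locally.
local_copies = sc.parallelize(hdf5_paths, len(hdf5_paths)).map(fetch_and_process).collect()
print(local_copies)
```

Since only file names go through sc.parallelize, the master's memory footprint stays small regardless of how big the HDF5 files are.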