astrolabsoftware / spark-fits

FITS data source for Spark SQL and DataFrames
https://astrolabsoftware.github.io/spark-fits/
Apache License 2.0

Low throughput #10

Open JulienPeloton opened 6 years ago

JulienPeloton commented 6 years ago

The current throughput is around 5-10 MB/s to load FITS data and convert it to a DataFrame. The decoding library needs to be improved...

JulienPeloton commented 6 years ago

A bit better after https://github.com/JulienPeloton/spark-fits/pull/11. Now around 15 MB/s.

JulienPeloton commented 6 years ago

Current I/O benchmark (`spark.readfits.load.count`) on 110 GB (first iteration), with 128 MB partitions:

| Configuration | Median task duration (s) | GC (s) | Comments |
|---|---|---|---|
| no reading / no decoding | 0.5 | 0.02 | Spark/Hadoop overhead |
| reading / no decoding | 8 | 0.04 | Overhead + I/O |
| reading / decoding | 10 | 1 | Overhead + I/O + Scala FITSIO |
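
For reference, a minimal sketch of what the load-and-count benchmark looks like. The data source format name, the `hdu` option, and the file path below are assumptions for illustration and may differ from the exact code used for these numbers:

```scala
import org.apache.spark.sql.SparkSession

// Minimal load-and-count benchmark sketch (names and options are illustrative).
object ReadFitsBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-fits I/O benchmark")
      .getOrCreate()

    // Assumed reader call matching the "spark.readfits.load.count" benchmark name:
    // "fits" as the data source format and an "hdu" option selecting the HDU to read.
    val df = spark.read
      .format("fits")
      .option("hdu", 1)
      .load("hdfs:///path/to/data/*.fits")   // ~110 GB of FITS files in this test

    // Time only the action: read + decode + count over all partitions.
    val t0 = System.nanoTime()
    val n = df.count()
    val t1 = System.nanoTime()
    println(s"Rows: $n, elapsed: ${(t1 - t0) / 1e9} s")
  }
}
```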

Most of the time (>60%) is spent reading from disk, i.e. in the call to f.readFully(). Could this be better? Note that the I/O figure also includes the effect of data locality: it is not only reading the file from the local DataNode, but also transferring data from remote DataNodes.
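
For context, a minimal sketch of the read pattern being timed, assuming the Hadoop FileSystem API. The path, offsets, and split size are placeholders, not the actual spark-fits code:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Read one 128 MB split worth of bytes with readFully
// (placeholder path and offsets; the real code computes these per split).
val conf = new Configuration()
val fs = FileSystem.get(new URI("hdfs:///"), conf)
val f = fs.open(new Path("hdfs:///path/to/file.fits"))

val splitStart: Long = 0L                  // byte offset of this split
val splitSize: Int = 128 * 1024 * 1024
val buffer = new Array[Byte](splitSize)

// Blocking read of the whole split; if the block lives on a remote DataNode,
// this includes the network transfer, not just local disk I/O.
f.seek(splitStart)
f.readFully(buffer, 0, splitSize)
f.close()
```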

Decoding accounts for about 30% of the total time, with a large GC contribution.
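
To illustrate where the GC pressure likely comes from, here is a hedged sketch of decoding one binary-table record into a Spark Row. The schema (one 8-byte integer and one 8-byte float per record) and the helper names are hypothetical, not the actual FITSIO decoder used by spark-fits:

```scala
import java.nio.ByteBuffer
import org.apache.spark.sql.Row

// Hypothetical record layout: one 8-byte integer followed by one 8-byte float,
// stored big-endian as in FITS binary tables.
def decodeRecord(record: Array[Byte]): Row = {
  val buf = ByteBuffer.wrap(record)   // ByteBuffer defaults to big-endian
  val id = buf.getLong()
  val flux = buf.getDouble()
  // Each decoded record allocates a Row plus boxed fields,
  // which is one plausible source of the large GC time observed above.
  Row(id, flux)
}

// Usage sketch: decode a buffer holding n contiguous 16-byte records.
def decodePartition(bytes: Array[Byte], recordSize: Int = 16): Seq[Row] =
  bytes.grouped(recordSize).map(decodeRecord).toSeq
```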