astrolabsoftware / spark-fits

FITS data source for Spark SQL and DataFrames
https://astrolabsoftware.github.io/spark-fits/
Apache License 2.0

Low throughput #10

Open JulienPeloton opened 6 years ago

JulienPeloton commented 6 years ago

The current throughput is around 5-10 MB/s to load FITS data and convert it to a DataFrame. The decoding library needs to be improved...

JulienPeloton commented 6 years ago

A bit better after https://github.com/JulienPeloton/spark-fits/pull/11. Now around 15 MB/s.

JulienPeloton commented 6 years ago

Current I/O benchmark (`spark.readfits.load.count`) on 110 GB (first iteration), with 128 MB partitions:

| Configuration | Median task duration (s) | GC (s) | Comments |
|---|---|---|---|
| no reading / no decoding | 0.5 | 0.02 | Spark/Hadoop overhead |
| reading / no decoding | 8 | 0.04 | Overhead + I/O |
| reading / decoding | 10 | 1 | Overhead + I/O + Scala FITSIO |
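
For reference, a minimal sketch of what the load-and-count benchmark looks like. The data source format name, the `hdu` option, and the file path below are assumptions for illustration and may differ from the exact code used for these numbers:

```scala
import org.apache.spark.sql.SparkSession

// Minimal load-and-count benchmark sketch (names and options are illustrative).
object ReadFitsBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-fits I/O benchmark")
      .getOrCreate()

    // Assumed reader call matching the "spark.readfits.load.count" benchmark name:
    // "fits" as the data source format and an "hdu" option selecting the HDU to read.
    val df = spark.read
      .format("fits")
      .option("hdu", 1)
      .load("hdfs:///path/to/data/*.fits")   // ~110 GB of FITS files in this test

    // Time only the action: read + decode + count over all partitions.
    val t0 = System.nanoTime()
    val n = df.count()
    val t1 = System.nanoTime()
    println(s"Rows: $n, elapsed: ${(t1 - t0) / 1e9} s")
  }
}
```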

Most of the time (>60%) is spent reading from disk, i.e. in the call to f.readFully(). Could this be better? Note that the I/O figure also includes the effect of data locality: it is not only reading the file from the local DataNode, but also transferring data from remote DataNodes.
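
For context, a minimal sketch of the read pattern being timed, assuming the Hadoop FileSystem API. The path, offsets, and split size are placeholders, not the actual spark-fits code:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Read one 128 MB split worth of bytes with readFully
// (placeholder path and offsets; the real code computes these per split).
val conf = new Configuration()
val fs = FileSystem.get(new URI("hdfs:///"), conf)
val f = fs.open(new Path("hdfs:///path/to/file.fits"))

val splitStart: Long = 0L                  // byte offset of this split
val splitSize: Int = 128 * 1024 * 1024
val buffer = new Array[Byte](splitSize)

// Blocking read of the whole split; if the block lives on a remote DataNode,
// this includes the network transfer, not just local disk I/O.
f.seek(splitStart)
f.readFully(buffer, 0, splitSize)
f.close()
```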

Decoding accounts for about 30% of the total time, with a large GC contribution.
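
To illustrate where the GC pressure likely comes from, here is a hedged sketch of decoding one binary-table record into a Spark Row. The schema (one 8-byte integer and one 8-byte float per record) and the helper names are hypothetical, not the actual FITSIO decoder used by spark-fits:

```scala
import java.nio.ByteBuffer
import org.apache.spark.sql.Row

// Hypothetical record layout: one 8-byte integer followed by one 8-byte float,
// stored big-endian as in FITS binary tables.
def decodeRecord(record: Array[Byte]): Row = {
  val buf = ByteBuffer.wrap(record)   // ByteBuffer defaults to big-endian
  val id = buf.getLong()
  val flux = buf.getDouble()
  // Each decoded record allocates a Row plus boxed fields,
  // which is one plausible source of the large GC time observed above.
  Row(id, flux)
}

// Usage sketch: decode a buffer holding n contiguous 16-byte records.
def decodePartition(bytes: Array[Byte], recordSize: Int = 16): Seq[Row] =
  bytes.grouped(recordSize).map(decodeRecord).toSeq
```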