JulienPeloton opened this issue 6 years ago
A bit better after https://github.com/JulienPeloton/spark-fits/pull/11. Now around 15 MB/s.
Current I/O benchmark (`spark.readfits.load` followed by `.count`) on 110 GB (first iteration), with 128 MB partitions:
Mode | Median task duration (s) [GC] | Comments
---|---|---
no reading/no decoding | 0.5 [0.02] | Spark/Hadoop overhead
reading/no decoding | 8 [0.04] | Overhead + I/O
reading/decoding | 10 [1] | Overhead + I/O + Scala FITSIO
Most of the time (>60%) is spent reading from disk, i.e. the time spent in `f.readFully()`. Could this be better?
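As a rough way to check raw `readFully` throughput outside of Spark, one can time sequential reads over a local file. This is a standalone sketch (plain `java.io`, no HDFS or Spark); the file size, chunk size, and the helper name `timeReadFully` are arbitrary choices for illustration, not values from the benchmark:

```scala
import java.io.RandomAccessFile
import java.nio.file.Files

// Time sequential readFully() over a throwaway local file.
// Returns (bytes read, throughput in MB/s).
def timeReadFully(sizeBytes: Int): (Long, Double) = {
  // Create a temporary file filled with zeros.
  val path = Files.createTempFile("readfully-bench", ".bin")
  Files.write(path, new Array[Byte](sizeBytes))

  val f = new RandomAccessFile(path.toFile, "r")
  val buf = new Array[Byte](128 * 1024)  // 128 KB chunks (arbitrary)
  val t0 = System.nanoTime()
  var read = 0L
  while (read < sizeBytes) {
    val n = math.min(buf.length.toLong, sizeBytes - read).toInt
    f.readFully(buf, 0, n)
    read += n
  }
  val elapsedSec = (System.nanoTime() - t0) / 1e9
  f.close()
  Files.delete(path)
  (read, read / 1e6 / elapsedSec)
}
```

Note that a local-disk number is only an upper bound here: in the cluster, `readFully` goes through the HDFS client, so remote blocks add network transfer on top.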
Note that I/O also includes the effect of data locality -- it is not only reading the file from the local DataNode, but also transferring data from remote DataNodes.
Decoding is 30% of the total (with large GC time).
The current throughput is around 5-10 MB/s to load FITS data and convert it to a DataFrame. The decoding library needs to be improved...
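The per-task numbers can be sanity-checked against the table above: with 128 MB partitions, per-task throughput is just partition size over median task duration. A back-of-envelope sketch (mode names and durations are taken from the table; end-to-end throughput will be lower than these per-task figures, presumably due to scheduling and locality effects):

```scala
// Back-of-envelope per-task throughput from the benchmark table:
// 128 MB partitions divided by the median task duration.
val partitionMB = 128.0
val medianTaskSec = Map(
  "reading/no decoding" -> 8.0,  // overhead + I/O
  "reading/decoding"    -> 10.0  // overhead + I/O + decoding
)
val throughputMBs = medianTaskSec.map { case (mode, s) => mode -> partitionMB / s }
// reading/no decoding -> 16.0 MB/s per task
// reading/decoding    -> 12.8 MB/s per task
```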