astrolabsoftware / spark3D

Spark extension for processing large-scale 3D data sets: Astrophysics, High Energy Physics, Meteorology, …
https://astrolabsoftware.github.io/spark3D/
Apache License 2.0

Tools to perform live 3D RDD visualisation #93

Closed JulienPeloton closed 5 years ago

JulienPeloton commented 5 years ago

This PR introduces tools to quickly visually inspect RDD.

When it comes to visualisation in Apache Spark, the main problem is obviously the size of the data set, which makes visualisation challenging.

There are several tools on the market to deal with very large data sets, but most of them are either not open source, only read their data from files on disk, or are too complex to get something simple done in 2 seconds (as you would like when prototyping). If you know of a great one that I might have missed, let me know!

I really wanted a simple tool for live RDD visualisation (which basically translates to "hey, what does the data in this RDD look like?", no matter its size).

To make this possible, we introduce collapse functions. A collapse function acts on the data set to produce a compact representation of it. It is applied at the level of each partition, reducing the size of the data set while keeping the desired features. Obviously, this means the data in your RDD must have some kind of ordering for the procedure to be meaningful (in other words, the RDD must have been repartitioned in some way).
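To illustrate the idea, here is a minimal sketch of the concept (not the actual spark3D implementation): a collapse function computing the centroid of each partition, assuming an RDD whose elements are (x, y, z) tuples.

import numpy as np

def centroid_of_partition(iterator):
    # Collapse one partition into a single representative point:
    # the mean of its coordinates, together with the partition size.
    points = np.array(list(iterator))
    if points.size == 0:
        return []
    return [(points.mean(axis=0), len(points))]

# Applied per partition, the output has (at most) one element per partition:
# collapsed = coord_rdd.mapPartitions(centroid_of_partition).collect()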

Let's take a concrete example. I have an RDD of several million 3D points loaded from a FITS file:

from pyspark3d.spatial3DRDD import Point3DRDD

# Load the 3D points (columns x, y, z) from the given HDU of the FITS file
p3d = Point3DRDD(spark, fn, "x,y,z", False, "fits", {"hdu": hdu})
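For completeness, the variables spark, fn and hdu above are assumed to be defined beforehand; the values below are purely hypothetical placeholders, not part of the original example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
fn = "/path/to/points.fits"  # hypothetical path to the FITS file
hdu = 1                      # hypothetical HDU index containing the x, y, z columns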

The data is a priori randomly distributed (spatially), so I need to re-partition it:

from pyspark3d.converters import toCoordRDD

# Perform the re-partitioning, and convert to Python RDD
crdd = toCoordRDD(p3d, gridtype, npart).cache()
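Here gridtype and npart are the name of the partitioning grid and the target number of partitions. As a purely hypothetical example (double-check the exact grid names supported by spark3D in its documentation):

gridtype = "octree"  # assumed grid type name, see the spark3D documentation
npart = 100          # target number of partitions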

Note that the re-partitioning is done in Scala under the hood, but at the end we transfer back a Python RDD whose elements are the coordinates of the 3D points (all other metadata is dropped). Let's now reduce the size of the data set by collapsing each partition into its centroid:

from pyspark3d.visualisation import CollapseFunctions
from pyspark3d.visualisation import collapse_rdd_data

# Collapse the data using a simple mean of each partition (centroid)
cf = CollapseFunctions()
data = collapse_rdd_data(crdd, cf.mean).collect()
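To actually look at the result, a minimal matplotlib sketch could be the following. It assumes each collected element is a pair (centroid coordinates, number of points in the partition); the exact output format of collapse_rdd_data may differ, so adapt accordingly.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection)

# Assumed format: data = [((x, y, z), npoints), ...], one entry per partition
centroids = np.array([np.asarray(d[0]) for d in data])
weights = np.array([d[1] for d in data], dtype=float)

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
# Marker size proportional to the number of points collapsed in each partition
ax.scatter(centroids[:, 0], centroids[:, 1], centroids[:, 2], s=50 * weights / weights.max())
plt.show()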

And finally here is what I get!

[Image: test_collapse_function_mean]