Re-implementation of kd-tree cross-matching

hombit commented 1 year ago

Problem The current KdTreeCrossmatch implementation uses gnomonic projection, which could be inaccurate for large healpix tiles, especially in the polar regions.

Proposed solution I propose to use 3-D tree based on xyz coordinates on the unit sphere. It would:

Give the exact matches, because minimum 3-D distance corresponds to minimum great circle distance (minimum chord = minimum arc)
Distances given by the kd-tree could be directly used for the search radius filtering, we just need to convert search radius to the corresponding chord length

Performance improvement opportunities The direct implementation of the proposed solution could have worse performance, because both kd-tree building and looking-up algorithms scale as ∼k. However, the proposed solution could apply few tricks to make it more performant:

We can drop almost all the math associated with _filter_indexes_to_threshold and _great_circle_dist. The replacement of frame_gnomonic with spherical-to-euclidian coordinate transformation should have no-to-little performance impact.
Practically, most of the cross-matches would have small radius and average number of matches to be less than n_neighbors (even if it is 1). That's why I believe that it would be beneficial to swap the query operations: first query for radius, and then filter by the first n_neighbors. With the current kd-tree realization this order would lead to even less accurate results.
We can set the origin of the xyz coordinate system near the tile (for example rotating spherical system to put the tile to the north pole, and shifting the origin to x=0, y=0, z=1). It would allow us to use float32 dtype for the xyz positions and KDTree which would reduce memory usage

maxwest-uw commented 1 year ago

since there are some trade-offs in speed and memory vs. accuracy, maybe it would be better to add a new cross-match routine and have the unit-sphere calculations be their own separate version, so that the user can weigh their computational needs themselves?

hombit commented 1 year ago

I generally agree that having a choice is good. However I have many concerns about having two close but a bit different implementations: 1) Right now It is not clear to me if 3d-tree implementation would be (much) slower 2) Supporting two things is harder than supporting one thing 3) The most important one, I don't think that it is a good idea to give an inaccurate solution, which could lead to the results different from other implementations (including algorithms implemented not in LSDB). From the user's perspective it is not even clear what accuracy this solution has: large healpix tiles could lead to errors much beyond the single-float point accuracy. So even if we tell that this algorithm is faster, but less accurate, how do we make the user's choice to be informed about the level of accuracy?

maxwest-uw commented 1 year ago

agreed, especially point 3. In general for these types of data products (one-off generations) it's probably wise to go for the side of accuracy. maybe we can investigate the extent of the differences/inaccuracies at the poles before we rewrite everything (unless you think it would be a relatively quick change).

astronomy-commons / lsdb

Re-implementation of kd-tree cross-matching #41