Closed hombit closed 11 months ago
since there are some trade-offs in speed and memory vs. accuracy, maybe it would be better to add a new cross-match routine and have the unit-sphere calculations be their own separate version, so that the user can weigh their computational needs themselves?
I generally agree that having a choice is good. However I have many concerns about having two close but a bit different implementations: 1) Right now It is not clear to me if 3d-tree implementation would be (much) slower 2) Supporting two things is harder than supporting one thing 3) The most important one, I don't think that it is a good idea to give an inaccurate solution, which could lead to the results different from other implementations (including algorithms implemented not in LSDB). From the user's perspective it is not even clear what accuracy this solution has: large healpix tiles could lead to errors much beyond the single-float point accuracy. So even if we tell that this algorithm is faster, but less accurate, how do we make the user's choice to be informed about the level of accuracy?
agreed, especially point 3. In general for these types of data products (one-off generations) it's probably wise to go for the side of accuracy. maybe we can investigate the extent of the differences/inaccuracies at the poles before we rewrite everything (unless you think it would be a relatively quick change).
Problem The current
KdTreeCrossmatch
implementation uses gnomonic projection, which could be inaccurate for large healpix tiles, especially in the polar regions.Proposed solution I propose to use 3-D tree based on xyz coordinates on the unit sphere. It would:
Performance improvement opportunities The direct implementation of the proposed solution could have worse performance, because both kd-tree building and looking-up algorithms scale as ∼k. However, the proposed solution could apply few tricks to make it more performant:
_filter_indexes_to_threshold
and_great_circle_dist
. The replacement offrame_gnomonic
with spherical-to-euclidian coordinate transformation should have no-to-little performance impact.n_neighbors
(even if it is 1). That's why I believe that it would be beneficial to swap the query operations: first query for radius, and then filter by the firstn_neighbors
. With the current kd-tree realization this order would lead to even less accurate results.KDTree
which would reduce memory usage