astronomy-commons / lsdb

Large Survey DataBase
https://lsdb.io
BSD 3-Clause "New" or "Revised" License
19 stars 5 forks source link

Re-implementation of kd-tree cross-matching #41

Closed hombit closed 11 months ago

hombit commented 1 year ago

Problem The current KdTreeCrossmatch implementation uses gnomonic projection, which could be inaccurate for large healpix tiles, especially in the polar regions.

Proposed solution I propose to use 3-D tree based on xyz coordinates on the unit sphere. It would:

  1. Give the exact matches, because minimum 3-D distance corresponds to minimum great circle distance (minimum chord = minimum arc)
  2. Distances given by the kd-tree could be directly used for the search radius filtering, we just need to convert search radius to the corresponding chord length

Performance improvement opportunities The direct implementation of the proposed solution could have worse performance, because both kd-tree building and looking-up algorithms scale as ∼k. However, the proposed solution could apply few tricks to make it more performant:

maxwest-uw commented 1 year ago

since there are some trade-offs in speed and memory vs. accuracy, maybe it would be better to add a new cross-match routine and have the unit-sphere calculations be their own separate version, so that the user can weigh their computational needs themselves?

hombit commented 1 year ago

I generally agree that having a choice is good. However I have many concerns about having two close but a bit different implementations: 1) Right now It is not clear to me if 3d-tree implementation would be (much) slower 2) Supporting two things is harder than supporting one thing 3) The most important one, I don't think that it is a good idea to give an inaccurate solution, which could lead to the results different from other implementations (including algorithms implemented not in LSDB). From the user's perspective it is not even clear what accuracy this solution has: large healpix tiles could lead to errors much beyond the single-float point accuracy. So even if we tell that this algorithm is faster, but less accurate, how do we make the user's choice to be informed about the level of accuracy?

maxwest-uw commented 1 year ago

agreed, especially point 3. In general for these types of data products (one-off generations) it's probably wise to go for the side of accuracy. maybe we can investigate the extent of the differences/inaccuracies at the poles before we rewrite everything (unless you think it would be a relatively quick change).