Repartitioning object catalog after spatial filtering to improve speed

nevencaplar commented 9 months ago

When selecting only a spatial subset of the object table and then requesting the matching data from the source table, we currently pull all the data from every source partition that corresponds to a selected object partition. Because the object table partitioning is much coarser, we end up pulling data from many source partitions that do not necessarily fall within the original spatial query.

The idea is to avoid loading source-table partitions that the spatial query does not touch. Once the spatial filter has been applied to the object table, we could repartition the resulting object table to a higher order.
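To make the over-read concrete, here is a minimal sketch in plain Python. It assumes nested HEALPix numbering (a pixel p at order k has children 4p to 4p+3 at order k+1) and, as a simplification, that every source partition sits at a single order. The orders and pixel numbers are made up for illustration, and none of the function names are LSDB API.

```python
def descendants(pixel, order, target_order):
    """Nested-scheme pixels at target_order contained in (order, pixel)."""
    shift = 2 * (target_order - order)
    return range(pixel << shift, (pixel + 1) << shift)


def ancestor(pixel, order, target_order):
    """Nested-scheme pixel at the coarser target_order that contains (order, pixel)."""
    return pixel >> (2 * (order - target_order))


OBJECT_ORDER = 2  # hypothetical coarse object partitioning
SOURCE_ORDER = 5  # hypothetical finer source partitioning

# Hypothetical order-5 pixels that still hold objects after the spatial filter.
filtered_object_pixels = [6400, 6401, 6417]

# Object partitions that survive the filter at the original coarse order.
coarse_partitions = {
    ancestor(p, SOURCE_ORDER, OBJECT_ORDER) for p in filtered_object_pixels
}

# Without repartitioning: every source partition under a surviving object partition.
loaded_without_repartition = {
    s
    for c in coarse_partitions
    for s in descendants(c, OBJECT_ORDER, SOURCE_ORDER)
}

# After repartitioning the filtered objects to the source order.
loaded_with_repartition = set(filtered_object_pixels)

print(len(loaded_without_repartition), len(loaded_with_repartition))
```

In this toy case all three filtered pixels share one order-2 parent, so the coarse join reads every order-5 source partition under that parent, while the repartitioned join reads only the three that contain objects.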

delucchi-cmu commented 9 months ago

Generally, if we're re-partitioning a data set that has been filtered, the resulting partitions will be even coarser, and I think that re-partitioning after such a filter would only make the extra-source-partition-reading problem worse.

Also, this sounds like functionality described in https://github.com/astronomy-commons/lsdb/issues/104, in that you want to perform a re-partition on-the-fly after the spatial filtering. I'm not sure we'd want to always force a re-partitioning after a spatial filter, but could document it as a best practice?

smcguire-cmu commented 8 months ago

Yeah, this is a similar case of on-the-fly repartitioning. But here the spatial filter tells us where all the points are, so forcing a repartition to a higher order might be more efficient for some later operations, such as joining to sources, since at lower order most of each pixel will be empty space.
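As a rough illustration of the "empty space" point, the sketch below estimates what fraction of a single HEALPix pixel a cone search would cover at different orders. It relies only on the fact that HEALPix divides the sphere into 12 * 4^order equal-area pixels. The 1-degree radius and the two orders are arbitrary choices, and it assumes the cone falls inside one pixel, which real queries need not do.

```python
import math

# Area of the full sky in square degrees: 4*pi steradians.
SQ_DEG_PER_STERADIAN = (180.0 / math.pi) ** 2
FULL_SKY_SQ_DEG = 4.0 * math.pi * SQ_DEG_PER_STERADIAN


def pixel_area_sq_deg(order):
    """Area of one HEALPix pixel; there are 12 * 4**order equal-area pixels."""
    return FULL_SKY_SQ_DEG / (12 * 4 ** order)


def cone_area_sq_deg(radius_deg):
    """Area of a spherical cap with the given angular radius."""
    radius_rad = math.radians(radius_deg)
    return 2.0 * math.pi * (1.0 - math.cos(radius_rad)) * SQ_DEG_PER_STERADIAN


def fill_fraction(radius_deg, order):
    """Fraction of one pixel covered by the cone, capped at 1."""
    return min(1.0, cone_area_sq_deg(radius_deg) / pixel_area_sq_deg(order))


low_order_fill = fill_fraction(1.0, 2)
high_order_fill = fill_fraction(1.0, 5)
print(f"order 2: {low_order_fill:.3f}, order 5: {high_order_fill:.3f}")
```

Under these assumptions the filtered data occupies only a small percentage of an order-2 pixel but most of an order-5 pixel, which is the intuition behind repartitioning to a higher order before a join.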

delucchi-cmu commented 8 months ago

I still have complaints.

I don't think we should always re-partition the data after a spatial filter, since we don't know what the user will want (either re-partition to higher order or lower order). I think this should be solved with documentation and tutorials to explain to a user under what conditions they would want to re-partition, and some hints about how the partitions impact performance.
A spatial filter is not the only kind of filter that can reduce the data size to the point where the user would want to re-partition. For example, an index search might return only 100 objects, and you might want just 1 partition at the end of it.
If the intent is to partition so that the object table more closely matches the source table, then I think we should offer additional options for the on-the-fly repartitioning method, like object_table.repartition(partitions=source_table.pixel_list()). In this case, we also don't need to compute the statistics of the partitions, because we already know the partition structure we would like to use.
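The last point, repartitioning onto an already known pixel list without computing statistics, can be sketched as follows. This is not the proposed repartition API or any existing LSDB function: `assign_to_partitions` is a hypothetical helper, partitions are represented as (order, pixel) tuples in nested HEALPix numbering, and each row is assumed to carry its pixel index at some fixed high order.

```python
def assign_to_partitions(row_pixels, row_order, target_partitions):
    """Group row indices by the target partition that contains each row.

    row_pixels: nested-scheme pixel index of each row at row_order.
    target_partitions: iterable of (order, pixel) tuples, assumed non-overlapping.
    Rows that fall outside every target partition are left unassigned.
    """
    targets = set(target_partitions)
    # Only orders at or below the row order can contain a row pixel.
    orders = sorted({order for order, _ in targets if order <= row_order})
    assignment = {}
    for row_index, pixel in enumerate(row_pixels):
        for order in orders:
            # Ancestor of the row's pixel at this coarser order.
            candidate = (order, pixel >> (2 * (row_order - order)))
            if candidate in targets:
                assignment.setdefault(candidate, []).append(row_index)
                break
    return assignment


# Hypothetical source-table layout with mixed orders.
source_pixel_list = [(1, 25), (2, 104), (2, 105)]

# Hypothetical object rows, identified by their order-5 pixel.
object_row_pixels = [6400, 6660, 6720, 0]

example = assign_to_partitions(object_row_pixels, 5, source_pixel_list)
print(example)
```

Because the target layout is given up front, each row needs only a few bit shifts and set lookups. No pass over the data to build a histogram or choose pixel thresholds is required, which matches the point about skipping the partition statistics.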

astronomy-commons / lsdb

Repartitioning object catalog after spatial filtering to improve speed #133