astronomy-commons / lsdb

Large Survey DataBase
https://lsdb.io
BSD 3-Clause "New" or "Revised" License

Repartitioning object catalog after spatial filtering to improve speed #133

Open nevencaplar opened 9 months ago

nevencaplar commented 9 months ago

When selecting only a spatial subset of data from the object table and then asking for data from the source table, we currently pull all the data from the source partitions that correspond to the partitions selected in the object table. Because the object table partitioning is much coarser, we pull data from many source partitions that would not necessarily be in the original spatial query.

The idea is to not have to load the data from the partitions in the source table that are not selected. When the spatial filter is performed on the object table we could repartition the resulting object table to a higher order.
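To make the effect concrete, here is a minimal sketch in plain Python. It does not use the LSDB API; the function names and the chosen orders (objects at order 2, sources at order 6, point positions at order 10) are hypothetical and picked only for illustration. It relies on one property of the HEALPix nested scheme: a pixel at order k covers the 4^d child pixels at order k+d whose indices share its leading bits.

```python
def parent_pixel(pixel, order, target_order):
    """Nested-scheme ancestor of `pixel` (given at `order`) at a coarser `target_order`."""
    return pixel >> (2 * (order - target_order))


def children_pixels(pixel, order, target_order):
    """All nested-scheme descendants of `pixel` (given at `order`) at a finer `target_order`."""
    shift = 2 * (target_order - order)
    start = pixel << shift
    return range(start, start + (1 << shift))


def source_pixels_to_read(object_pixels, point_order, object_order, source_order):
    """Source partitions touched when objects are partitioned at `object_order`.

    Hypothetical helper: every source partition under a selected object
    partition is read, whether or not it contains any filtered object.
    """
    object_partitions = {
        parent_pixel(p, point_order, object_order) for p in object_pixels
    }
    needed = set()
    for partition in object_partitions:
        needed.update(children_pixels(partition, object_order, source_order))
    return needed


# Objects that survived a spatial filter, as order-10 nested pixel indices
# (made-up values for illustration).
filtered = [0, 5, 300, 70000]

# Object table left at its coarse order: all source partitions under
# the selected object partitions are read.
coarse = source_pixels_to_read(filtered, point_order=10, object_order=2, source_order=6)

# Object table repartitioned to the source order after filtering:
# only source partitions that actually contain objects are read.
fine = source_pixels_to_read(filtered, point_order=10, object_order=6, source_order=6)

print(len(coarse), len(fine))
```

In this toy case the coarse partitioning touches hundreds of source partitions while the repartitioned table touches only the few that contain filtered objects. Real catalogs use adaptive, mixed-order partitioning, so the actual savings would depend on the data.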

delucchi-cmu commented 9 months ago

Generally, if we're re-partitioning a data set that has been filtered, the resulting partitions will be even coarser, and I think that re-partitioning after such a filter would only make the extra-source-partition-reading problem worse.

Also, this sounds like functionality described in https://github.com/astronomy-commons/lsdb/issues/104, in that you want to perform a re-partition on-the-fly after the spatial filtering. I'm not sure we'd want to always force a re-partitioning after a spatial filter, but could document it as a best practice?

smcguire-cmu commented 8 months ago

Yeah, this is a similar case of on-the-fly repartitioning. But here the spatial filter tells us where all the points are, so forcing a repartitioning at a higher order might be more efficient for some future operations, such as joining to sources, since at lower order most of each pixel will be empty space.

delucchi-cmu commented 8 months ago

I still have complaints.