Open nevencaplar opened 9 months ago
Generally, if we're re-partitioning a data set that has been filtered, the resulting partitions will be even coarser, and I think that re-partitioning after such a filter would only make the extra-source-partition-reading problem worse.
Also, this sounds like functionality described in https://github.com/astronomy-commons/lsdb/issues/104, in that you want to perform a re-partition on-the-fly after the spatial filtering. I'm not sure we'd want to always force a re-partitioning after a spatial filter, but could document it as a best practice?
Yeah, this is a similar case of on-the-fly repartitioning. Here, though, the spatial filter tells us where all the points are, so forcing a repartition at a higher order might be more efficient for some later operations, such as joining to sources, since at a lower order most of each pixel will be empty space.
I still have complaints.
object_table.repartition(partitions=source_table.pixel_list()). In this case, we also don't need to compute the statistics of the partitions, because we know the partition structure we would like to use.
When we select only a spatial subset of the object table and then ask for the matching data from the source table, we currently pull all the data from the source partitions that correspond to the selected object partitions. Because the object table partitioning is much coarser, we pull data from many source partitions that would not necessarily be in the original spatial query.
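To put a number on the fan-out described above, here is a rough sketch. It assumes the partitions are HEALPix pixels in the nested numbering scheme, where each pixel splits into 4 children per order step. The orders and the pixel index are made-up values, and `child_pixels` is a hypothetical helper, not an LSDB function.

```python
def child_pixels(pixel: int, order: int, target_order: int) -> range:
    """Nested-scheme pixels at target_order that lie inside (order, pixel)."""
    if target_order < order:
        raise ValueError("target_order must be >= order")
    # Each order step appends two bits to the nested pixel index.
    shift = 2 * (target_order - order)
    return range(pixel << shift, (pixel + 1) << shift)


# Hypothetical orders: object table at order 3, source table at order 6.
object_order, source_order = 3, 6
covered = child_pixels(5, object_order, source_order)

# One coarse object partition maps onto 4**(6 - 3) source partitions.
print(len(covered))  # 64
```

So even if the spatial filter keeps only a small corner of one object partition, all of the source partitions under that object partition would be read.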
The idea is to not have to load the data from the partitions in the source table that are not selected. When the spatial filter is performed on the object table we could repartition the resulting object table to a higher order.
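A minimal sketch of why repartitioning to a higher order helps, again assuming nested HEALPix pixel indices. The function name, the orders, and the point pixels are all hypothetical and only illustrate the counting; this is not how LSDB implements it.

```python
def source_pixels_to_load(point_pixels, point_order, object_order, source_order):
    """Source pixels overlapping the object partitions that hold the points.

    Assumes object_order <= source_order <= point_order (nested scheme).
    """
    # Object partitions that still contain at least one filtered point.
    object_partitions = {
        p >> (2 * (point_order - object_order)) for p in point_pixels
    }
    shift = 2 * (source_order - object_order)
    loaded = set()
    for part in object_partitions:
        # Every source pixel under a kept object partition gets read.
        loaded.update(range(part << shift, (part + 1) << shift))
    return loaded


# Hypothetical filtered points, given as their order-6 pixel indices.
points = [320, 321, 322]

coarse = source_pixels_to_load(points, 6, object_order=3, source_order=6)
fine = source_pixels_to_load(points, 6, object_order=6, source_order=6)
print(len(coarse), len(fine))  # 64 3
```

With the object table left at order 3, the three points drag in all 64 source partitions of their parent pixel; after repartitioning the filtered object table to the source order, only the 3 partitions that actually contain points are read.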