astronomy-commons / lsdb

Large Survey DataBase
https://lsdb.io
BSD 3-Clause "New" or "Revised" License

Support Hive partitioning at the Npix level #435

Open · troyraen opened this issue 1 month ago

troyraen commented 1 month ago

Feature request

Request

Allow Npix to be a directory, and allow it to be named without a ".parquet" suffix.

This could be a relaxation of the current standard (where Npix is expected to be a single file) rather than a complete change. There would be some work to update your existing code, but since most parquet readers accept files and directories interchangeably, you could probably also continue supporting the current standard without much trouble.
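For concreteness, here is one hypothetical layout under the relaxed standard next to the current one. The part-*.parquet filenames are illustrative, not part of the request:

# Current standard: Npix is a single file.
Norder=0/Dir=0/Npix=11.parquet

# Proposed relaxation: Npix may be a directory containing one or more files.
Norder=0/Dir=0/Npix=11/part-0.parquet
Norder=0/Dir=0/Npix=11/part-1.parquet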

Benefits

  1. PyArrow (and thus other readers that use it under the hood) could recognize Npix as an actual partition, in the same way it recognizes Norder and Dir.

    • This recognition can be quite important for efficient reads. I haven't tested the relative efficiency of this specific change, but the screenshot and surrounding text in astronomy-commons/hipscat#367 show a similar case.
    • This recognition would also mean that methods like pyarrow.dataset.get_partition_keys (demonstrated below) would return keys for all three partition levels instead of just the first two.
  2. More than one file could be allowed in each leaf partition.

    • This could make it much easier to update HATS catalogs. I have specific use cases in mind for both IRSA and Pitt-Google Broker that would rely on being able to do this (see the sketch after this list).
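As a sketch of benefit 2: assuming the relaxed layout above, updating a catalog could amount to writing one more parquet file into the relevant Npix directory, leaving existing files untouched. The path, filename, and schema below are hypothetical:

import pyarrow
import pyarrow.parquet

# Hypothetical update: add rows to an existing leaf partition by writing a
# second file into its Npix directory. No existing file is rewritten.
leaf_dir = "small_sky_object_catalog/Norder=0/Dir=0/Npix=11"
new_rows = pyarrow.table({"id": [701, 702]})  # in practice, same schema as the existing files
pyarrow.parquet.write_table(new_rows, f"{leaf_dir}/part-1.parquet")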

Example demonstrating that Npix is not currently recognized as a partition

One way to see this is to ask pyarrow what the partitioning keys/values are for a specific leaf partition. Even if we explicitly tell it that Npix is a partition, it won't recognize it.

import pyarrow.dataset

# Assuming we're in the hipscat-import root directory.
small_sky_object_catalog = "tests/hipscat_import/data/small_sky_object_catalog"
ignore_prefixes = [".", "_", "catalog_info.json", "partition_info.csv", "point_map.fits", "provenance_info.json"]

# Explicitly define the partitioning, including Npix.
partitioning_fields = [
    pyarrow.field(name="Norder", type=pyarrow.uint8()),
    pyarrow.field(name="Dir", type=pyarrow.uint64()),
    pyarrow.field(name="Npix", type=pyarrow.uint64()),
]
partitioning = pyarrow.dataset.partitioning(schema=pyarrow.schema(partitioning_fields), flavor="hive")

# Load the dataset and get a single fragment.
dataset = pyarrow.dataset.dataset(
    small_sky_object_catalog, ignore_prefixes=ignore_prefixes, partitioning=partitioning
)
frag = next(dataset.get_fragments())

# Look at the file path so we know which partition this is.
frag.path
# Output: 'tests/hipscat_import/data/small_sky_object_catalog/Norder=0/Dir=0/Npix=11.parquet'

# Ask for the expression that IDs this specific partition.
frag.partition_expression
# Output: <pyarrow.compute.Expression ((Norder == 0) and (Dir == 0))>

# The above didn't show the Npix partition.
# Just to make sure it's not hidden somewhere in the Expression object, ask for a plain dict.
pyarrow.dataset.get_partition_keys(frag.partition_expression)
# Output: {'Norder': 0, 'Dir': 0}
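For contrast, here is a minimal sketch of what the requested behavior looks like when Npix really is a Hive directory: writing the same three partition columns with pyarrow's own hive partitioning makes all three keys recognizable on read. The table contents and temp directory are made up for illustration:

import tempfile
import pyarrow
import pyarrow.dataset

partitioning = pyarrow.dataset.partitioning(
    pyarrow.schema(
        [("Norder", pyarrow.uint8()), ("Dir", pyarrow.uint64()), ("Npix", pyarrow.uint64())]
    ),
    flavor="hive",
)

# Write a tiny table partitioned on all three columns, so Npix becomes a directory.
table = pyarrow.table(
    {
        "Norder": pyarrow.array([0], type=pyarrow.uint8()),
        "Dir": pyarrow.array([0], type=pyarrow.uint64()),
        "Npix": pyarrow.array([11], type=pyarrow.uint64()),
        "id": pyarrow.array([700], type=pyarrow.int64()),
    }
)
base_dir = tempfile.mkdtemp()
pyarrow.dataset.write_dataset(table, base_dir, format="parquet", partitioning=partitioning)

# Read it back; this time Npix is recognized as a real partition.
dataset = pyarrow.dataset.dataset(base_dir, partitioning=partitioning)
frag = next(dataset.get_fragments())
frag.path
# e.g. '<base_dir>/Norder=0/Dir=0/Npix=11/part-0.parquet'
pyarrow.dataset.get_partition_keys(frag.partition_expression)
# Output: {'Norder': 0, 'Dir': 0, 'Npix': 11}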


nevencaplar commented 1 month ago

To be implemented in HATS 0.4

delucchi-cmu commented 1 month ago

I believe there are no changes required in HATS, but we may want to allow for more custom path-finding when loading via LSDB.