astronomy-commons / lsdb

Large Survey DataBase
https://lsdb.io
BSD 3-Clause "New" or "Revised" License

Support Hive partitioning at the Npix level #435

Open troyraen opened 2 weeks ago

troyraen commented 2 weeks ago

Feature request

Request

Allow Npix to be a directory and to be named without ".parquet" at the end.

This could be a relaxation of the current standard (where Npix is expected to be a single file) rather than a complete replacement. Updating your existing code would take some work, but since most parquet readers accept files and directories interchangeably, you could probably continue supporting the current standard alongside the new one without much trouble.
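For illustration, the on-disk difference would look something like this (the file names inside the pixel directory are hypothetical; the point is only that Npix=11 becomes a directory):

    Current standard (one file per pixel):
        Norder=0/Dir=0/Npix=11.parquet

    Proposed relaxation (one directory per pixel, any number of files inside):
        Norder=0/Dir=0/Npix=11/part-0.parquet
        Norder=0/Dir=0/Npix=11/part-1.parquet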

Benefits

  1. Pyarrow (and thus other readers that use it under the hood) could recognize Npix as an actual partition in the same way it recognizes Norder and Dir.

    • This recognition can be quite important for efficient reads. I haven't tested the relative efficiency of this specific change, but the screenshot and surrounding text in astronomy-commons/hipscat#367 show a similar case.
    • This recognition would also mean that methods like pyarrow.dataset.get_partition_keys (demonstrated below) would return keys for all three partition levels instead of just the first two.
  2. More than one file could be allowed in each leaf partition.

    • This could make it much easier to update HATS catalogs. I have specific use cases in mind for both IRSA and Pitt-Google Broker that would rely on being able to do this. (A sketch of both benefits follows this list.)
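Here is a minimal sketch of what both benefits would look like under the proposed layout. The hypothetical_catalog directory, file names, and column values are made up for illustration; the pyarrow calls are standard.

import pathlib

import pyarrow
import pyarrow.dataset
import pyarrow.parquet

# Build a tiny catalog using the proposed layout: Npix is a directory.
root = pathlib.Path("hypothetical_catalog")
leaf = root / "Norder=0" / "Dir=0" / "Npix=11"
leaf.mkdir(parents=True, exist_ok=True)

# Benefit 2: a leaf partition can hold more than one file, so updating
# the catalog is just a matter of writing another file into the leaf.
table = pyarrow.table({"id": [700, 701], "ra": [282.5, 299.5]})
pyarrow.parquet.write_table(table, leaf / "part-0.parquet")
pyarrow.parquet.write_table(table, leaf / "part-1.parquet")

# Benefit 1: with key=value directories at every level, pyarrow's hive
# partitioning discovers all three keys on its own; no explicit schema needed.
dataset = pyarrow.dataset.dataset(str(root), partitioning="hive")
frag = next(dataset.get_fragments())
pyarrow.dataset.get_partition_keys(frag.partition_expression)
# Expected: {'Norder': 0, 'Dir': 0, 'Npix': 11}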

Example demonstrating that Npix is not currently recognized as a partition

One way to see this is to ask pyarrow what the partitioning keys/values are for a specific leaf partition. Even if we explicitly tell it that Npix is a partition, it won't recognize it.

import pyarrow.dataset

# Assuming we're in the hipscat-import root directory.
small_sky_object_catalog = "tests/hipscat_import/data/small_sky_object_catalog"
ignore_prefixes = [".", "_", "catalog_info.json", "partition_info.csv", "point_map.fits", "provenance_info.json"]

# Explicitly define the partitioning, including Npix.
partitioning_fields = [
    pyarrow.field(name="Norder", type=pyarrow.uint8()),
    pyarrow.field(name="Dir", type=pyarrow.uint64()),
    pyarrow.field(name="Npix", type=pyarrow.uint64()),
]
partitioning = pyarrow.dataset.partitioning(schema=pyarrow.schema(partitioning_fields), flavor="hive")

# Load the dataset and get a single fragment.
dataset = pyarrow.dataset.dataset(
    small_sky_object_catalog, ignore_prefixes=ignore_prefixes, partitioning=partitioning
)
frag = next(dataset.get_fragments())

# Look at the file path so we know which partition this is.
frag.path
# Output: 'tests/hipscat_import/data/small_sky_object_catalog/Norder=0/Dir=0/Npix=11.parquet'

# Ask for the expression that IDs this specific partition.
frag.partition_expression
# Output: <pyarrow.compute.Expression ((Norder == 0) and (Dir == 0))>

# The above didn't show the Npix partition.
# Just to make sure it's not hidden somewhere in the Expression object, ask for a plain dict.
pyarrow.dataset.get_partition_keys(frag.partition_expression)
# Output: {'Norder': 0, 'Dir': 0}


nevencaplar commented 1 week ago

To be implemented in HATS 0.4