astronomy-commons / hats

Hierarchical Progressive Survey Catalog
https://hats.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Add "dataset" subdirectory for all parquet files #366

Closed (troyraen closed 3 days ago)

troyraen commented 2 weeks ago

Feature request

Request

Rename ancillary files like "catalog_info.json" to start with "_" so they will be ignored by default.

Goal

Allow users to make a simple call to pandas.read_parquet (or other standard Python parquet readers) without having to specify the ignore_prefixes keyword argument.

Details

Currently, the simplest call that works seems to be:

import pandas as pd

# assuming we're in the hipscat-import root directory
small_sky_object_catalog = "tests/hipscat_import/data/small_sky_object_catalog"

pd.read_parquet(
    small_sky_object_catalog,
    partitioning=None,  # see issue #367 for why this is necessary
    ignore_prefixes=[
        ".",
        "_",
        "catalog_info.json",
        "partition_info.csv",
        "point_map.fits",
        "provenance_info.json",
    ],
)

It's cumbersome to have to specify the ignore_prefixes kwarg every time, but without it that call throws the error:

ArrowInvalid: Could not open Parquet input source 'tests/hipscat_import/data/small_sky_object_catalog/partition_info.csv': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

Filenames that start with "." or "_" are ignored by default, so renaming the ancillary files to start with "_" would allow the user to skip the ignore_prefixes kwarg.
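
To make the proposed rename concrete, here is a minimal sketch using only the standard library. The helper name prefix_ancillary_files is hypothetical and not part of hats or hipscat-import; the file names are the ones from the ignore_prefixes list above, and the sketch assumes the ancillary files sit at the top level of the catalog directory.

```python
from pathlib import Path

# Ancillary file names taken from the ignore_prefixes list in this issue.
ANCILLARY_FILES = (
    "catalog_info.json",
    "partition_info.csv",
    "point_map.fits",
    "provenance_info.json",
)


def prefix_ancillary_files(catalog_dir, names=ANCILLARY_FILES, prefix="_"):
    """Hypothetical helper: rename top-level ancillary files to start with `prefix`.

    Only files that actually exist are renamed; directories (for example the
    partition folders) and parquet files are left untouched.
    Returns the list of new paths.
    """
    root = Path(catalog_dir)
    renamed = []
    for name in names:
        source = root / name
        if source.is_file():
            target = source.with_name(prefix + name)
            source.rename(target)
            renamed.append(target)
    return renamed
```

If the default prefix handling behaves as described above, then after running this on a catalog directory the read should reduce to pd.read_parquet(small_sky_object_catalog, partitioning=None) with no ignore_prefixes argument. Any code that opens the ancillary files by their current names would need the same rename, which this sketch does not address.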



nevencaplar commented 1 week ago

The problem to be solved by a different directory structure:

The directory structure we are proposing follows: