Rename ancillary files like "catalog_info.json" to start with "_" so they will be ignored by default.
Goal
Allow users to make a simple call to pandas.read_parquet (or other standard Python parquet readers) without having to specify the ignore_prefixes keyword argument.
Details
Currently, the simplest call that works seems to be:
import pandas as pd

# assuming we're in the hipscat-import root directory
small_sky_object_catalog = "tests/hipscat_import/data/small_sky_object_catalog"

pd.read_parquet(
    small_sky_object_catalog,
    partitioning=None,  # see issue #367 for why this is necessary
    ignore_prefixes=[
        ".",
        "_",
        "catalog_info.json",
        "partition_info.csv",
        "point_map.fits",
        "provenance_info.json",
    ],
)
It's cumbersome to have to specify the ignore_prefixes kwarg every time, but without it that call throws the error:
ArrowInvalid: Could not open Parquet input source 'tests/hipscat_import/data/small_sky_object_catalog/partition_info.csv': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
Filenames that start with "." or "_" are ignored by default, so renaming the ancillary files to start with "_" would allow the user to skip the ignore_prefixes kwarg.
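To make the effect of the rename concrete, here is a minimal sketch of the prefix rule described above. It is a simplified model, not the reader's actual implementation: the helper is_ignored and the constant DEFAULT_IGNORE_PREFIXES are hypothetical names introduced for illustration, and the sketch assumes the rule is applied to the file's base name.

```python
from pathlib import PurePosixPath

# Assumed to mirror the default behavior described above:
# names starting with "." or "_" are skipped.
DEFAULT_IGNORE_PREFIXES = (".", "_")


def is_ignored(path, ignore_prefixes=DEFAULT_IGNORE_PREFIXES):
    """Return True if the file's base name starts with an ignored prefix."""
    name = PurePosixPath(path).name
    return name.startswith(tuple(ignore_prefixes))


# Ancillary files as they are named today (taken from the call above).
current_names = [
    "catalog_info.json",
    "partition_info.csv",
    "point_map.fits",
    "provenance_info.json",
]

# The proposed names: the same files with a leading underscore.
renamed = ["_" + name for name in current_names]

# None of the current names match a default prefix, so a reader would
# try to open them as parquet; all of the renamed ones would be skipped.
print([is_ignored(name) for name in current_names])
print([is_ignored(name) for name in renamed])
```

Under this model, none of the current names is skipped by default, which is why ignore_prefixes has to list them explicitly, while every renamed file is skipped without any extra keyword argument.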
Before submitting
Please check the following:
[x] I have described the purpose of the suggested change, specifying what I need the enhancement to accomplish, i.e. what problem it solves.
[x] I have included any relevant links, screenshots, environment information, and data relevant to implementing the requested feature, as well as pseudocode for how I want to access the new functionality.
[x] If I have ideas for how the new feature could be implemented, I have provided explanations and/or pseudocode and/or task lists for the steps.
The problem to be solved by a different directory structure:
The directory structure we are proposing follows:
/
- properties
- partition_info.csv
-
- dataset/
  - _common_metadata
  - _metadata
  - Norder=K/
    - Dir=J/
      - Npix=M.parquet
In this way, the /dataset/ directory would be, by itself, a totally valid parquet dataset that can be read by many off-the-shelf parquet libraries.
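The following sketch illustrates why pointing a reader at dataset/ rather than at the catalog root avoids the problem. It builds the proposed layout with empty placeholder files in a temporary directory and applies a simplified version of the default prefix rule. The helpers discover_data_files and build_example_catalog are hypothetical, the values Norder=0, Dir=0, Npix=11 are arbitrary stand-ins for K, J, M, and the discovery logic is only an approximation of what a real parquet library does.

```python
import tempfile
from pathlib import Path


def discover_data_files(root, ignore_prefixes=(".", "_")):
    """List files under root, skipping paths with an ignored-prefix component."""
    root = Path(root)
    found = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        parts = path.relative_to(root).parts
        if any(part.startswith(tuple(ignore_prefixes)) for part in parts):
            continue
        found.append(path.relative_to(root).as_posix())
    return found


def build_example_catalog(root):
    """Create empty placeholder files following the proposed layout."""
    root = Path(root)
    (root / "dataset" / "Norder=0" / "Dir=0").mkdir(parents=True)
    for relative in [
        "properties",
        "partition_info.csv",
        "dataset/_common_metadata",
        "dataset/_metadata",
        "dataset/Norder=0/Dir=0/Npix=11.parquet",
    ]:
        (root / relative).touch()
    return root


with tempfile.TemporaryDirectory() as tmp:
    catalog = build_example_catalog(tmp)
    # From the catalog root, the ancillary files are still picked up.
    at_root = discover_data_files(catalog)
    # From dataset/, only the partitioned parquet file remains.
    in_dataset = discover_data_files(catalog / "dataset")

print(at_root)
print(in_dataset)
```

In this model, discovery from the catalog root still returns properties and partition_info.csv alongside the parquet file, whereas discovery from dataset/ returns only the Npix file, since _metadata and _common_metadata are skipped by the default prefixes.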