DataONEorg / object-formats

DataONE Object Formats controlled vocabulary
Apache License 2.0

Apache Parquet #36

Closed mbjones closed 6 months ago

mbjones commented 2 years ago

Parquet Format

Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.

Format description

Parquet is a columnar storage format with support for nested data, and it is becoming more commonly used in science applications. It is developed at the Apache Software Foundation (see https://parquet.apache.org/) and is used extensively in the Hadoop ecosystem. Libraries exist for Java, Python, R, and other environments.

Specification / Namespace documentation

The format is defined at https://github.com/apache/parquet-format. There is no established media type yet, so we are proposing to use the vendor-specific format for the media type.

Checklist

Considerations

cboettig commented 2 years ago

nice, thanks!

cboettig commented 2 years ago

Re version: I think the version is declared in the embedded metadata section, https://github.com/apache/parquet-format#extensibility. There are at least two versions (1.x and 2.x) now. I believe they are backward compatible and that the key difference in later versions is support for more compression types, but I'm not 100% sure.

Note that Parquet allows a range of compression types even though parsers need only support gzip and snappy. I think(?) the metadata part contains all the relevant information, though: https://github.com/apache/parquet-format#metadata
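
For anyone who wants to check where that embedded metadata lives, here is a minimal sketch using only the Python standard library. It assumes the usual unencrypted layout (a 4-byte `PAR1` magic at both ends, with a 4-byte little-endian footer length just before the trailing magic). The function name `parquet_footer_length` is made up for illustration, and the sketch does not decode the Thrift-encoded metadata itself.

```python
import struct
from typing import Optional

MAGIC = b"PAR1"  # magic bytes at the start and end of a plain Parquet file


def parquet_footer_length(data: bytes) -> Optional[int]:
    """Return the size of the footer metadata, or None if data does not look like Parquet.

    Hypothetical helper: it only inspects the framing bytes and does not
    parse the metadata (version, codecs, schema) stored in the footer.
    """
    # Smallest possible layout: leading magic + footer length + trailing magic.
    if len(data) < 12:
        return None
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        return None
    # The footer length is a 4-byte little-endian unsigned integer.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    # Reject lengths that cannot fit between the two magic markers.
    if footer_len > len(data) - 12:
        return None
    return footer_len
```

In practice a library such as pyarrow would be the better way to read the version and codec information; this only shows where the footer sits in the file.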

amoeba commented 2 years ago

This looks good as-is.

Although application/x-parquet seems to have taken off where other forms haven't, I'd bet that, if the Arrow team registered a media type for Parquet, they'd go with the vnd form proposed here. My main evidence is (1) that 'vnd' is appropriate and (2) that they've already registered Arrow's stream and in-memory formats under the vnd prefix, e.g., https://www.iana.org/assignments/media-types/application/vnd.apache.arrow.file.

mbjones commented 1 year ago

@amoeba any updates on the parquet mediaType?

amoeba commented 1 year ago

Hey @mbjones, none that I can tell. The best place to track progress would probably be https://issues.apache.org/jira/browse/PARQUET-1889 and that conversation looks stalled. I'll reach out to the Parquet folks to see if I can find someone to send in a registration, or perhaps do it myself.

mbjones commented 1 year ago

As of 2023-02-27, it seems vnd.apache.parquet is still in the approval process and awaiting some information updates. See https://lists.apache.org/thread/lrfsjhzoq20o95z5zn9zyrb8rdolqzz7. At this point, it seems unlikely that the media type will change, although I suppose it's uncertain whether it will get approved. Should we move ahead with this, or just continue to wait for resolution of the media type?

datadavev commented 1 year ago

Seems fine - there's a slight chance the media type may change before final approval, but that may be updated independently of the formatId if necessary.

amoeba commented 7 months ago

Hey all, the media type for Apache Parquet of application/vnd.apache.parquet is now official, see https://www.iana.org/assignments/media-types/application/vnd.apache.parquet.
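
As a rough illustration of putting the registered type to use, the sketch below teaches Python's standard `mimetypes` module about it. The `.parquet` extension mapping and the file name `observations.parquet` are assumptions for the example, not part of the IANA registration text quoted here.

```python
import mimetypes

PARQUET_MEDIA_TYPE = "application/vnd.apache.parquet"

# Assumption: files carry the conventional ".parquet" extension.
mimetypes.add_type(PARQUET_MEDIA_TYPE, ".parquet")

# guess_type returns a (type, encoding) tuple; only the type is needed here.
guessed_type, _encoding = mimetypes.guess_type("observations.parquet")
```

After this, `guessed_type` holds the vnd media type, so tooling that relies on `mimetypes` can label Parquet objects consistently.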

taojing2002 commented 7 months ago

Thank you, Bryce! Hope everything goes well with you!

Jing

On Wed, Feb 14, 2024 at 1:52 PM Bryce Mecum @.***> wrote:

Hey all, the media type for Apache Parquet of application/vnd.apache.parquet is now official, see https://www.iana.org/assignments/media-types/application/vnd.apache.parquet .


amoeba commented 7 months ago

Likewise :)