Closed by mbjones 6 months ago
nice, thanks!
Re version, I think the version is declared in the embedded metadata section, https://github.com/apache/parquet-format#extensibility. There are at least two versions (1.x and 2.x) now. I believe they are backward compatible and that the key difference in later versions is support for more compression types, but I'm not 100% sure.
Note that Parquet allows a range of compression types even though parsers need only support gzip and snappy. I think(?) the metadata part contains all the relevant information, though: https://github.com/apache/parquet-format#metadata
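For anyone who wants to poke at that metadata part directly, here is a minimal sketch of how the footer can be located using only the Python standard library. It relies on the file layout described in the parquet-format README: a 4-byte `PAR1` magic at both ends of the file, with a 4-byte little-endian footer length just before the trailing magic. The function name `footer_span` is made up for illustration, and the sketch assumes a file with a plaintext footer. It only finds the serialized metadata; decoding it (Thrift) is left to a real Parquet library.

```python
import struct

PARQUET_MAGIC = b"PAR1"  # magic bytes at the start and end of a Parquet file


def footer_span(data: bytes) -> tuple[int, int]:
    """Return (offset, length) of the serialized file metadata in `data`.

    Hypothetical helper for illustration; it does not decode the metadata.
    """
    # Smallest possible layout: leading magic + footer length + trailing magic.
    if len(data) < 12:
        raise ValueError("too short to be a Parquet file")
    if data[:4] != PARQUET_MAGIC or data[-4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic bytes")

    # The 4 bytes before the trailing magic hold the footer length (little endian).
    (length,) = struct.unpack("<I", data[-8:-4])
    offset = len(data) - 8 - length
    if offset < 4:
        raise ValueError("footer length exceeds file size")
    return offset, length


# Usage with a synthetic byte string (not a real Parquet file):
fake = b"PAR1" + b"x" * 10 + b"META!" + struct.pack("<I", 5) + b"PAR1"
offset, length = footer_span(fake)
footer_bytes = fake[offset:offset + length]
```

The version and per-column compression codec live inside those footer bytes, so a validator would hand them to a Thrift decoder or simply use an existing Parquet reader.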
This looks good as-is.
While application/x-parquet seems to have taken off where other forms haven't, I'd bet that, if the Arrow team registered a media type for Parquet, they'd go with this one. My main evidence is (1) that 'vnd' is appropriate and (2) they've already registered Arrow's stream and file formats under the vnd prefix, e.g., https://www.iana.org/assignments/media-types/application/vnd.apache.arrow.file.
@amoeba any updates on the parquet mediaType?
Hey @mbjones, none that I can tell. The best place to track progress would probably be https://issues.apache.org/jira/browse/PARQUET-1889 and that conversation looks stalled. I'll reach out to the Parquet folks to see if I can find someone to send in a registration, or perhaps do it myself.
As of 2023-02-27, it seems vnd.apache.parquet is still in the approval process, awaiting some information updates. See https://lists.apache.org/thread/lrfsjhzoq20o95z5zn9zyrb8rdolqzz7. At this point, it seems unlikely that the media type will change, although I suppose it's uncertain whether it will get approved. Should we move ahead with this, or continue to wait for resolution of the media type?
Seems fine - there's a slight chance the media type may change before final approval, but that may be updated independently of the formatId if necessary.
Hey all, the media type for Apache Parquet, application/vnd.apache.parquet, is now official; see https://www.iana.org/assignments/media-types/application/vnd.apache.parquet.
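In case it helps anyone wiring this up, below is a small sketch of mapping the `.parquet` extension to the registered media type with Python's standard `mimetypes` module. The file name `observations.parquet` is just a placeholder, and a private `MimeTypes` instance is used so the process-wide table is not modified. This is one possible approach, not something DataONE tooling is known to do.

```python
import mimetypes

PARQUET_MEDIA_TYPE = "application/vnd.apache.parquet"  # as registered with IANA

# Use a private lookup table rather than mutating the global one.
db = mimetypes.MimeTypes()
db.add_type(PARQUET_MEDIA_TYPE, ".parquet")

# guess_type returns a (media type, encoding) pair.
guessed_type, encoding = db.guess_type("observations.parquet")
```

Older Python releases do not know the `.parquet` extension out of the box, which is why the explicit `add_type` call is included.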
Thank you, Bryce! Hope everything goes well with you!
Jing
Likewise :)
Parquet Format
Provide the standard metadata for the proposed format, ensuring that the id and name are unique and appropriate to the version of the format being proposed.
application/vnd.apache.parquet
application/vnd.apache.parquet
parquet
Format description
Parquet is a columnar storage format that supports nested data and is becoming more commonly used in science applications. It is developed at the Apache Software Foundation (see https://parquet.apache.org/) and is used extensively in the Hadoop ecosystem. Libraries exist for Java, Python, R, and other environments.
Specification / Namespace documentation
The format is defined at https://github.com/apache/parquet-format. There is no established media type yet, so we are proposing to use the vendor-specific format for the media type.
Checklist
image/png is specific to one format, whereas text/xml is not specific to one format)
DATA, METADATA, or RESOURCE
Considerations