apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0
1.76k stars 428 forks source link

PARQUET-758: Add Float16/Half-float logical type #184

Closed anjakefala closed 11 months ago

anjakefala commented 2 years ago

In the Mailing List, I proposed the addition of a Half Float (float16) type in Parquet: https://lists.apache.org/thread/03vmcj7ygwvsbno764vd1hr954p62zr5

This type is becoming increasingly popular in Machine Learning, and there are a bundle of issues requesting its support in Parquet: https://issues.apache.org/jira/browse/PARQUET-1647 https://issues.apache.org/jira/browse/PARQUET-758 https://issues.apache.org/jira/browse/ARROW-17464 https://github.com/apache/arrow/issues/2691

This is my first logical type proposal! I followed this PR as inspiration, but really welcome feedback from the community.

Implementation PRs:


Make sure you have checked all steps below.

Jira

Commits

Documentation

pitrou commented 2 years ago

@anjakefala You need to add to the LogicalType union, not to the Type enum (which is for physical types).

Also cc @emkornfield

emkornfield commented 2 years ago

We should probably specify that using the Byte Split Encodings can be used for this type as well?

Also, in general, if possible try to avoid force pushing, as it makes it harder to compare iterative changes (this might not be the style in this repo, though so if you found instructions elsewhere on this, please ignore).

emkornfield commented 2 years ago

It isn't clear to me if this should be a logical type or a physical type. We would need understand if there is different handling for forward compatibility purposes (what do we want the desired behavior to be be). I think C++ might be lenient here, but don't know about parquet-mr @gszadovszky thoughts?

gszadovszky commented 2 years ago

It isn't clear to me if this should be a logical type or a physical type. We would need understand if there is different handling for forward compatibility purposes (what do we want the desired behavior to be be). I think C++ might be lenient here, but don't know about parquet-mr @gszadovszky thoughts?

I think the basic idea behind having physical and logical types is to support forward compatibility since we can always represent (somehow) a long-existing physical type while logical types are getting extended. Parquet-mr should work fine with "unknown" logical types by reading it back as an un-annotated physical vale (a Binary with two bytes in this case). So, if the community supports having a half-precision floating point type I would vote on specifying it as a logical type.

The tricky thing will be the implementations. Even though parquet-mr does not really care about converting the values according to their logical types we still need to care about the logical types at the ordering (min/max values in the statistics). It would not be too easy to implement the half-precision floating point comparison logic since java does not have such a primitive type. (BTW the sorting order of floating point numbers are still an open issue: PARQUET-1222)

pitrou commented 2 years ago

It would not be too easy to implement the half-precision floating point comparison logic since java does not have such a primitive type.

While not effortless, it should be relatively easy to adapt one of the routines that's available from other open source projects, such as Numpy: https://github.com/numpy/numpy/blob/8a0859835d3e6002858b9ffd9a232b059cf9ea6c/numpy/core/src/npymath/halffloat.c#L169-L190 (npy_half is just an unsigned 16-bit integer in this context)

gszadovszky commented 2 years ago

It would not be too easy to implement the half-precision floating point comparison logic since java does not have such a primitive type.

While not effortless, it should be relatively easy to adapt one of the routines that's available from other open source projects, such as Numpy: https://github.com/numpy/numpy/blob/8a0859835d3e6002858b9ffd9a232b059cf9ea6c/numpy/core/src/npymath/halffloat.c#L169-L190 (npy_half is just an unsigned 16-bit integer in this context)

It is not that trivial. For the half-precision floating point numbers we do not have native support for either cpp or java so we can define the total ordering as we want. But we shall do the same for the existing floating point numbers that most languages have native support. Even though they are following the same standard the total ordering either does not exist or have different implementations. See PARQUET-1222 for details.

emkornfield commented 2 years ago

t is not that trivial. For the half-precision floating point numbers we do not have native support for either cpp or java so we can define the total ordering as we want. But we shall do the same for the existing floating point numbers that most languages have native support. Even though they are following the same standard the total ordering either does not exist or have different implementations. See PARQUET-1222 for details.

I think these are orthogonal. I might be missing something but it seems like it would not be to hard to case float16 to float in java/cpp and do the comparison in that space and cast it back down. This might not be the most efficient implementation but would be straightforward? I am probably missing something. It would be nice to resolve PARQUET-1222 so the same semantics would apply to all floating point numbers.

The tricky thing will be the implementations. Even though parquet-mr does not really care about converting the values according to their logical types we still need to care about the logical types at the ordering (min/max values in the statistics).

It seems this would require parquet implementations to null out statistics for logical types that they don't support, does parquet-mr do that today?

gszadovszky commented 2 years ago

I've came up with this ordering thing because we specify it for every logical types. (Unfortunately we don't do this for primitive types.) Therefore, I would expect to have the order specified for this new logical type as well which is not trivial and requires to solve PARQUET-1222. At least we should add a note about this issue.

It seems this would require parquet implementations to null out statistics for logical types that they don't support, does parquet-mr do that today?

I do not have the proper environment to test it but based on the code we do not handle unknown logical types well in parquet-mr. I think it handles unknown logical types as if they were not there at all which is fine from the data point of view but we would blindly use the statistics which may cause data loss. Created PARQUET-2182 to track this.

emkornfield commented 2 years ago

I think Parquet C++ probably has the same issue. Let me reread PARQUET-1222. to see what the current state is and if we can push it forward.

pitrou commented 2 years ago

I agree with @emkornfield that ordering issues seem largely orthogonal, as they also affect FLOAT32 and FLOAT64 types.

anjakefala commented 2 years ago

@pitrou @emkornfield @gszadovszky

Is there anything I can do to move this addition forward? Can I help with any code?

In terms of design, my understanding from reading the comments is that @gszadovszky brought up an ordering concern (valid, but not a blocker?), and that a decision needs to be made on whether float16 would be implemented as a logical or physical type?

emkornfield commented 2 years ago

Sorry for the delay, it sounds like PARQUET-1222 is blocker, let me make a proposal there and see if we can at least come to consensus on approach and maybe this feature can be the first test-case for it.

emkornfield commented 1 year ago

Sorry for the delay but PARQUET-1222 has now been merged, so I think this is unblocked.

anjakefala commented 1 year ago

Thanks so much for the update @emkornfield!

What is the next step I can take?

emkornfield commented 1 year ago

@anjakefala IIUC, I think the prior objection was around not properly floating point sorting for statistics correctly. I think the main thing is to update the specification to say this requires the same sorting logic as float32 and float64. I need to rereview the current state of things, but then I think we can probably try to vote on the mailing list to see if this type is acceptable. I'm not sure on the exact process here (I don't know if implementations are required before a vote). @gszadovszky thoughts?

anjakefala commented 1 year ago

Thank you @emkornfield! I added the sort order to the spec.

anjakefala commented 1 year ago

Hey @emkornfield! Is it reasonable for me to send a proposal to the mailing list for a vote? It seems @gszadovszky is not available for insight; is there anyone else that can provide it?

emkornfield commented 1 year ago

@shangxinli are there guidelines for what needs to happen to accept this addition?

pitrou commented 1 year ago

@shangxinli are there guidelines for what needs to happen to accept this addition?

I suppose it needs a discussion and then a formal vote on the ML?

shangxinli commented 1 year ago

As @julienledem mentioned in the email discussion, let's have the corresponding PRs for support in the Java and C++ implementation ready before we merge this PR. We would like to have implementation support when the new type is released.

emkornfield commented 1 year ago

As @julienledem mentioned in the email discussion, let's have the corresponding PRs for support in the Java and C++ implementation ready before we merge this PR. We would like to have implementation support when the new type is released.

It might have missed it but I didn't see Julien's reply on the dev mailing list. This seems reasonable though.

pitrou commented 1 year ago

It might have missed it but I didn't see Julien's reply on the dev mailing list. This seems reasonable though.

For full disclosure, it was a discussion involving the Parquet PMC and myself after I messaged the PMC to inquire about the way forward on this proposal.

julienledem commented 1 year ago

@emkornfield apologies for this, I realize I replied to a thread on the private list. @pitrou emailed private@ to get more eyes on this. Which worked. But yes, this discussion didn't need to be private. I replied: "I would recommend having the corresponding PRs for support in the Java and C++ implementation ready as well for this to be merged so that we don't release a new type that does not have support in the implementation."

JFinis commented 1 year ago

It isn't clear to me if this should be a logical type or a physical type. We would need understand if there is different handling for forward compatibility purposes (what do we want the desired behavior to be be). I think C++ might be lenient here, but don't know about parquet-mr @gszadovszky thoughts?

I think the basic idea behind having physical and logical types is to support forward compatibility since we can always represent (somehow) a long-existing physical type while logical types are getting extended. Parquet-mr should work fine with "unknown" logical types by reading it back as an un-annotated physical vale (a Binary with two bytes in this case). So, if the community supports having a half-precision floating point type I would vote on specifying it as a logical type.

The tricky thing will be the implementations. Even though parquet-mr does not really care about converting the values according to their logical types we still need to care about the logical types at the ordering (min/max values in the statistics). It would not be too easy to implement the half-precision floating point comparison logic since java does not have such a primitive type. (BTW the sorting order of floating point numbers are still an open issue: PARQUET-1222)

FWIW, I rather think it should be a physical type for the following reasons:

pitrou commented 1 year ago

FWIW, I rather think it should be a physical type for the following reasons:

  • encodings are currently only defined on the physical type, not the logical one. So allowing BYTE_STREAM_SPLIT for this type would actually break this if it is a logical type.

This seems reasonable, but BYTE_STREAM_SPLIT is the only relevant encoding here (and it's probably not widely used yet).

  • Having this be a logical type while float and double are physical types seems inconsistent.

While I agree it seems inconsistent, this is an idealistic argument. If Float16 had been included from scratch, it would be logical ( :-) ) to make it a physical type because there would not be any compatibility problem.

  • There might eventually be hardware support or native language support for this for this type. In this case, having it as physical type would allow easier to leverage this hardware / language support, as most libraries instantiate encoders/decoders based on the physical type.

I'm not sure by how much making Float16 a physical type would make encoding/decoding easier. The main change would be that bytewidth is known at compile time, instead of at runtime. Other than that, having HW support for Float16 would not change the ease of copying data two bytes at a time...

  • IMHO, the basic idea behind physical and logical types is not to support forward compatibility; that is just a byproduct. Otherwise, there should just be one or two physical types in the first place (FIXED_LEN_BYTE_ARRAY and BYTE_ARRAY).

That's partly true, but once the Parquet format is widely used, forward compatibility becomes a significant concern.

If Float16 is a new physical type, existing systems will probably not be able to read data with such a column at all. If Float16 is a new logical type, existing systems should be able to read the data; they just won't be able to draw any insight from the Float16 columns (but will be able to process the other columns).

To sum it up:

gszadovszky commented 1 year ago

I think that compatibility in Parquet file format is such a strong requirement that extending primitive types is simply not an option. (I agree though, if I would introduce a new file format I would specify only 3 primitive types: FIXED_LEN, VAR_LEN, BIT. But we already have what we have.) Meanwhile, I don't see why the Float16 physical type would be a requirement to use BYTE_STREAM_SPLIT. I don't think it is widely used so we can update the related spec to allow this encoding to be used for any primitive type (except BOOLEAN). Then, it is up to the implementations to use it for FIXED_LEN_BYTE_ARRAY[2] (FLOAT16).

anjakefala commented 1 year ago

PR for C++ float16 implementation in Parquet: https://github.com/apache/arrow/pull/36073

anjakefala commented 12 months ago

It seems that both the Java implementation and the C++ implementation are in a state of readiness.

Has the vote started? Can anyone with visibility update me on the status?

benibus commented 12 months ago

@anjakefala Agreed that everything seems to be in place. I'll be starting the vote on the ML later today.

anjakefala commented 12 months ago

@pitrou @emkornfield @gszadovszky @JFinis @julienledem @shangxinli

The vote has been started by @benibus here: https://lists.apache.org/thread/gyvqcx9ssxkjlrwogqwy7n4z6ofdm871 Please chime in! I would also appreciate anyone forwarding the vote to the private listserv.

benibus commented 11 months ago

@pitrou @gszadovszky @julienledem

Given that the vote for this has just passed, I believe we should be good to merge this now? (pending a final review pass, of course)

wgtmac commented 11 months ago

@pitrou @gszadovszky @julienledem

Given that the vote for this has just passed, I believe we should be good to merge this now? (pending a final review pass, of course)

Should we merge the PR in parquet-format first? My point is that it would be weird if this change commits with an unreleased and even uncommitted change of parquet.thrift.

anjakefala commented 11 months ago

@wgtmac

Should we merge the PR in parquet-format first? My point is that it would be weird if this change commits with an unreleased and even uncommitted change of parquet.thrift.

This is the parquet-format PR!

There are too many PRs. xD

wgtmac commented 11 months ago

@wgtmac

Should we merge the PR in parquet-format first? My point is that it would be weird if this change commits with an unreleased and even uncommitted change of parquet.thrift.

This is the parquet-format PR!

There are too many PRs. xD

My bad! I got lost in these PRs.

pitrou commented 11 months ago

I agree with @emkornfield that we should allow the encoding BYTE_STREAM_SPLIT to be used for this new logical type. It is fine to handle it separately, though.

I would contend that perhaps BYTE_STREAM_SPLIT wouldn't yield very interesting results on FLOAT16. It would be interesting to get numbers.

benibus commented 11 months ago

I've suggested the name FLOAT_16 in the vote like we already have logical types INT_8 etc

I think this was only the convention for legacy ConvertedType enums. We could theoretically deviate from that since there are no sized integral logical types (they're all rolled into INTEGER/IntType).

As for BYTE_STREAM_SPLIT, my feeling is that we'll probably want it (for parity with FLOAT/DOUBLE, at least), but it could be added as a follow-up - along with an implementation + benchmarks, if necessary. There's also some ambiguity about whether to support BSS for FixedLenByteArray generally, which may warrant a separate discussion.

anjakefala commented 11 months ago

@gszadovszky What is the merging process once it has approval and passed voting? =)

gszadovszky commented 11 months ago

@benibus, could you officially close the vote on the mailing list so it is clear that it has passed? @anjakefala, since we have 3 approvals already on this PR, any committer can push it. I would wait for the official closing of the vote, thought.

benibus commented 11 months ago

For the record, I've announced the vote's passing in the original ML thread itself (apologies if the [RESULT] thread wasn't sufficient).

gszadovszky commented 11 months ago

Sorry, @benibus. My bad. Thank you for managing the vote! I'm pushing this PR...

gszadovszky commented 11 months ago

@anjakefala, do you have a jira user so I can assign it to you?

anjakefala commented 11 months ago

I really appreciate everyone who took time out of their lives to give this PR attention! :)) Thanks for the final merge @gszadovszky! And yes, my apache arrow JIRA handle is the same as github @anjakefala.