Open ianmcook opened 3 weeks ago
I'm a longtime fan of the Arrow project and I was hoping to get a bit more involved. I figured this might be a good first issue. Please let me know if it is not, or if the following is a bad idea.
Instead of something like a YAML or XML type, would a "rational" type make sense? Something like:
import pyarrow as pa
import pyarrow.types as pt


class RationalType(pa.ExtensionType):
    """
    A rational number represented as a struct of an integer `numer` (the numerator)
    and an integer `denom` (the denominator)
    """

    def __init__(self, data_type: pa.DataType):
        if not pt.is_integer(data_type):
            raise TypeError(f"data_type must be an integer type, not {data_type}")
        super().__init__(
            pa.struct(
                [
                    ("numer", data_type),
                    ("denom", data_type),
                ],
            ),
            # N.B. This name does _not_ reference `data_type` so deserialization
            # will work for _any_ integer `data_type` after registration
            "my_package.rational",
        )

    def __arrow_ext_serialize__(self) -> bytes:
        # No serialized metadata necessary
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # Return an instance of this subclass given the storage type and the
        # serialized metadata (empty here); the integer type is recovered
        # from the first struct field
        return cls(storage_type[0].type)
This shows off more of the parameters that get passed around than the current UUID example does.
Thanks @khwilson! Sounds good to me.
@rok do you have any comments?
Rational seems like a good example! Complex was discussed in the past too, but it will probably be proposed as a canonical type candidate (@sjperkins?). So if we're sure rational won't be a canonical type I think rational is a good candidate. It also feels like an easier type to give pedagogical examples on than YAML/XML. On the other hand, someone could nicely show how string kernels work on string storage. We don't really need to pick one - we can mix it up.
> Rational seems like a good example! Complex (https://github.com/apache/arrow/pull/10452) was discussed in the past too, but it will probably be proposed as a canonical type candidate (@sjperkins?). So if we're sure rational won't be a canonical type I think rational is a good candidate.
Thanks for the ping @rok -- I really should re-propose a Complex number. I'm now thinking along the lines of ComplexFloat = FixedSizeBinary(64) and ComplexDouble = FixedSizeBinary(128), rather than the original FixedSizeListArray(float32(), 2) and FixedSizeListArray(float64(), 2) approach. I think the former will work better with FixedShapeTensor and VariableShapeTensor.
I'm currently focused in other areas at the moment, but would like to revisit Complex numbers at some point.
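As a rough illustration of the fixed-width layout mentioned above, here is a standard-library-only sketch that packs a complex value into a fixed number of bytes. It rests on several assumptions that are not confirmed in this thread: that the 64 and 128 are bit widths (so 8 and 16 bytes per value), that the real part comes before the imaginary part, and that the components are little-endian. The helper names are hypothetical.

```python
import struct

# Assumed layouts: (real, imag) as two little-endian floats of equal width.
COMPLEX_FLOAT = struct.Struct("<2f")   # 2 x float32 = 8 bytes
COMPLEX_DOUBLE = struct.Struct("<2d")  # 2 x float64 = 16 bytes


def pack_complex(value: complex, layout: struct.Struct = COMPLEX_DOUBLE) -> bytes:
    """Encode one complex number as a fixed-width binary value."""
    return layout.pack(value.real, value.imag)


def unpack_complex(raw: bytes, layout: struct.Struct = COMPLEX_DOUBLE) -> complex:
    """Decode one fixed-width binary value back into a complex number."""
    real, imag = layout.unpack(raw)
    return complex(real, imag)


values = [1 + 2j, -0.5 + 0.25j]

# Concatenating fixed-width values gives one contiguous buffer, which is
# how a fixed-size binary column would hold them.
packed = b"".join(pack_complex(v) for v in values)
width = COMPLEX_DOUBLE.size
decoded = [
    unpack_complex(packed[i : i + width]) for i in range(0, len(packed), width)
]
print(decoded)
```

Under these assumptions each value is a single opaque fixed-width cell rather than a two-element list, so element `i` always starts at byte offset `i * width`.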
> Thanks for the ping @rok -- I really should re-propose a Complex number. I'm now thinking along the lines of ComplexFloat = FixedSizeBinary(64) and ComplexDouble = FixedSizeBinary(128), rather than the original FixedSizeListArray(float32(), 2) and FixedSizeListArray(float64(), 2) approach. I think the former will work better with FixedShapeTensor and VariableShapeTensor.
Oh, interesting approach. Are there other systems that do this? Would this approach be better suited to vectorization? I suppose it would be more efficient for Parquet.
> I'm currently focused in other areas at the moment, but would like to revisit Complex numbers at some point.
Feel free to ping me when you do!
@khwilson please tag me and @rok to review when you have a PR open. Thanks!
Will do!
Describe the bug, including details regarding any error messages, version, and platform.
In the Format docs and Python docs, there are several examples of user-defined extension types and sample code showing how to implement them (by subclassing). These all use a UUID extension type as the example:
Now that UUID is a canonical extension type (#41299) and will have native support in C++ and Python (#37298), we should replace these with examples based on some other user-defined extension type—ideally one that is not likely to become a canonical extension type anytime soon. Maybe an XML or YAML extension type (with UTF8 storage type)?
Component(s)
Documentation