apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Docs] Update extension type examples to not use UUID #43809

Open ianmcook opened 3 weeks ago

ianmcook commented 3 weeks ago

Describe the bug, including details regarding any error messages, version, and platform.

In the Format docs and Python docs, there are several examples of user-defined extension types and sample code showing how to implement them (by subclassing). These all use a UUID extension type as the example.

Now that UUID is a canonical extension type (#41299) and will have native support in C++ and Python (#37298), we should replace these with examples based on some other user-defined extension type—ideally one that is not likely to become a canonical extension type anytime soon. Maybe an XML or YAML extension type (with UTF8 storage type)?

Component(s)

Documentation

khwilson commented 2 weeks ago

I'm a longtime fan of the Arrow project and was hoping to get a bit more involved. I figured this might be a good first issue. Please let me know if it is not, or if the following is a bad idea.

Instead of something like a YAML or XML type, would a "rational" type make sense? Something like:

import pyarrow as pa
import pyarrow.types as pt

class RationalType(pa.ExtensionType):
    """
    A rational number represented as a struct of an integer `numer` (the numerator)
    and an integer `denom` (the denominator)
    """

    def __init__(self, data_type: pa.DataType):
        if not pt.is_integer(data_type):
            raise TypeError(f"data_type must be an integer type not {data_type}")

        super().__init__(
            pa.struct(
                [
                    ("numer", data_type),
                    ("denom", data_type),
                ],
            ),
            # N.B. This name does _not_ reference `data_type`, so deserialization
            # will work for _any_ integer `data_type` after registration
            "my_package.rational",
        )

    def __arrow_ext_serialize__(self) -> bytes:
        # No serialized metadata necessary
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # Return an instance of this subclass given the storage type and the
        # serialized metadata (unused here, since none is written)
        return cls(storage_type[0].type)

This shows off a few more of the parameters that are passed around than the current UUID example.

ianmcook commented 2 weeks ago

Thanks @khwilson! Sounds good to me.

@rok do you have any comments?

rok commented 2 weeks ago

Rational seems like a good example! Complex was discussed in the past too, but it will probably be proposed as a canonical type candidate (@sjperkins?). So if we're sure rational won't be a canonical type, I think rational is a good candidate. It also feels like an easier type to give pedagogical examples on than YAML/XML. On the other hand, someone could nicely show how string kernels work on string storage. We don't really need to pick one; we can mix it up.

sjperkins commented 2 weeks ago

> Rational seems like a good example! https://github.com/apache/arrow/pull/10452 was discussed in the past too, but it will probably be proposed as a canonical type candidate (@sjperkins?). So if we're sure rational won't be a canonical type I think rational is a good candidate.

Thanks for the ping @rok -- I really should re-propose a Complex number. I'm now thinking along the lines of ComplexFloat = FixedSizeBinary(64) and ComplexDouble = FixedSizeBinary(128), rather than the original FixedSizeListArray(float32(), 2) and FixedSizeListArray(float64(), 2) approach. I think the former will work better with FixedShapeTensor and VariableShapeTensor.

I'm currently focused in other areas at the moment, but would like to revisit Complex numbers at some point.

rok commented 2 weeks ago

> Thanks for the ping @rok -- I really should re-propose a Complex number. I'm now thinking along the lines of ComplexFloat = FixedSizeBinary(64) and ComplexDouble = FixedSizeBinary(128), rather than the original FixedSizeListArray(float32(), 2) and FixedSizeListArray(float64(), 2) approach. I think the former will work better with FixedShapeTensor and VariableShapeTensor.

Oh, interesting approach. Are there other systems that do this? Would this approach be better suited to vectorization? I suppose it would be more efficient for Parquet.

> I'm currently focused in other areas at the moment, but would like to revisit Complex numbers at some point.

Feel free to ping me when you do!

ianmcook commented 2 weeks ago

@khwilson please tag me and @rok to review when you have a PR open. Thanks!

khwilson commented 2 weeks ago

Will do!
