ethereum / pyrlp

The python RLP serialization library
MIT License
99 stars 287 forks source link

RLP fields as proper object attributes. #44

Open pipermerriam opened 7 years ago

pipermerriam commented 7 years ago

What is the problem

RLP objects are not very good at inheritance.

class MyBaseObject(rlp.Serializable):
    fields = (
        ('field_a', sedes.Binary),
    )

class InheritsFromBase(MyBaseObject):
    fields = (
        ('field_a', sedes.Binary),  # not DRY, repeated from base class
        ('another_field', sedes.Binary),
    )

There is no good pythonic way to extend a base class's fields.

How can we fix it.

Metaclass wizardry!

3r7i1c

Much the same way that django models perform magical metaclass voodoo, we can do the same. Implement a metaclass for use that finds all attributes which are valid sedes types and does all the fancy stuff that happens within the __init__ method of Serializable. In general, doing away with Serializable.fields seems like a nice cleanup.

The example above would look something like:

class MyBaseObject(rlp.Serializable):
    field_a = sedes.Binary()

class InheritsFromBase(MyBaseObject):
    field_a = sedes.Binary()  # required to maintain explicit field ordering.
    another_field = sedes.Binary()

Implementation specifics.

Field class

We need a Field class which we can use to define RLP fields. This should be a thin wrapper of sorts around a rlp.sedes object. Ideally, we can have some sort of thin shortcut for converting a sedes object to a Field object using something like a classmethod Field.from_sedes(...)

Metaclass

Unless an alternate solution is presented, the rlp.Serializable class needs to be converted to use a metaclass. This metaclass will need to introspect the current fields as well as all of the fields for the base class to construct the class.

I suggest this part be done in two steps.

The first step would be to have the metaclass construct an rlp.Serializable instance with the proper fields property, and to allow the existing internals of rlp.Serializable to populate the actual attribute setters and getters. Under this approach accessing the field properties from the class would likely result in an AttributeError as they would not be constructed until class instantiation.

The second step is to invert this, and have the metaclass (or the underlying Field class) create these attributes at the class level, likely using descriptors (which the Field class could implement). Then, the fields property on the class would only be present for backwards compatibility.

Field ordering

The field ordering for RLP objects is important. Luckily, this is easy to do using metaclasses.

https://docs.python.org/3/reference/datamodel.html#metaclass-example

However, this gets complicated in cases of inheritance. From the discussion on this issue, we've decided that the best solution to fixing this is to required that either all fields or zero field be overridden when a class is sub classed.

Documentation

The documentation will need to be updated to reflect this new API.

gsalgado commented 7 years ago

Actually, overwriting a parent's fields attribute in a rlp.Serializable does not work currently: #45

gitcoinbot commented 6 years ago

This issue now has a funding of 0.4 ETH (413.69 USD) attached to it.

pipermerriam commented 6 years ago

If you are interested in claiming this bounty please submit a proposal here on this issue as well as asking any questions you may have prior to claiming the bounty. I'd like to avoid:

  1. having multiple people work concurrently on this and
  2. having someone waste their time on an implementation for whatever reason isn't something we can merge.
fubuloubu commented 6 years ago

Brute force way for starters, unless I am misunderstanding the requirements:

class sedesBaseClass:
    pass # Whatever base class all fields have in common

class rlpSerializeable:
    # Generate fields on demand
    @property
    def fields(self):
        # I know there's a way to make this method a tuple generator
        ls = []
        for field in dir(self):
            if field is 'fields':
                continue # Avoid infinite recursion
            val = getattr(self, field)
            if isinstance(val, sedesBaseClass):
                ls.append((field, val))
        return tuple(ls)

    # or whatever function serializes the object
    def serialize(self):
        return self.fields

class A(rlpSerializeable):
    field_a = sedesBaseClass()

class B(A):
    field_b = sedesBaseClass()

class C(A):
    field_c = sedesBaseClass()

if __name__ == '__main__':
    a_fields = A().serialize()
    assert a_fields[0][0] == 'field_a'
    b_fields = B().serialize()
    assert b_fields[0][0] == 'field_a'
    assert b_fields[1][0] == 'field_b'
    c_fields = C().serialize()
    assert c_fields[0][0] == 'field_a'
    assert c_fields[1][0] == 'field_c'
    print("Works!")
pipermerriam commented 6 years ago

@fubuloubu

đź‘Ť for including backwards compatability with the self.fields. I'm going to add this as a requirement to the issue.

However, I believe this fails at the class level. A.fields would not return accurate information since the fields property will only be populated on instances of A. The only solution I know of for this is to use metaclasses but I'm open to alternative ideas if you have them.

fubuloubu commented 6 years ago

Can you provide a counter example that shows that? I'm not sure I understand what you mean.

Do these classes dynamically add/remove fields? Also, wouldn't it be able to get these fields at the classes level because they're defined as class variables? I'm not too familiar with Django to know the idea you described offhand.

pipermerriam commented 6 years ago

Actually, while working up an example to show you, it exposed some even deeper problems with your approach (hint, the isinstance(value, sedesBaseClass) doesn't work. I'd suggest spending some time exporing the problem and coming back. I'm low on time right now, otherwise I'd give you a write-up.

carver commented 6 years ago

Another issue: RLP is order-dependent. What is the right order? I assume it's that the super-class fields all come first, in definitional order. Then subclasses, in defined-order. Can we even retrieve attributes in definitional order?

dir() returns fields in alphabetical order. From help(dir):

If called without an argument, return the names in the current scope. Else, return an alphabetized list of names comprising (some of) the attributes of the given object, and of attributes reachable from it.

fubuloubu commented 6 years ago

Haha, me too. I just looked at the issue at the brute force way seemed like a good enough start. Someone else can build off this if don't get a chance to revisit.

pipermerriam commented 6 years ago

@carver looking at the docs for python3, it looks like metaclasses can maintain field ordering.

import collections

class OrderedClass(type):

    @classmethod
    def __prepare__(metacls, name, bases, **kwds):
        return collections.OrderedDict()

    def __new__(cls, name, bases, namespace, **kwds):
        result = type.__new__(cls, name, bases, dict(namespace))
        result.members = tuple(namespace.keys())
        result.base_members = [tuple(base.members) for base in bases if hasattr(base, 'members')]
        return result

class A(metaclass=OrderedClass):
    def one(self): pass
    def two(self): pass
    def three(self): pass
    def four(self): pass

class B(A):
    def five(self): pass

>>> A.members
('__module__', '__qualname__', 'one', 'two', 'three', 'four')
>>> A.base_members
[]
>>> B.members
('__module__', '__qualname__', 'five')
>>> B.base_members
[('__module__', '__qualname__', 'one', 'two', 'three', 'four')]

With that information, we would be able to maintain field ordering within a class, as well as treating new fields in subclasses as they should be appended by default with cls.__mro__ for ordering of base classes. That makes our default behavior:

That gets us most of the way there. I think the only thing we need to allow full control over field ordering is a way to dictate where in the previous class's fields a new field should go assuming they don't want to use the default ordering. Two options come to mind for dealing with this and we might want to do both (or some other better idea).

Support before and after keyword arguments for the Field class.

The Field class could take one of the keyword arguments before or after which would specify a the position that the new field should be placed in the field order.

class A:
    one = Field()
    two = Field()
    four = Field()

class B(A):
    three = Field(after='two')
    # three = Field(before='four')  # using before
    # three = Field(order='after:two')  # or maybe using a mini DSL

B.fields = ('one', 'two', 'three', 'four')

Support meta params of some sort.

Borrowed heavily from django, we could support something like the following for explicitely declaring field ordering.

class A:
    one = Field()
    two = Field()
    four = Field()

class B(A):
    three = Field()

    class Meta:
        field_order = ('one', 'two', 'three', 'four')

B.fields = ('one', 'two', 'three', 'four')
fubuloubu commented 6 years ago

Fixed the brute force case so that it recurses in Inheritance order. This method only works if there is not dynamic modification of the fields, which I think is true based on using class variables

class sedesClass:
    pass # Whatever class(es) needed

def get_fields(cls):
    fields = []
    if cls is rlpSerializeable:
        return tuple() # End recursion
    for parent in cls.__bases__:
        fields.extend(get_fields(parent))
    # Fields are either empty tuple or tuple of tuples
    if len(cls._fields) > 0 and isinstance(cls._fields[0], tuple):
        fields.extend(list(cls._fields))
    else:
        fields.append(cls._fields)
    return fields

class rlpSerializeable:
    @property
    def fields(self):
        return get_fields(self.__class__)

    # or whatever function serializes the object
    def serialize(self):
        return tuple(self.fields)

class A(rlpSerializeable):
    _fields = (
            ('field_a', sedesClass)
        )

class B(A):
    _fields = (
            ('field_b', sedesClass)
        )

class C(A):
    _fields = (
            ('c_field', sedesClass),
            ('another_field', sedesClass)
        )

class D(C):
    pass # No fields

if __name__ == '__main__':
    a_fields = A().serialize()
    assert a_fields[0][0] == 'field_a'
    b_fields = B().serialize()
    assert b_fields[0][0] == 'field_a'
    assert b_fields[1][0] == 'field_b'
    c_fields = C().serialize()
    assert c_fields[0][0] == 'field_a'
    assert c_fields[1][0] == 'c_field'
    assert c_fields[2][0] == 'another_field'
    d_fields = D().serialize()
    assert d_fields[0][0] == 'field_a'
    assert d_fields[1][0] == 'c_field'
    assert d_fields[2][0] == 'another_field'
    print("Works!")
carver commented 6 years ago

looking at the docs for python3, it looks like metaclasses can maintain field ordering.

Hah, cool that this is literally the textbook example for metaclasses.

Support before and after keyword arguments for the Field class.

I'm less keen on this style, because the ordering information doesn't semantically belong in the sedes Field.

Support meta params ... explicitely declaring field ordering.

This seems cleaner and more explicit, with a bonus of being familiar to anyone with Django experience.

pipermerriam commented 6 years ago

To summarize:

jannikluhn commented 6 years ago

Alternative proposal: Keep the fields attribute. Make fields of the parent class available in the class body of the child. Example usage:

class Parent(rlp.Serializable):
    fields = [
        ('a', big_endian_int),
        ('b', big_endian_int),
    ]

class Child1(Parent):
   """"Appends a field"""
    fields = parent_fields + [
        ('c', big_endian_int)
    ]

class Child2(Parent):
    """Prepends a field"""
    fields = [
        ('z', big_endian_int)
    ] + parent_fields

class Child3(Parent):
    """"Replaces all fields""""
    fields = [
        ('x', big_endian_int)
    ]

To make creating fields easier for more complex changes, one could add helper functions:

class Child4(Parent):
    """"Replaces some field"""
    fields = replace_fields(parent_fields, [
        ('a', binary)
    ])

class Child5(Parent):
    """"Inserts a field.""""
    fields = insert_fields_after(parent_fields, 'a', [
        ('a2', big_endian_int),
    ])

class Child6(Parent):
    """"Remove a field."""
    fields = remove_fields(parent_fields, [
        'a',
    ]

Additionally, one could replace by setting an additional field:

class Child6(Parent):
    """"Replace some fields.""""
    replaced_fields = [
        ('a', binary)
    ]

If we turn fields from a list of tuples to an OrderedDict cytoolz or similar libraries might already have helpers like replace_fields or insert_fields_after.

Implementation would be straightforward:

class SerializableMeta(type):

    @classmethod
    def __prepare__(metacls, name, bases):
        namespace = {}
        # look for `fields` in bases (if there are multiple, either raise an error or take the first)
        # if found set `namespace['parent_fields'] = bases[0].fields`
        return namespace

    def __new__(cls, name, bases, namespace):
        # look for `'replaced_fields'` in namespace
        # if found replace fields in `namespace['fields']` accordingly
        return result

Obviously, this doesn't use Django-style field properties. So if that's a hard requirement, this approach wouldn't work. I personally prefer the old style though (because I find it more explicit and clearer), but I'm probably biased in that regard.

pipermerriam commented 6 years ago

Obviously, this doesn't use Django-style field properties....

My intent with this idea was to make overriding existing fields easy. I didn't at the time consider the field ordering issue which makes the solution less perfect, however, I think that the solutions that are currently on the table to manage ordering are good-enough.

I also think that it'd be fine to do something the way you propose with some helpers. I don't like the solution as much because it feels less elegant but elegance isn't a hard requirement. The field class approach loses some of its elegance when futzing with field ordering comes into play so that's one mark against it, however, the helper functions that operate on the fields attribute as you propose maintains a certain level of maybe-hard-to-read-verbosity. Neither is ideal but I'm still leaning towards django-style-fields. Curious to hear anyone else's thoughts on one vs the other.

carver commented 6 years ago

If a second RLP renames or reorders fields of a first, then I think the second RLP is not a strict subclass of the first, and we probably shouldn't encourage treating it as one. Instead, we might do something like

COMMON_FIELDS = (
    ('a', big_endian_int),
    ('b', big_endian_int),
) 

class Serial1(rlp.Serializable):
    _fields = COMMON_FIELDS

class SerialChild(Serial1):
    """
    Appends a field. All the parent fields are present, in the same order.
    """
    _fields = ( 
        ('c', big_endian_int),
    )

class Serial2(rlp.Serializable):
    """Prepends a field"""
    _fields = (
        ('z', big_endian_int),
    ) + COMMON_FIELDS 

class Serial3(rlp.Serializable):
    """"Replaces all fields""""
    _fields = (
        ('x', big_endian_int), 
    ) 

I agree that the Django-style approach is more elegant, especially for single RLP class definitions. Since I think single explicitly-defined RLP classes are the common case (and maybe we should encourage them to be), I prefer the Django-style approach.

class Serial1(rlp.Serializable):
    a = big_endian_int
    b = big_endian_int

class SerialChild(Serial1):
    """
    Appends a field. All the parent fields are present, in the same order.
    """
    c = big_endian_int

class Serial2(rlp.Serializable):
    """Prepends a field"""
    z = big_endian_int
    a = big_endian_int
    b = big_endian_int

class Serial3(rlp.Serializable):
    """"Replaces all fields""""
    x = big_endian_int

That little bit of repetition on a and b in Serial2 seems worth the benefit of being to be easy to read/understand. If Serial2 is a common case, I'd love to see some real-life examples.

pipermerriam commented 6 years ago

The more I think about it, the more I think that RPL object inheritance for the purpose of changing fields won't be a major use case. This is based on how things are setup in Py-EVM.

One option we haven't considered is requiring that either all fields be overridden in subclasses or none at all. Basically, rather than having a subtle API for ordering fields that we understand but is probably not intuitive, we can instead raise an exception during class construction if not all of the fields were overridden. This forces explicit field ordering in a manner that should make all RLP objects easy to know what the field ordering is since you just have to trace up the inheritance chain to the first object which defines any fields.

carver commented 6 years ago

One option we haven't considered is requiring that either all fields be overridden in subclasses or none at all.

:+1: I'm in favor of this

pipermerriam commented 6 years ago

I've updated the main issue to reflect what I believe is the current consensus for how this should be implemented.

vs77bb commented 6 years ago

@pipermerriam Is this one you still want to be worked? I'll resend it out on Slack / Twitter, if so

pipermerriam commented 6 years ago

@vs77bb yes, this is an issue that would be a big nice-to-have improvement to the python RLP API.

owocki commented 6 years ago

hey @pipermerriam,

we are in the middle of a migration between smart contracts and in order to migrate away from all the old code paths, we are cleaning up some of the straggler bounties.

we're sorry this one didnt work out for ya.

do you want to kill the bounty here to recover your funds? it'll take only 30 seconds... and itll mean you get that 0.4 ETH back.

any questions, let me know. KSO

pipermerriam commented 6 years ago

Thansk @owocki

In other news: this was halfway done with the merger of https://github.com/ethereum/pyrlp/pull/62

Next step would be to deprecate fields attribute based declaration of what the RLP fields are in favor of a basic Field class. This should be quite trivial to do now that the metaclass is already in place.

owocki commented 6 years ago

happy to put a bounty up for ya if you want. is it worth creating a clean issue ?

Bhargavasomu commented 5 years ago

@pipermerriam I was going through this issue and wanted to take a shot at it. Is this issue still up (or) has been resolved in the past? If it's still up, the whole issue at this point of time is boiled down to just Implement Inheritance by appending right? Or is there still more to it?

pipermerriam commented 5 years ago

I don't think it's worth prioritizing over other efforts. Maybe do it in the SSZ repo instead and we can backport the implementation to pyrlp? Probably of higher value in the SSZ repo and lower risk.

onyb commented 5 years ago

I feel that a solution using metaclass is an overkill, and should only be used when all other options have been exhausted. I propose a new API for Serializable classes, by using class decorators.

This is how we define a Serializable class today:

class Base(rlp.Serializable):
    fields = (
        ('field_a', sedes.Binary),
    )

class InheritsFromBase(MyBaseObject):
    fields = (
        ('field_a', sedes.Binary),
        ('another_field', sedes.Binary),
    )

This is the API I propose:

@rlp.serializable
class Base:
    field_a: sedes.Binary

@rlp.serializable
class InheritsFromBase(MyBaseObject):
    another_field: sedes.Binary

From what I have understood so far, we have the following requirements / constraints with such classes:

  1. We want RLP fields to be instance attributes, instead of properties.
  2. We want to maintain explicit field ordering.
  3. We want our classes to be inheritance-friendly.

Proposed solution

If it's not obvious already from the proposed API, I intend to use dataclasses which is a new feature in Python 3.7. Essentially, the rlp.serializable decorator is a wrapper around dataclasses.dataclass that rewrites the class with auto-generated __init__ and __repr__ methods, preserving the order in which they have been defined.

Caveats

Field ordering

According to the Python docs, dataclass looks through all of the class’s base classes in reverse MRO. This may not be the behavior that we want, which can be demonstrated by the following example:

from dataclasses import dataclass
from rlp import sedes

@dataclass
class Base:
    a: sedes.Binary
    b: sedes.Binary

@dataclass
class Inherited(Base):
    z: sedes.Binary
    a: sedes.BigEndianInt

>>> Base.__dataclass_fields__.keys()
dict_keys(['a', 'b'])

>>> Inherited.__dataclass_fields__.keys()
dict_keys(['a', 'b', 'z'])

This could be a deal breaker if:

Python 3.7 only

dataclasses have not been backported to the Python 3.6 standard library. However the original author maintains a backport for Python 3.6 as a package.

The alternative

Provided that we're okay with a third-party dependency, we can solve both the caveats mentioned above, by using attrs. See example below:

import attr
from rlp import sedes

@attr.s
class Base:
    a: sedes.Binary = attr.ib()
    b: sedes.Binary = attr.ib()

@attr.s
class Inherited(Base):
    z: sedes.Binary = attr.ib()
    a: sedes.BigEndianInt = attr.ib()

>>> [each.name for each in Base.__attrs_attrs__]
['a', 'b']

>>> [each.name for each in Base.__attrs_attrs__]
['b', 'z', 'a']

Conclusion

I'll conclude with two questions:

This is also a good opportunity to remove some of the black magic in this file.

jannikluhn commented 5 years ago

I'm not strictly against a decorator API, but in my opinion defining methods in a metaclass is the cleaner approach than monkey patching them in a decorator. I agree that serializable.py is currently quite confusing, but I don't think that's inherently because of the use of metaclasses.

I don't like the way inheritance and field ordering is handled in both attrs and dataclasses as it's definitely not what one would expect. If we replace that, would there be enough of them left that justify the additional dependency?

onyb commented 5 years ago

defining methods in a metaclass is the cleaner approach than monkey patching them in a decorator.

In my experience, meta-programming produces code that's hard to debug, less readable, and often difficult to extend. Some of these problems are solved to a great extent by attrs. I would like to quote from their philosophy, which is something we can try to internalise for the PyRLP project.

You define a class and attrs adds static methods to that class based on the attributes you declare. The end. It doesn’t add metaclasses. It doesn’t add classes you’ve never heard of to your inheritance tree. An attrs class in runtime is indistiguishable from a regular class: because it is a regular class with a few boilerplate-y methods attached.

Personally, I would prefer an explicit metaclass= argument in the RLP classes, like this:

class MyBaseObject(rlp.Serializable, metaclass=Wizardy):  # same as @Wizardry decorator
    field_a = sedes.Binary()

over this:

class MyBaseObject(rlp.SerializableClassThatUsesMetaclassWizardy):
    field_a = sedes.Binary()

because there are less number of clever things happening behind the scenes, and there is more transparency to someone extending RLP classes. The fact that an inherited class can change the very structure of the derived class seems dangerous to me.

I don't like the way inheritance and field ordering is handled in both attrs and dataclasses as it's definitely not what one would expect.

Can you please help me understand what an ideal inheritance behaviour would look like?

jannikluhn commented 5 years ago

Can you please help me understand what an ideal inheritance behaviour would look like?

Ideally, order and set of fields is as easy as possible to define and to determine. I'm not certain that it's the ideal solution, but I like what Piper suggested above:

One option we haven't considered is requiring that either all fields be overridden in subclasses or none at all. Basically, rather than having a subtle API for ordering fields that we understand but is probably not intuitive, we can instead raise an exception during class construction if not all of the fields were overridden. This forces explicit field ordering in a manner that should make all RLP objects easy to know what the field ordering is since you just have to trace up the inheritance chain to the first object which defines any fields.

I guess this would be quite straightforward to implement in both the metaclass and the third party library approach.

pipermerriam commented 5 years ago

My intuition is that using attrs is likely to add performance overhead so it's worth evaluating what that overhead is because this library does have a need for high performance.

I think we could spin/bikeshed on design/approach for a while here and not really get anywhere because there's a lot of subjectivity as well as different angles to evaluate each approach. I'd advocate for you to do an MVP implementation of whatever approach you think is best (taking into consideration all of the things that have been discussed in this issue) at which point we can evaluate something more concrete.

pipermerriam commented 5 years ago

Also, I am still reasonably strongly of the opinion that a metaclass approach is going to be adequate, maintainable, and intuitive enough.

carver commented 5 years ago

One option we haven't considered is requiring that either all fields be overridden in subclasses or none at all.

:+1: I'm still in favor of this