Datatamer / tamr-client

Programmatically interact with Tamr
https://tamr-client.readthedocs.io
Apache License 2.0
11 stars 25 forks source link

Builder pattern / implementation #252

Closed pcattori closed 4 years ago

pcattori commented 5 years ago

💬 RFC

We have PRs that implement the builder pattern for some resources (#237 , #249 , #250 ). We should decide on overall design for how we want to apply the builder pattern in our codebase.

E.g.

🔦 Context

💻 Examples

nbateshaus commented 5 years ago
  1. Yes, builder should be accessible via MyClass.builder() on all classes for which Builder is appropriate.
  2. Builder.build() should be the one place where the logic for going from MyClass's API to the Tamr server's JSON API is encapsulated. That is, MyClassBuilder.build() should produce the JSON that can be sent to the server.
  3. We should use REST verbs; they're already ubiquitous in the code, and, as this is a client for a REST API, they are the best description for what's actually going on. MyClassBuilder.put() should be:
    def put(self):
    data = self.client.put(self.api_path, json=self.build()).successful().json()
    return MyClass(self.client, data, self.api_path)
juliamcclellan commented 5 years ago

@charlottemoremen, @Nelson-Andrade, and I have two possible options:

The first is to have a BaseBuilder class which handles overriding properties and has an abstract put. Each resource would have a concrete builder with with_property functions that created new instances of the builder class.

class X:
    def __init__(self, client, data, alias):
        self.client = client
        self._data = data
        self.api_path = alias

    def a(self):
        return self._data.get("a")

    def build():
        return XBuilder(self)

class BaseBuilder:
    def __init__(self, contained, **kwargs):
        self._contained = contained
        self.__dict__.update(**kwargs)

    def __getattr__(self, name):
        return getattr(self._contained, name)

    @abstractmethod
    def _build(self):
        pass

    @abstractmethod
    def put(self, resource_class):
        new_data = self.client.put(self.api_path, json=self._build()).successful().json()
        return resource_class(self.client, new_data, self.api_path)

class XBuilder(Builder):
    def with_a(self, new_a):
        return XBuilder(self, a=new_a)

    def _build(self):
        return {
            "a": self.a
        }

    def put(self):
        return super().put(X)

The second is to not have separate builders, but to put all of the above into each resource itself.

class X:
    def __init__(self, client, data, alias):
        self.client = client
        self._data = data
        self.api_path = alias

    def a(self):
        return self._data.get("a")

    def with_a(self, new_a):
        new_x = X(self.client, self._data, self.api_path)
        new_x.__dict__ = dict(self.__dict__)
        new_x.__dict__["a"] = new_a
        return new_x

    def _build(self):
        return {
            "a": self.a
        }

    def put(self):
        new_data = self.client.put(self.api_path, json=self._build()).successful().json()
        return X(self.client, new_data, self.api_path)
nbateshaus commented 5 years ago

In the first example, put() is very much a Template Method. The normal implementation of Template Method uses abstract methods to delegate; so, instead of passing resource_class to the method, it would have a _resource_constructor(*args, **kwargs) abstract method, which it would invoke at the appropriate point in execution.

build() should also be part of the public interface; it's generally useful, e.g. to post() on related collections.

So:

    @abstractmethod
    def build(self):
        pass

    def put(self):
        new_data = self.client.put(self.api_path, json=self.build()).successful().json()
        return self._resource_constructor(self.client, new_data, self.api_path)

    @abstractmethod
    def _resource_constructor(self, *args, **kwargs):
        pass

class XBuilder(Builder):
    def build(self):
        return {
            "a": self.a
        }

    def _resource_constructor(self, *args, **kwargs):
        return X(*args, **kwargs)
nbateshaus commented 5 years ago

In the second example, we've combined the class representing the resource, and the builder. This might be fine, but:

  1. It moves the resource class away from its single purpose, which was to represent a resource fetched from the server.
  2. It broadens the interface of the resource class, making it more difficult to understand.
nbateshaus commented 5 years ago

I would be remiss if I did not also call out the desire to favor composition over inheritance for the Template Method. Template combines well with Strategy, so that when an XBuilder instance is constructed, it can plug in the implementation of _resource_constructor rather than implementing a parent class' abstract method. BUT, since XBuilder wants a bunch of with_* methods that are specific to X, it makes sense to have a Builder class that is specific to X. If we are subclassing Builder anyway, then mixing composition with inheritance just complicates things, and we should stick with inheritance.

nbateshaus commented 5 years ago

Motivating use cases:

  1. Toggle mlEnabled on a bunch of AttributeConfigurations. I have a dataset with 595 attributes; I tried enabling ML for all of them, but that was wrong. Now I want to toggle it off for 590 of the 595 attributes.
  2. Set tags and update the description on datasets that match a pattern. For example, go though all datasets; any that has a geometry attribute, I want to tag geo. I also want to update the description of these datasets to say "Contains geospatial features."
pcattori commented 5 years ago

Builder-like pattern for creating resource specs.

Approach:

class MyResourceSpec:
    def __init__(self, data):
        self.data = data

    @staticmethod
    def from_my_resource(my_resource):
        return MyResourceSpec(deepcopy(my_resource._data))

    def with_property_a(self, a):
         return MyResourceSpec({**self.data, 'a': a})

    def with_property_b(self, b):
        return MyResourceSpec({**self.data, 'b': b})

    def to_dict(self):
        '''Return a copy of `self.data`'''
        return deepcopy(self.data)

    def put(self):
        ...

Additionally, we can add a spec() method to MyResource to initialize a spec:

class MyResource:
    ...

    def spec(self):
        return MyResourceSpec.from_my_resource(self)

Usage:

my_resource = MyResource(...)
my_resource.spec().with_property_a('🍔').put()

TODO

pcattori commented 5 years ago

From reviewing #262 , I have a couple observations:

  1. The common case is to keep client and api_path the same while only updating data.
  2. If you already have a dict/JSON mapping for "property -> updated value", you have to deconstruct your mapping into 1-by-1 with_* calls.

Proposed solution for (1)

class MyResourceSpec:
    def __init__(self, client, api_path, data):
        self.client = client
        self.api_path = api_path
        self.data = deepcopy(data)

    @staticmethod
    def of(my_resource):
        # note: MyResource.spec() needs to be updated to call `MyResourceSpec.from`
        return MyResourceSpec(my_resource.client, my_resource.api_path, my_resource._data)

    def from_data(self, data):
        return MyResourceSpec(self.client, self.api_path, data)

    def with_property_a(self, a):
        return self.from_data({**self.data, 'a': a})

    ...

Proposed solution for (2)

We can use solution (1) to deal with this easily:

r = MyResource(a=1, b=2, c=3)
r.a # -> 1
r.b # -> 2
r.c # -> 3
updates = { 'a': 'x', b: 'y' }
r.spec().from_data({**r._data, **updates}).put()

Since we don't want users touching _data directly, we could offer a convenience method, update:

r = MyResource(a=1, b=2, c=3)
r.a # -> 1
r.b # -> 2
r.c # -> 3
updates = { 'a': 'x', b: 'y' }
r.spec().update(updates).put()

I'm less thrilled with (2). Maybe we don't need the power of batched updates....

juliamcclellan commented 5 years ago

What @nbateshaus and I discussed about the contracts of with_* functions: If a property of a resource is a complex type (like an attribute type), it should have its own spec class, and a user should provide an instance of this spec to with_*. If a property is not a complex type (e.g. a string), it would not have a spec class, and the user would provide the simple representation directly. If the property is a list/set of resources, the user would provide a list of these resources, represented as determined by the above.

pcattori commented 5 years ago

Design finalized and implemented (see #279 for an example)

pcattori commented 5 years ago

Superceded by #309

pcattori commented 5 years ago

@nbateshaus , you previously state:

MyClassBuilder.build() should produce the JSON that can be sent to the server.

Should .build() (or whatever we call it) be responsible for checking invariants such that it produced JSON that can be sent to the server that the server will be happy with?

Or do you think its better to let the server handle validation logic and let the server complain when the spec is invalid?

E.g. for an AttributeTypeSpec what if you don't set an inner_type, but you do pass in sub attributes? Should we just construct the JSON and let the server tell us that its wrong? or should we try to catch the invalid JSON before sending it to the server?

pcattori commented 4 years ago

New proposal: create and update functions should specify requirements for each field (without needing a spec class).

E.g. for attributes:

# Attribute.py
def create(
    session: Session,
    dataset: Dataset,
    *,
    name: str,
    type: AttributeType.AttributeType,
    is_nullable: bool,
    description: Optional[str] = None,
) -> Attribute:
    ...

def update(
    session: Session,
    attribute: Attribute,
    *,
    # TODO can is_nullable be updated?
    is_nullable: Optional[bool] = None,
    description: Optional[str] = None,
) -> Attribute:
    ...

Then users can call like:

attr = Attribute.create(
    tamr.session,
    name='address',
    type=...,
    is_nullable=False,
    description='US-based address'
)
...
attr = Attribute.update(
    tamr.session,
    attr,
    description='US-based address with postal code'
)

Benefits:

pcattori commented 4 years ago

Going forward with proposal above^^^.

Feel free to reopen this issue for further discussion if needed.