allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0
11.73k stars 2.24k forks source link

Completely remove custom from_params methods #3389

Closed sai-prasanna closed 4 years ago

sai-prasanna commented 4 years ago

Is your feature request related to a problem? Please describe.

Every registrable class has a classmethod from_params. We override it in cases where we want an alternative way to construct a object for the class.

Since we also use from_params as a dependency resolver, we have to inspect params object to satisfy dependencies of our main constructor.

Each class gets its dependency configuration and uses it to construct the object.

I think exposing the params to the class directly is kind of anti pattern. We will not be able to share same dependency object across the object sub graph. We already face this with vocabulary object, which we now handle by explicitly passing it down the subgraph along with configuration. But for cases like sharing layers/module etc, we need a better approach.

Describe the solution you'd like

Remove from_params from Registrable. We still retain how we write configuration jsonnet since it's convinient. But to construct the graph, we do it by constructing the object graph from leaf, up to the root, explicitly. While possibly warning if we hit a cycle.

Now, we can allow dependencies in the configuration to be satisfied by specifying a key to shared dependency in the object graph. Now we can even allow global singleton, shared layers etc.

And overridden from_params can be converted to alternate constructor that is used by default to construct the object nodes if its present.

Describe alternatives you've considered

Haven't considered the full complexitiy of implementing such a full blown dependency resolver.

DeNeutoy commented 4 years ago

Interesting idea, but it has some problems:

  1. If we allow shared references, from_params get's hugely complicated and very easy to shoot yourself in the foot if we implicitly share some references (i.e what we deem to be global state, such as the vocab) but don't share others, as at some point this will almost certainly clash with someone's intuition.

  2. We don't really want to be in the business of managing a dependency resolver.

  3. I'd like to encourage more scripted use of allennlp so that the first thought when you want to pass an object to multiple classes isn't "how can I make this work with from_params", but "this is easy to do if I write my own script". Making from_params more complicated does not help with this.

Having said that, we've had previous very good improvements to from_params, such as making it automatic for most classes which have type hints. So if you think the idea's good, a prototype version would be most welcome (but absolutely not guaranteed to be accepted as a contribution).

Thanks Sai!

sai-prasanna commented 4 years ago

To think out aloud, from_params would be replaced with a load() classmethod, where you declare your dependencies. say Embedding would provide 'pre_trained_file' option there, which it doesn't expose in the ordinary constructor. And it might also have Vocabulary there. So the classes will not be aware of what is global and what is not.

Now in the config json, we can allow people to provide keys as dependencies.

{ 
   "model': {
       "type": "some_model",
       "source_embedder": {
          "from": "model.target_embedder"
       },
      "target_embedder": {
           "type": "embedding",
           "embedding_dim": 120,
           "vocab": { "from": "vocabulary"}
       }
    }
   "vocabulary": {

    }
}

So if "vocabulary": { "from": "vocabulary"} is not provided, auto-filling it would clash with intuition that unspecified types of constructor are resolved by creating the default implementation. But isn't this how it already works? vocab is implicitly filled by from_params.

We can even generate the { "from": "vocabulary" } as a part of parsing the configuration and save the full specification in the model serialization directory, so that the resolution is explicit? And document that in the constructors?

  1. With regards to resolver, I was thinking of simple cycle detection in configuration and constructing objects from leaf to root.

  2. I thought likewise, I might prefer to do setup in python instead of jsonnet. One thing that prevents me is from_params also acts as a factory. i.e. In Embedding, we can create it with pre_trained_files, but this is not a constructor argument in from_params. Maybe this can be refactored into say a .load() method? This would eliminate fiddling with Params object when you want a factory to initialize the object instead of the actual constructor.

matt-gardner commented 4 years ago

@sai-prasanna, can you give a concrete use case for where you want this? More than just "sharing layers" - what exactly do you want to share that you have a hard time doing now? It's really hard to design a solution to a problem that is under-specified.

sai-prasanna commented 4 years ago

@matt-gardner Other than sharing layers/embedders, I don't have a concrete use case. I was thinking of abstract cases where someone might require sharing something across DatasetReader and model.

Having thought this through, I affirm @DeNeutoy 's assessment that programmatic interface can help people who find the current object construction not flexible enough for their use case.

Having said that, my small gripe is for the programmatic interface for classes that override from_params, we have coupled the object construction logic with configuration resolution. People would have to replicate code in the overriden from_params.

Is this worth the effort to separate these out and have a .factory classmethod which gets resolved in the same way as the current constructor, and no one has to override from_params unless to support backward compatibility of our configurations. One advantage would be we can now document the extra configuration now parsed in from_params as part of the factory method's signature.

matt-gardner commented 4 years ago

I'm not sure what you mean by the .factory method. What goes there?

It seems like there are two broad classes of things that have issues with our current from_params:

  1. TrainerPieces - here we have a bunch of logic for constructing objects inside of from_params, and saving them to a NamedTuple.
  2. Embedding and Vocabulary (and similar) - here we have different paths to construct the object, and we hide logic inside of from_params to route to the different construction paths.

Are there any other major issues with how from_params works? Basically everywhere else it's entirely transparent, just using the default implementation. So, if we want to improve from_params, we should figure out how to improve each of those problematic areas.

  1. It seems like in these cases we can just make this a constructor with explicit dependencies. That is, change the existing from_params method to an __init__ with the correct arguments. Then the problem just goes away. I don't know why we didn't do this to begin with (maybe this object was written before we had auto from_params?).
  2. a. For these, we could just make the different construction paths (e.g., for Embedding, either initializing from a file or randomly initializing the weights) different subclasses, with different registered types. Then those constructors would have all the logic needed to construct their state and pass it to super(), and the from_params again goes away, because the auto from_params works. b. As a tweak on the above, we could add an optional parameter to Registrable.register, which is the classmethod to use as the constructor. This way we could do something like this:
@TokenEmbedder.register("embedding-from-file", method="from_file")
@TokenEmbedder.register("embedding-random-init", method="random_init")
class Embedding:
    ...

And now we don't have to deal with subclasses, but we get explicit constructors everywhere, with obvious type signatures and no hidden parameters.

2b seems to me like a no-brainer that we should definitely do, as it (along with 1) would let us remove all of our custom from_params. Any problems with it that I'm not seeing? The downside is that we wouldn't register just "embedding" anymore, or we'd have to have some kind of backwards-compatible shim to make sure we don't break everyone's code. The shim would be relatively straightforward, though, just peeking at a parameter and then passing it off to the appropriate other registered method. You could maybe also implement the shim as just another @TokenEmbedder.register("embedding", method="shim"), if you're careful about extra parameters and passing them along (this might not work with current FromParams, though; not sure).

matt-gardner commented 4 years ago

The other major issue with how from_params works is how we handle super class parameters. It would be nice to be able to add a parameter to a base class without having to change code for all subclasses. This one's a bit tricky to get right, though.

matt-gardner commented 4 years ago

Oh, I think for 1 above, the reason we didn't make the from_params an __init__ is because the objects have sequential dependencies. You have to construct the Vocabulary before you construct the Model. There are a few ways we could handle that, but maybe the best is to add a Lazy class, so you can do something like:

class TrainerPieces:
    def __init__(self, ..., vocab: Lazy[Vocabulary], model: Lazy[Model]):
        ...  # instantiate datasets and such
        self.vocab = vocab.construct(additional_vocab_args)
        self.model = model.construct(vocab)

Lazy would need to override from_params to just save the parameters that it gets, then call the generic's from_params method later with the additional arguments that it gets in construct. What do you think?

matt-gardner commented 4 years ago

Sorry for all the comments, I'm commenting as I'm thinking about this. We might want to combine the two solutions for TrainerPieces:

@BaseTrainerPieces.register("trainer-pieces", method="from_lazy_objects")
class TrainerPieces(NamedTuple):
    model: Model
    ...
    @classmethod
    def from_lazy_objects(cls, ... vocab: Lazy[Vocabulary], model: Lazy[Model]):
        ...

This way, if you're writing your own script, you don't have to try to deal with instantiating a Lazy[Vocabulary], which would be a nightmare, you can just use the constructor. But we still get the benefits of delayed construction and transparent arguments and typing.

matt-gardner commented 4 years ago

In the end this is similar to @sai-prasanna's factory method idea, it's just explicitly registering them instead of the constructor, and then using our existing from_params logic.

sai-prasanna commented 4 years ago

@matt-gardner This is great!, Thanks for hashing this out.

There is a problem in making Lazy as Registrable though. In your conception of Lazy, I believe the Vocabulary object would be passed to Lazy class, through a static/class method, and Lazy can override from_params to resolve the type by just using the type as a key and already set object as a value. There is a limitation in the current python typing Generics system preventing access to 'T' in 'Lazy[T]' inside classmethods. https://github.com/python/typing/issues/629 .

One hack around this would be, instead of making Lazy as a Registrable we could use it as a Generic Wrapper (Lazy[Vocabulary]) type which we can inspect in runtime, to get the inner type and resolve it from extras.


from typing import Generic, TypeVar
T= TypeVar('T')
# Wrapper without any body
class Lazy(Generic[T]): ...

# extras would be Type -> instance dictionary instead of str -> instance

def construct_arg(
    cls: Type[T], param_name: str, annotation: Type, default: Any, params: Params, **extras
) -> Any:
    name = param_name
    origin = getattr(annotation, "__origin__", None)
    args = getattr(annotation, "__args__", [])

    # The parameter is optional if its default value is not the "no default" sentinel.
    optional = default != _NO_DEFAULT

    # Some constructors expect extra non-parameter items, e.g. vocab: Vocabulary.
    # We check the provided `extras` for these and just use them if they exist.
    # if name in extras:
    #    return extras[name]
    if origin == Lazy:
       return extras[name]

I think we can use the parameter name to resolve the parameters which are annotated with Lazy types. This would be beneficial in case we want multiple instances of same Lazy type to be resolved. Maybe we can also provide a hook to add additional values to extras, thereby solving custom globals problem (if we allow additional objects to be added to extras via config)

Now we can even resolve multiple vocab objects by just naming the parameters in the factory/constructor differently and annotating them as Lazy[Vocabulary]` provided the parameter names are constructed for the extras borehand.

And a nitpick with the name Lazy. I rather think it could be Context or Singleton or Global ?

sai-prasanna commented 4 years ago

One small problem remains, in the all the classes we would have to change vocab: Vocabulary into Lazy[Vocabulary] which might cause mypy errors :( as we are abusing the type system to add a meta info. I don't think we can do something like vocab: @Lazy.annotate(Vocabulary) in python.

sai-prasanna commented 4 years ago

~Another solution without abusing the type system would be to have an static optional attribute in the class.~ ~class TrainerPieces:~ ~_lazy_initalize = {~ ~"from_lazy_objects" : ["vocab", "model"]~ ~}~

Maybe we can register which attributes are to be filled lazily in the register method?

@BaseTrainerPieces.register("trainer-pieces", method="from_lazy_objects", lazy=["vocab", "model"])
matt-gardner commented 4 years ago

I don't think that passing in what is lazy in the register() method will work - the inherent problem is that the objects get constructed inside the method, after their dependencies have been constructed. This means that you need to only pass in their parameters, not the objects themselves, so you need something different so the types work out. I do think that Lazy or Incomplete or something is a better name here; Singleton and Global don't actually match what's going on. This is a different problem than you were originally going for with things getting re-used in multiple places.

And for your point about a generic Lazy - we are able to successfully get the types for things like List[Tuple[int, MyType]], so I'm hoping we'll be able to figure this out. At the least, we can probably put some special-cased logic in FromParams.from_params to check for an instance of Lazy (like we do for List, etc.). We should have access to the type parameter there, and we can pass it along to the Lazy's constructor. We probably don't even need to have Lazy implement from_params, because we can special case its logic in FromParams.from_params.

Now that I'm back from EMNLP, my schedule is feeling a little less compressed, and I might have time to implement this myself this quarter. I'm going to try.

sai-prasanna commented 4 years ago

@matt-gardner I don't understand your point. When parameters get constructed, they are constructed inside from_params class method, but for objects like Vocabulary they are constructed outside and passed via "**extras".

https://github.com/allenai/allennlp/blob/3dda5ac9fcf390d8b83eb855249360711707c11c/allennlp/common/from_params.py#L24

So when constructing a object in the default implementationFromParams.from_params, we can effectively access the registry to get which parameters are to be resolved from extras. We currently resolve it implicitly based on the name, i.e. any constructor parameter named "vocab" will automatically get resolved to the vocab passed via extras. And as extras is passed down the object tree hierarchy, we are able to resolve it everywhere. (It makes all the objects in the extras are actually singletons). My idea is just to make it explicit. Am I missing something else (usability?) in this design?

matt-gardner commented 4 years ago

I'm trying to solve a different issue. I think the way we handle vocab is fine - the object is constructed before MyObject.from_params, and gets passed in. The places where we need to override from_params are (1) when an object gets constructed inside MyObject.from_params (because of a sequential dependency between arguments to MyObject.from_params), and (2) when we implicitly have multiple constructors with different arguments to each. What I've outlined above fixes those two issues, leaving how vocab is handled the same. With those two issues gone, we basically never have to override from_params, it's just behind-the-scenes magic that doesn't get in the way of programmatic usage, and so the issues you brought up in the original post largely go away. Yes, it's still hard to share layers, but we already have a partial solution to that, and I'm not convinced it's a significant pain point like custom from_params is.

sai-prasanna commented 4 years ago

Great, thanks @matt-gardner for discussing this and taking it up.

matt-gardner commented 4 years ago

Thanks @sai-prasanna. I want to keep this open, though, until all of the custom from_params methods have actually been removed (except perhaps in some custom wrapper things that we're doing, which aren't user-facing anyway). I think I've added all of the pieces necessary to do this now, but we won't know for sure until all of the methods are actually gone.