explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License
30.18k stars 4.4k forks source link

Language.factory cannot be a subclass without nlp, name args #10611

Open BramVanroy opened 2 years ago

BramVanroy commented 2 years ago

How to reproduce the behaviour

Try to import/loading SubClass in this example.

from dataclasses import dataclass

from spacy import Language

@dataclass
class SuperClass:
    nlp: Language
    name: str

@Language.factory("dummy")
class SubClass(SuperClass):
    def __init__(self, *args, sub_arg: bool = True):
        super().__init__(*args)
        self.sub_arg = sub_arg

This will lead to the following error

File "dummy.py", line 13, in <module>
    class SubClass(SuperClass):
  File "spacy\language.py", line 490, in add_factory
    raise ValueError(Errors.E964.format(name=name))
ValueError: [E964] The pipeline component factory for 'dummy' needs to have the following named arguments, which are passed in by spaCy:
- nlp: receives the current nlp object and lets you access the vocab
- name: the name of the component instance, can be used to identify the component, output losses etc.

However, from the surface it would seem that this should just work: ultimately, SubClass does require nlp and name through its super SuperClass. The reason that this errors, to me, is get_arg_names which only retrieves the arguments for the current class - not its tree.

We could solve this by collecting all args by traversing the class's __mro__. This, for instance, does work:

param_names = []
for cls in SubClass.__mro__:
    param_names += inspect.signature(cls).parameters.keys()
# ['args', 'sub_arg', 'nlp', 'name']

Your Environment

Info about spaCy

adrianeboyd commented 2 years ago

It's useful to have this example in the issue tracker in case anyone runs into the same problem, but I don't think we want to add more complicated handling when it's relatively straight-forward to have nlp and name arguments in the subclass.

BramVanroy commented 2 years ago

I'll just use @polm's recommendation to use a separate constructor function for my factory, so this problem won't occur.

That being said, simply repeating the arguments of the super class of course defeats the point of subclassing in the first place (DRY). I see that get_arg_names is only used in this case, so the only change that would need to be made to support this behavior is changing that function to:

def get_arg_names(func: Callable) -> List[str]:
    """Get a list of all named arguments of a function (regular,
    keyword-only).
    func (Callable): The function
    RETURNS (List[str]): The argument names.
    """
    try:
        return [key for cls in func.__mro__ for key in signature(cls).parameters.keys()]
    except AttributeError:
        return list(signature(func).parameters.keys())

which does not break any existing behaviour, does not change its usage, and does extend usability to extended object hierarchies.

Minimal example

from inspect import signature
from typing import Callable, List
from dataclasses import dataclass

from spacy import Language

@dataclass
class SuperClass:
    nlp: Language
    name: str

class SubClass(SuperClass):
    def __init__(self, *args, sub_arg: bool = True):
        super().__init__(*args)
        self.sub_arg = sub_arg

def get_arg_names(func: Callable) -> List[str]:
    """Get a list of all named arguments of a function (regular,
    keyword-only).
    func (Callable): The function
    RETURNS (List[str]): The argument names.
    """
    try:
        return [key for cls in func.__mro__ for key in signature(cls).parameters.keys()]
    except AttributeError:
        return list(signature(func).parameters.keys())

def my_func(arg1, arg2):
    pass

if __name__ == '__main__':
    print(get_arg_names(my_func))
    print(get_arg_names(SuperClass))
    print(get_arg_names(SubClass))