explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License
30.18k stars 4.4k forks source link

SpanGroups.setdefault may return the given default arg--not necessarily the SpanGroup #10670

Closed single-fingal closed 2 years ago

single-fingal commented 2 years ago

Problem:

When calling .setdefault() on a SpanGroups object (e.g. Doc().spans),

This can be particularly problematic if the return value of setdefault() is operated upon as if it was the intended SpanGroup (and the aforementioned conditions were true).

For example:

(Executed with:

)

>>> import spacy
>>> from spacy.tokens import Doc, Span, SpanGroup
>>> nlp = spacy.blank('en')
>>> doc = nlp('This is a doc.')
>>> doc.spans
{}
>>> span1 = Span(doc, 0, 1)
>>> span2 = Span(doc, 1, 2)

>>> # Initial update (key not present):
>>> doc.spans.setdefault('mygroup', []).append(span1)
>>> doc.spans
{'mygroup': []}
>>> # ---------------------------------------------
>>> # Expected:
>>> #   set(doc.spans['mygroup']) == {span1}
>>> # ---------------------------------------------
>>> isinstance(doc.spans['mygroup'], SpanGroup) # sanity check
True

>>> # Subsequent update (key present):
>>> doc.spans.setdefault('mygroup', []).append(span2)
>>> doc.spans
{'mygroup': [is]}
>>> # aka:
>>> #   set(doc.spans['mygroup']) == {span2}
>>> # Notice that subsequent calls to setdefault() with an existing key
>>> # operate as expected (i.e. the return value is the SpanGroup).
>>> # ---------------------------------------------
>>> # Expected:
>>> #   set(doc.spans['mygroup']) == {span1, span2}
>>> # ---------------------------------------------
>>> isinstance(doc.spans['mygroup'], SpanGroup) # sanity check
True

In general, users may attempt to initially populate a SpanGroup via the return value of setdefault(), but find that their initial .append/.extend/... is not applied to the intended SpanGroup; their initial update is essentially ignored.

Cause:

SpanGroups subclasses UserDict but does not implement setdefault itself, and CPython's UserDict uses the MutableMapping implementation of setdefault.

CPython v3.10.4's MutableMapping.setdefault implementation is exactly:

def setdefault(self, key, default=None):
    'D.setdefault(k[,d]) -> D.get(k,d), also set D[k]=d if k not in D'
    try:
        return self[key]
    except KeyError:
        self[key] = default
    return default

So, if a MutableMapping subclass (e.g. UserDict, and thus SpanGroups) implements its own __setitem__ and that __setitem__ may "alter" the stored value, as is the case with SpanGroups (when it changes a non-SpanGroup value to a SpanGroup via _make_span_group), then the value returned by the setdefault() call is not the altered value, but the value given as the default arg to setdefault().

(Given Python's dynamic nature (and my remaining ignorance about Python), I was a bit surprised by this behavior in a standard library function. I took a look and found that someone already reported this on BPO a couple years ago (https://bugs.python.org/issue41647; now mirrored at python/cpython#85813) and it's marked as "not a bug".)

As it works in my example above: when calling e.g. doc.spans.setdefault('mygroup', []), if the 'mygroup' key is not yet present in the doc.spans SpanGroups, a new SpanGroup(doc, name='mygroup', spans=[]) will be created as the value for the new 'mygroup' key, and the empty list [] passed to setdefault() will be returned. Any operations on that returned value apply to that list--not the SpanGroup that was created.

Solution:

(Forthcoming (within a day); I've just written it--it's only a few lines--but I wanted to double-check your contributing guide and consider including a test, etc.)

Alternative:

Alternatively, this could be considered "not a bug" for spaCy as well, and users would be expected to pass a SpanGroup instance as a setdefault() default value (or, in general, keep in mind how UserDict.setdefault behaves). Though it would be a bit verbose/redudant, e.g.:

doc.spans.setdefault('mygroup',
                     SpanGroup(doc, name='mygroup', spans=[])) \
                    .append(span1)

I figured that since SpanGroups is noted as a dict-like proxy, it didn't necessarily have to inherit quirks of e.g. UserDict.

adrianeboyd commented 2 years ago

This sounds like there are a lot of nuances to consider.

Let me first back up a step: what are some use cases for using setdefault with SpanGroups?

single-fingal commented 2 years ago

Sure. The first (set of) use case(s) that comes to mind involves using a Matcher or PhraseMatcher to annotate spans. Doc.spans might be used in the first place, instead of Doc.ents, because the annotated spans might be "overlapping". When the *Matcher object performs its matching, SpanGroups may be created/updated with the match results.

An example:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank('en')
LABEL_PATTERNS = list(nlp.pipe(['Find', 'these strings']))
matcher = PhraseMatcher(nlp.vocab)
matcher.add('LABEL', LABEL_PATTERNS)

doc = nlp('Will this find these strings? Find this.')
# Here, arbitrary functionality (e.g. (other) pipeline components)
# could alter `doc`'s SpanGroups,
# later necessitating checking whether some SpanGroup
# at a given key exists:
... # arbitrary functionality
for match_id, start, end in matcher(doc):
    # Alternatively, an `on_match` callback to the Matcher()/PhraseMatcher()
    # could essentially perform what we're doing here:
    ... # arbitrary functionality
    span = doc[start:end]
    # If the 'label' SpanGroup already exists, append the span to it;
    # otherwise, create the 'label' SpanGroup and append to it.
    # (The user could perform these steps "manually"
    #  --checking `if 'label' in doc.spans: ...; else: ...`--
    #  but that's exactly the use case for setdefault() in general,
    #  and SpanGroups already has a .setdefault (implicitly) defined.)
    doc.spans.setdefault('label', []).append(span)
    # Or, perhaps the user would assign the span to a different SpanGroup,
    # depending on it/its context:
    #if 'i' in span.text:  # as an example condition
    #    doc.spans.setdefault('label', []).append(span)
    #else:
    #    doc.spans.setdefault('otherlabel', []).append(span)

Another example might follow the same general approach, but e.g. use a more complicated rule to find "errors" within a given Doc (perhaps with, e.g., each "type" of error being in its own SpanGroup).

Currently, after running the code above, the result is:

>>> doc.spans
{'label': [Find]}

(since the initial update via SpanGroups.setdefault is essentially "ignored", as noted above) when a user would expect it to be

>>> doc.spans
{'label': [these strings, Find]}

But there are additional issues with the current implementation. For example:

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "", line 1, in File "/usr/lib/python3.10/_collections_abc.py", line 1009, in setdefault self[key] = default File "/home/user/.local/lib/python3.10/site-packages/spacy/tokens/_dict_proxies.py", line 30, in setitem value = self._make_span_group(key, value) File "/home/user/.local/lib/python3.10/site-packages/spacy/tokens/_dict_proxies.py", line 36, in _make_span_group return SpanGroup(doc, name=name, spans=spans) File "spacy/tokens/span_group.pyx", line 51, in spacy.tokens.span_group.SpanGroup.init TypeError: 'NoneType' object is not iterable

(that is, even without using the returned object from setdefault().)
This is due to the `setdefault` "default" arg being set to `None` by default in the MutableMapping.setdefault implementation, and `None` currently not being able to be transformed into a SpanGroup.

It seems to me like the SpanGroups collection would only ever have `SpanGroup`s as values, and if so, I see it as helpful if its API--including `setdefault`--reflects that. If a SpanGroups value must be a SpanGroup, then only a SpanGroup value would make sense as a "default" value (e.g. to be passed to `setdefault()`); for any value that is not a SpanGroup already, we would attempt to transform it into a SpanGroup (as `SpanGroups.__setitem__` already does). The change here would be to have `SpanGroups.setdefault` also perform this transformation, essentially.

Then, if the value used in setdefault is or could be a SpanGroup, the user could make use of the returned value from setdefault() (as a SpanGroup). As it is now, something like:
```python
>>> doc.spans.setdefault('newergroup', ()).append(doc[2:3])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute 'append'

fails overtly, since the (non-SpanGroup) value passed to setdefault()--a tuple--has no "append" attribute. But something like:

>>> doc.spans.setdefault('newestgroup', []).append(doc[3:4])

fails silently--that is, if the user expects a SpanGroup to be returned--because the append() in this case applies to the list that was passed to setdefault().

Probably the biggest thing to consider for this matter is due to convention: what does a user expect to have returned from SpanGroups.setdefault()? If they think it will behave like UserDict.setdefault(), then the given default value--whatever (arbitrary) object it is--would be returned if the container doesn't have the given key. In general, I think it doesn't make much sense to pass a "default" value that could not be stored as a value in the container. In the case of SpanGroups, only a value that is already a SpanGroup or could be transformed into a SpanGroup would be acceptable. SpanGroups.__setitem__ essentially already has this requirement, and e.g. indirectly raises a TypeError if it's not provided with such a value (as we saw above, when trying to use None as a value).

The main thing that users would need to be aware of is: SpanGroups.setdefault() will always return a SpanGroup. If they passed a SpanGroup, (and the given key is not already in the SpanGroups,) then that same object will be returned. If they passed a SpanGroup-able object, (and the given key is not already in the SpanGroups,) then a different object--a SpanGroup--will be returned. In every case, the user can expect SpanGroups.setdefault() to return a SpanGroup object (--if it doesn't raise an exception). Personally, I think this approach for SpanGroups is more intuitive.

adrianeboyd commented 2 years ago

Thanks for the examples! In the code above the issues do really just seem to be stemming from non-SpanGroup defaults. A SpanGroups dict is only intended to store SpanGroup objects, so I'm not sure that we want the default to be anything other than a SpanGroup. So you could use this instead:

doc.spans.setdefault("label", default=SpanGroup(doc)).append(doc[1:2])

The end result is also not quite identical because the SpanGroup doesn't have the same name, so to be exact it would be:

doc.spans.setdefault("label", default=SpanGroup(doc, name="label")).append(doc[1:2])

I'm not aware of anywhere the name currently matters off the top of my head, but I haven't searched in detail.

However I think this is going to be slower, which is a concern in a loop, and in some rough tests it does look like it's ~15% slower than a case with an unused default=[].

And adding a check like this to the loop seems to be about the same speed as setdefault with an unused default:

    if "label" not in doc.spans:
        doc.spans["label"] = []

I think it might make sense to override setdefault to enforce the type of the default, or if there are still cases with confusing behavior, it might make more sense to disable setdefault instead. Let me also check with others on the team about their opinion on this, too.

single-fingal commented 2 years ago

Sounds good; take your time.

Yeah, I noticed that users would have to pass a SpanGroup as the default arg to setdefault() if they wanted to get that same SpanGroup returned (assuming there's not already a SpanGroup at that key; otherwise, a different SpanGroup would be returned). Though it is a bit verbose/redundant--having to import SpanGroup and specify the same Doc object and the same name[^1]. And it does also create a new SpanGroup object which may not be used (if the key is already in the SpanGroups), which I think your test numbers reflect.

From my rough tests, I actually saw a ~15% improvement when passing a SpanGroup() to setdefault(), but only when there was a single setdefault() call + when there was not already a SpanGroup at the given key--which is to be expected, because SpanGroup() would be called (implicitly) eventually, in such a case, and having the default value already be a SpanGroup avoids some branching/calls. When you start having more setdefault() calls, however, the performance of e.g. a tuple() being passed to setdefault() is faster: about 15% faster when there are just 4 calls to setdefault() given the same key, and this increases as there are more calls to setdefault() given the same key (e.g. ~60% faster at 100 calls). (Note: my testing code for this was basically comparing:)

# create a Doc(), then:
doc.spans.setdefault('test', SpanGroup(doc, name='test'))
# or:
doc.spans.setdefault('test', ())
# each within a loop of 1, 4, 100, ... iterations

[Side note: I hope my most recent edits to my GitHub comments are showing (by default) to others, because my first comment in this issue is correctly showing my latest edit to me, but my prior comment in this issue is not showing (by default) my most recent edit. (I can still see it, though, if I click to see the edits.) I'm still pretty new to GitHub, so I could be doing something wrong. I'll try sorting it out.]

I think we wouldn't have to be so strict as requiring a SpanGroup as a default value, but an object that is either a SpanGroup or "SpanGroup-able" (to continue using that term), which is basically what SpanGroups._make_span_group() accepts. If SpanGroups.setdefault accepted these two ~types of values, then it would echo the behavior of SpanGroups.__setitem__.

Then, if we accept those two ~types, the setdefault() default argument could be changed to e.g. a tuple() (as I use in #10676), which is SpanGroup-able. That would then fix the problem with calling setdefault() with optional arguments omitted.

Enforcing the type of the default would definitely be a step in the right direction, I think--and at least making it so that setdefault() called with the optional args omitted wouldn't fail. Basically the only other change this ticket requests is that setdefault() also returns the set SpanGroup at the given key (of the SpanGroups).

[^1]: from a quick search, the main purpose that I found for the SpanGroup name is simply for serialization and deserialization: the key for the SpanGroup within the SpanGroups object is actually ignored, and the SpanGroup.name is used for serialization. When deserializing SpanGroups, the SpanGroup.name is used as the key of the SpanGroups object (--the value of the key then being the SpanGroup itself).

Actually, **we may have a separate problem** here: if a SpanGroup.name conflicts with another SpanGroup.name in a `SpanGroups` object--but each are values of different keys of the SpanGroups object--then when serializing and then deserializing the `SpanGroups`, only 1 of the `SpanGroup`s will be present in the `SpanGroups` object.

This could occur e.g. if a user incorrectly enters a SpanGroup name when initializing the object and setting it as a value in a `SpanGroups` object (or whenever they enter a duplicate--in the `SpanGroups` ~namespace--SpanGroup name, in general). e.g.:

```python
doc = nlp("Will it blend?")
# To follow the same format used in some of the examples we've discussed,
# the user could set SpanGroups in the following (or another) way:
doc.spans.setdefault('test',  SpanGroup(doc, name='test', spans=[doc[0:1]])
doc.spans.setdefault('test2', SpanGroup(doc, name='test', spans=[doc[1:2]])
>>> doc.spans
{'test': [Will], 'test2': [it]}
>>> doc.spans.from_bytes(doc.spans.to_bytes())
{'test': [it]}
```

(Of course, if setdefault() is modified as discussed, that will at least reduce the likelihood that something like the above would occur--since they wouldn't have to provide a SpanGroup() to setdefault().)

We may want to at least have `SpanGroups` ensure that no added SpanGroup has the same name as another SpanGroup in the `SpanGroups` object (though this would either have to iterate over the `SpanGroup`s or cache their names, **or** we could require that a SpanGroup.name matches the key it is the value of in `SpanGroups`). But I wonder if it should also be the case that: a SpanGroup is prevented from changing its name (e.g. via attribute assignment) to a name of another SpanGroup in its same `SpanGroups` object (or to a name that's different from the key it is the value of in `SpanGroups`). ...

I'll see if this has been reported or fixed already, and if not, open up an issue--which I did, here: #10685.
github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.