Feature request: Unify span getters / setters

Thomzoy commented 1 year ago

Feature type

We might want to have a more uniform way of getting spans in pipelines. Currently, we have on_ents_only, on_spans, etc... An idea is to expose a span_getter key in the configuration that could look like:

span_getter = dict(
    ents = True,
    spans = ["first_span_key","second_span_key"], # or `True` to get all SpanGroups
    labels = ["relevant_label"], # to keep only entities with a specific `label_`
)

If a more complex getter is needed, it could come from a span_getter factory.

percevalw commented 1 year ago

To complete this, here are the different uses of such getters in the lib atm:

on_ents_only: in span classification pipes, such as negation or hypothesis to only classify ents instead of all tokens (this is specific to the implementation of the context algorithms), and in dates to only detect dates inside existing entities (useful for normalization purposes)

on_span_groups/on_ents: used in the span qualifier to retrieve the list of ents / spans that should be classified

    on_ents: Union[bool, Sequence[str]]
        Whether to look into `doc.ents` for spans to classify. If a list of strings
        is provided, only the span of the given labels will be considered. If None
        and `on_span_groups` is False, labels mentioned in `label_constraints`
        will be used.
    on_span_groups: Union[bool, Sequence[str], Mapping[str, Sequence[str]]]
        Whether to look into `doc.spans` for spans to classify:

        - If True, all span groups will be considered
        - If False, no span group will be considered
        - If a list of str is provided, only these span groups will be kept
        - If a mapping is provided, the keys are the span group names and the values
          are either a list of allowed labels in the group or True to keep them all

ent_labels / spans_labels: in the trainable NER pipe to retrieve the list of ents / spans that should be extracted

    ent_labels: Iterable[str]
        list of labels to filter entities for in `doc.ents`
    spans_labels: Mapping[str, Iterable[str]]
        Mapping from span group names to list of labels to look for entities
        and assign the predicted entities

However, these are tightly tied to the output format of the component.

as_ents in measurements and dates: whether to export matches as ents instead of just outputting them to a span group

There are in fact two kinds of span manipulation, which occur before and after a pipe:

span getters gather spans from multiple sources (from ents, from span groups, filtered by labels, etc.)
span setters output spans to a given destination (to ents or to span groups)

The upcoming refactoring of edsnlp will allow most rule-based NER components to specify zones where they should look up entities. Following a discussion with @Thomzoy, we will also update components to specify where to output their predictions, i.e. spans or extensions.

Outputs

Here are some suggestions for various rule-based NER components. It seems that the behaviors of such components are too diverse to factorize the span setting parameters.

class DatesAndDurations:
    def __init__(
        self, 
        # value of the .label_ attribute set on dates/durations
        date_label="date", 
        duration_label="duration",
        # name of the span group `.spans[name]` to write matches (with an overwriting behavior)
        to_date_span_group="dates",
        to_duration_span_group="durations",
        # whether to also store matches as standard spaCy `.ents` entities
        to_ents: bool = True,
        # Where to look for candidates, by default the whole document (see below)
        span_getter: Optional[SpanGetter] = None,
    ):
        ...

class MyMatcher:
    def __init__(
        self,
        label: str = "my-custom-match",
        to_ents: bool = True,
        to_span_group: str = "my-custom-matches",
        # or can be overwritten: to_span_group = "my-custom-matches-ml",
        span_getter: Optional[SpanGetter] = None,
    ):
        ...

Inputs

On the other hand, span getters look better suited for factorization. A span getter could be:

a simple string, to look up a span group
```
span_getter = "dates"
```
a list of strings, to look up multiple span groups
```
span_getter = ["dates", "durations"]
```

a more complex/complete configuration, e.g. the one suggested by @Thomzoy

span_getter = dict(
    ents = True,
    spans = ["first_span_key","second_span_key"], # or `True` to get all SpanGroups
    labels = ["relevant_label"], # to keep only entities with a specific `label_`
)

a callable, to allow for more customizations

def span_getter(doc):
    # do something with the doc
    return spans

In fact, the first three options could automatically be converted into hardcoded callables by the component, so that the component would only have to deal with a callable.
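
A minimal sketch of that conversion, to make the idea concrete. The names `make_span_getter` and `SpanGetterArg` are hypothetical (not part of edsnlp), and the doc and spans are `SimpleNamespace` stand-ins that only mimic the `.ents`, `.spans` and `.label_` attributes of spaCy objects:

```python
from types import SimpleNamespace
from typing import Callable, List, Mapping, Sequence, Union

# Hypothetical alias: any of the four forms listed above.
SpanGetterArg = Union[str, Sequence[str], Mapping[str, object], Callable]


def make_span_getter(arg: SpanGetterArg) -> Callable:
    """Turn a string, list, dict or callable into a `doc -> list of spans` callable."""
    if callable(arg):
        return arg  # already a getter, use it as-is
    if isinstance(arg, str):
        arg = {"spans": [arg]}  # a single span group
    elif not isinstance(arg, Mapping):
        arg = {"spans": list(arg)}  # several span groups
    use_ents = bool(arg.get("ents", False))
    groups = arg.get("spans", False)
    labels = arg.get("labels")  # None means "keep every label"

    def getter(doc) -> List:
        found = list(doc.ents) if use_ents else []
        # `True` selects every span group present on the doc
        names = list(doc.spans) if groups is True else list(groups or [])
        for name in names:
            found.extend(doc.spans.get(name, []))  # missing groups are skipped
        if labels is not None:
            found = [span for span in found if span.label_ in labels]
        return found  # note: no de-duplication between ents and span groups

    return getter


# Stand-in objects for illustration only.
date = SimpleNamespace(label_="date", text="12 mai")
drug = SimpleNamespace(label_="drug", text="aspirine")
doc = SimpleNamespace(ents=[drug], spans={"dates": [date]})

get_dates = make_span_getter("dates")
get_all = make_span_getter({"ents": True, "spans": True})
get_drugs = make_span_getter({"ents": True, "spans": True, "labels": ["drug"]})
```

With this shape, a component would call `make_span_getter` once in its constructor and only ever handle the resulting callable.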

Trainable NER

The trainable NER components are a bit more complex, as we have to deal with both span getters (during training / evaluation) and span setters (during inference / evaluation).

Since the span setting configuration is inferred from the span getting configuration in the current implementation, it would be nice to keep this behavior. Learning from the .ents collection is not desirable, since this field is prone to overwriting, which does not mix well with evaluation and training.

I suggest that the span_getter only allow specifying the span groups to look up, and that the span setting configuration be inferred from it, as is done currently.

The above span_getter configurations could be reused, with fewer options (no ents, no callable):

target_span_getter = dict(
    spans = ["first_span_key","second_span_key"], # or `True` to get all SpanGroups
    labels = ["relevant_label"], # to keep only entities with a specific `label_`
)
# or
target_span_getter = ["dates", "durations"]
# or 
target_span_getter = "dates"
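
To illustrate how the setting configuration might be inferred from such a getter, here is a rough sketch. `infer_to_span_groups` and its `known_groups` argument are hypothetical names, and this is not how edsnlp currently implements the inference; it only shows one possible rule (write predictions back to the groups that were read):

```python
from typing import Dict, List, Mapping, Optional, Sequence, Union

# Hypothetical alias for the three restricted forms shown above.
TargetSpanGetter = Union[str, Sequence[str], Mapping[str, object]]


def infer_to_span_groups(
    target_span_getter: TargetSpanGetter,
    known_groups: Sequence[str] = (),
) -> Dict[str, Optional[List[str]]]:
    """Map each target span group to the labels written into it (None = all labels)."""
    if isinstance(target_span_getter, str):
        return {target_span_getter: None}
    if not isinstance(target_span_getter, Mapping):
        return {name: None for name in target_span_getter}
    groups = target_span_getter.get("spans", True)
    labels = target_span_getter.get("labels")
    # `spans=True` means "all groups", which can only be resolved against the
    # group names seen in the training data (passed here as `known_groups`).
    names = list(known_groups) if groups is True else list(groups)
    return {name: (list(labels) if labels is not None else None) for name in names}
```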

For span setting, to_span_groups and to_ents parameters could be used, or be inferred from the training data:

# to set "date" labelled matches to the "dates-ml" span group, same for durations
to_span_groups = {
    "dates-ml": "date",
    "durations-ml": ["duration"],
}
# to set all predictions to a single span group
to_span_groups = "ner-predictions"

# and to_ents to set all predictions in ents
to_ents = True
# or to filter by label
to_ents = ["date", "duration"]
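
A possible reading of these two parameters, as a sketch only. `set_spans` is a hypothetical helper, the doc is a `SimpleNamespace` stand-in, and appending to `.ents` is an assumption (a real spaCy `Doc` rejects overlapping entities, so a real component would have to filter first):

```python
from types import SimpleNamespace


def set_spans(doc, matches, to_ents=False, to_span_groups=None):
    """Write `matches` to `doc.ents` and/or `doc.spans` according to the two parameters."""
    if isinstance(to_span_groups, str):
        # a single group receives every prediction (overwriting behavior)
        doc.spans[to_span_groups] = list(matches)
    elif to_span_groups:
        # mapping: group name -> one label or a list of labels
        for group, labels in to_span_groups.items():
            allowed = [labels] if isinstance(labels, str) else list(labels)
            doc.spans[group] = [m for m in matches if m.label_ in allowed]
    if to_ents is True:
        selected = list(matches)
    elif to_ents:
        selected = [m for m in matches if m.label_ in to_ents]
    else:
        selected = []
    if selected:
        # Assumption: new entities are appended to the existing ones.
        doc.ents = list(doc.ents) + selected
    return doc


# Stand-in objects for illustration only.
date = SimpleNamespace(label_="date")
duration = SimpleNamespace(label_="duration")
doc = SimpleNamespace(ents=[], spans={})
set_spans(
    doc,
    [date, duration],
    to_ents=["date"],
    to_span_groups={"dates-ml": "date", "durations-ml": ["duration"]},
)
```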

Summary

A span getter can be either:

a simple string, to look up a span group
```
span_getter = "dates"
```
a list of strings, to look up multiple span groups
```
span_getter = ["dates", "durations"]
```

a more complex/complete configuration, e.g. the one suggested by @Thomzoy

span_getter = dict(
    ents = True,
    spans = ["first_span_key","second_span_key"], # or `True` to get all SpanGroups
    labels = ["relevant_label"], # to keep only entities with a specific `label_`
)

a callable, to allow for more customizations

def span_getter(doc):
    # do something with the doc
    return spans

Each component using a span getter should accept some of these configurations, but not necessarily all of them, and convert them to a callable if needed.

Span setters can target .ents or .spans (or both). to_ents-like params can be a mix of:

a boolean, to set all predictions to ents
```
to_ents = True
```
a list of strings, to set all predictions with these labels to ents
```
to_ents = ["date", "duration"]
```

And to_span_groups-like params can be a mix of:

a string, to set all predictions to a single span group
```
to_span_groups = "ner-predictions"
```

a mapping, to set predictions to different span groups

to_span_groups = {
    "dates-ml": "date",
    "durations-ml": ["duration"],
}

Each component using span setters/getters should accept some of these configurations, but not necessarily all of them. It's important that we don't enforce strict modularity or uniformity, as it would make the API too complex and rigid.

aricohen93 commented 1 year ago

As discussed, remember to:

rename the annotate function, maybe to set_spans
always save in span_groups (due to interactions)
in the example of DatesAndDurations, the two arguments to_date_span_group and to_duration_span_group could be replaced by a dict in span_group
take into consideration interactions between pipelines (e.g. dates and biology)
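
For the third point, the constructor could look roughly like this. It is only a sketch of the suggestion: the `span_group` dict (group name to label, same orientation as the `to_span_groups` mapping above) and the `groups_for` helper are hypothetical, and the component body is left out:

```python
from typing import Callable, Dict, List, Optional


class DatesAndDurations:
    def __init__(
        self,
        date_label: str = "date",
        duration_label: str = "duration",
        # hypothetical replacement for `to_date_span_group` / `to_duration_span_group`:
        # maps each span group name to the label it receives
        span_group: Optional[Dict[str, str]] = None,
        to_ents: bool = True,
        span_getter: Optional[Callable] = None,
    ):
        self.span_group = (
            dict(span_group)
            if span_group is not None
            else {"dates": date_label, "durations": duration_label}
        )
        self.to_ents = to_ents
        self.span_getter = span_getter

    def groups_for(self, label: str) -> List[str]:
        """Names of the span groups that should receive matches with this label."""
        return [group for group, lab in self.span_group.items() if lab == label]
```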

percevalw commented 1 year ago

Closing as this was merged in #213
