aphp / edsnlp

Modular, fast NLP framework, compatible with Pytorch and spaCy, offering tailored support for French clinical notes.
https://aphp.github.io/edsnlp/
BSD 3-Clause "New" or "Revised" License

Feature request: Unify span getters / setters #203

Closed Thomzoy closed 1 year ago

Thomzoy commented 1 year ago

Feature type

We might want a more uniform way of getting spans in pipelines. Currently, we have on_ents_only, on_spans, etc. One idea is to expose a span_getter key in the configuration, which could look like:

span_getter = dict(
    ents = True,
    spans = ["first_span_key","second_span_key"], # or `True` to get all SpanGroups
    labels = ["relevant_label"], # to keep only entities with a specific `label_`
)

If a more complex getter is needed, it could come from a span_getter factory
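
As a rough sketch of how such a dict might be interpreted (this is not the edsnlp implementation), the function below collects spans according to the `ents`, `spans` and `labels` keys. `FakeSpan`, `FakeDoc` and `get_spans` are hypothetical names; the two classes are minimal stand-ins for spaCy's `Span` and `Doc` so that the example runs without dependencies.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class FakeSpan:
    # Hypothetical stand-in for spacy.tokens.Span
    start: int
    end: int
    label_: str


@dataclass
class FakeDoc:
    # Hypothetical stand-in for spacy.tokens.Doc
    ents: List[FakeSpan] = field(default_factory=list)
    spans: Dict[str, List[FakeSpan]] = field(default_factory=dict)


def get_spans(doc: FakeDoc, span_getter: dict) -> List[FakeSpan]:
    """Collect candidate spans described by a dict-style span_getter."""
    found: List[FakeSpan] = []
    if span_getter.get("ents", False):
        found.extend(doc.ents)
    groups = span_getter.get("spans", [])
    if groups is True:
        groups = list(doc.spans)  # `True` means every span group
    for name in groups or []:
        found.extend(doc.spans.get(name, []))
    labels = span_getter.get("labels")
    if labels is not None:
        # Keep only spans whose label_ is in the allowed list
        found = [s for s in found if s.label_ in labels]
    return found


doc = FakeDoc(
    ents=[FakeSpan(0, 2, "date")],
    spans={"first_span_key": [FakeSpan(3, 5, "relevant_label")]},
)
config = dict(ents=True, spans=["first_span_key"], labels=["relevant_label"])
selected = get_spans(doc, config)
```

Here the `labels` filter is applied to both entities and span groups, which is one possible reading of the proposal; a component could just as well restrict it to entities only.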

percevalw commented 1 year ago

To complete this, here are the different uses of such getters in the lib atm:

There are in fact two kinds of span manipulation, which occur before and after a pipe:

The upcoming refactoring of edsnlp will allow most rule-based NER components to specify zones where they should look up entities. Following a discussion with @Thomzoy, we will also update components to specify where to output their predictions, i.e. spans or extensions.

Outputs

Here are some suggestions for various rule-based NER components. It seems that the behaviors of such components are too diverse to factorize the span setting parameters.

class DatesAndDurations:
    def __init__(
        self, 
        # value of the .label_ attribute set on dates/durations
        date_label="date", 
        duration_label="duration",
        # name of the span group `.spans[name]` to write matches (with an overwriting behavior)
        to_date_span_group="dates",
        to_duration_span_group="durations",
        # whether to also store matches as standard spaCy `.ents` entities
        to_ents: bool = True,
        # Where to look for candidates, by default the whole document (see below)
        span_getter: Optional[SpanGetter] = None,
    ):
        ...

class MyMatcher:
    def __init__(
        self,
        label: str = "my-custom-match",
        to_span_group: str = "my-custom-matches",
        # or can be overwritten: to_span_group = "my-custom-matches-ml",
        to_ents: bool = True,
        span_getter: Optional[SpanGetter] = None,
    ):
        ...

Inputs

On the other hand, span getters look better suited for factorization. A span getter could be:

In fact, the first three options could be converted automatically into hardcoded callables by the component, so that the component only has to deal with a callable.
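
A minimal sketch of that conversion, assuming the non-callable options are a single span group name, a list of names, or a dict (the forms used in the examples later in this thread). `normalize_span_getter` is a hypothetical helper, not an edsnlp function, and `SimpleNamespace` objects stand in for spaCy docs and spans.

```python
from types import SimpleNamespace
from typing import Callable


def normalize_span_getter(arg) -> Callable:
    """Turn a str / list / dict configuration into a callable doc -> spans."""
    if callable(arg):
        return arg  # already a callable, nothing to convert
    if isinstance(arg, str):
        config = {"spans": [arg]}
    elif isinstance(arg, list):
        config = {"spans": list(arg)}
    elif isinstance(arg, dict):
        config = dict(arg)
    else:
        raise TypeError(f"Unsupported span getter: {arg!r}")

    def getter(doc):
        found = list(doc.ents) if config.get("ents") else []
        groups = config.get("spans", [])
        if groups is True:
            groups = list(doc.spans)  # `True` means every span group
        for name in groups or []:
            found.extend(doc.spans.get(name, []))
        labels = config.get("labels")
        if labels is not None:
            found = [s for s in found if s.label_ in labels]
        return found

    return getter


doc = SimpleNamespace(
    ents=[],
    spans={
        "dates": [SimpleNamespace(label_="date")],
        "durations": [SimpleNamespace(label_="duration")],
    },
)
from_str = normalize_span_getter("dates")
from_list = normalize_span_getter(["dates", "durations"])
from_dict = normalize_span_getter({"spans": True, "labels": ["duration"]})
```

With this shape, a component would call the helper once in its constructor and then only ever invoke the resulting callable on each document.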

Trainable NER

The trainable NER components are a bit more complex, as we have to deal with both span getters (during training / evaluation) and span setters (during inference / evaluation).

Since the span setting configuration is inferred from the span getting configuration in the current implementation, it would be nice to keep this behavior. Learning from the .ents collection is not desirable, since this field is prone to overwriting, which does not mix well with evaluation and training.

I suggest that the span_getter only allow specifying the span groups to look up, and that the span setting configuration be inferred from it, as is done currently.

The above span_getter configurations could be reused, with fewer options (no ents, no callable):

target_span_getter = dict(
    spans = ["first_span_key","second_span_key"], # or `True` to get all SpanGroups
    labels = ["relevant_label"], # to keep only entities with a specific `label_`
)
# or
target_span_getter = ["dates", "durations"]
# or 
target_span_getter = "dates"

For span setting, to_span_groups and to_ents parameters could be used, and could be inferred from the training data:

# to set "date" labelled matches to the "dates-ml" span group, same for durations
to_span_groups = {
    "dates-ml": "date",
    "durations-ml": ["duration"],
}
# to set all predictions to a single span group
to_span_groups = "ner-predictions"

# and to_ents to set all predictions in ents
to_ents = True
# or to filter by label
to_ents = ["date", "duration"]
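
To make the intended semantics concrete, here is a small sketch of how a component might apply these two parameters to its predictions. `set_spans` is a hypothetical helper and `SimpleNamespace` objects stand in for spaCy docs and spans; the sketch assumes an overwriting behavior and ignores spaCy's constraint that `.ents` cannot overlap.

```python
from types import SimpleNamespace


def set_spans(doc, predictions, to_span_groups=None, to_ents=False):
    """Write predictions to doc.spans and/or doc.ents (overwriting)."""
    if isinstance(to_span_groups, str):
        # A single name: every prediction goes to that span group
        doc.spans[to_span_groups] = list(predictions)
    elif isinstance(to_span_groups, dict):
        # Mapping of span group name -> label or list of labels
        for group, labels in to_span_groups.items():
            if isinstance(labels, str):
                labels = [labels]
            doc.spans[group] = [s for s in predictions if s.label_ in labels]
    if to_ents is True:
        doc.ents = list(predictions)
    elif to_ents:
        # A list of labels: keep only matching predictions in ents
        doc.ents = [s for s in predictions if s.label_ in to_ents]
    return doc


date = SimpleNamespace(label_="date")
duration = SimpleNamespace(label_="duration")
doc = SimpleNamespace(ents=[], spans={})
set_spans(
    doc,
    [date, duration],
    to_span_groups={"dates-ml": "date", "durations-ml": ["duration"]},
    to_ents=["date"],
)
```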

Summary

A span getter can be either:

Each component using a span getter should accept some of these configurations, but not necessarily all of them, and convert them to a callable if needed.

Span setters can target .ents or .spans (or both). to_ents-like params can be a mix of:

And to_span_groups-like params can be a mix of:

Each component using span setters or getters should accept some of these configurations, but not necessarily all of them. It's important that we don't enforce strict modularity or uniformity, as that would make the API too complex and rigid.

aricohen93 commented 1 year ago

As discussed, remember to:

percevalw commented 1 year ago

Closing as this was merged in #213