Closed Thomzoy closed 1 year ago
To complete this, here are the different uses of such getters in the lib atm:
on_ents_only: in span classification pipes, such as negation or hypothesis, to only classify ents instead of all tokens (this is specific to the implementation of the context algorithms), and in dates to only detect dates inside existing entities (useful for normalization purposes)
on_span_groups / on_ents: used in span qualifier to retrieve the list of ents / spans that should be classified
on_ents: Union[bool, Sequence[str]]
    Whether to look into `doc.ents` for spans to classify. If a list of strings
    is provided, only the spans with the given labels will be considered. If None
    and `on_span_groups` is False, labels mentioned in `label_constraints`
    will be used.
on_span_groups: Union[bool, Sequence[str], Mapping[str, Sequence[str]]]
    Whether to look into `doc.spans` for spans to classify:
    - If True, all span groups will be considered
    - If False, no span group will be considered
    - If a list of str is provided, only these span groups will be kept
    - If a mapping is provided, the keys are the span group names and the values
      are either a list of allowed labels in the group or True to keep them all
ent_labels / span_labels: in trainable NER pipe to retrieve the list of ents / spans that should be extracted
ent_labels: Iterable[str]
    List of labels to filter entities for in `doc.ents`
spans_labels: Mapping[str, Iterable[str]]
    Mapping from span group names to list of labels to look for entities
    and assign the predicted entities
However, these are tightly tied to the output format of the component.
as_ents: in measurements and dates, whether to export matches as ents instead of just outputting them to a span group
There are in fact two kinds of span manipulation, which occur before and after a pipe:
The upcoming refactoring of edsnlp will allow most rule-based NER components to specify zones where they should look up entities. Following a discussion with @Thomzoy, we will also update components to specify where to output their predictions, i.e. spans or extensions.
Here are some suggestions for various rule-based NER components. It seems that the behaviors of such components are too diverse to factorize the span setting parameters.
class DatesAndDurations:
    def __init__(
        self,
        # value of the .label_ attribute set on dates/durations
        date_label="date",
        duration_label="duration",
        # name of the span group `.spans[name]` to write matches (with an overwriting behavior)
        to_date_span_group="dates",
        to_duration_span_group="durations",
        # whether to also store matches as standard spaCy `.ents` entities
        to_ents: bool = True,
        # where to look for candidates, by default the whole document (see below)
        span_getter: Optional[SpanGetter] = None,
    ):
        ...
class MyMatcher:
    def __init__(
        self,
        label: str = "my-custom-match",
        to_span_group: str = "my-custom-matches",
        # or can be overwritten: to_span_group = "my-custom-matches-ml",
        to_ents: bool = True,
        span_getter: Optional[SpanGetter] = None,
    ):
        ...
On the other hand, span getters look better suited for factorization. A span getter could be:
span_getter = "dates"
# or
span_getter = ["dates", "durations"]
# or
span_getter = dict(
    ents = True,
    spans = ["first_span_key","second_span_key"], # or `True` to get all SpanGroups
    labels = ["relevant_label"], # to keep only entities with a specific `label_`
)
# or
def span_getter(doc):
    # do something with the doc
    return spans
In fact, the first three options could automatically be converted into hardcoded callables by the component, so that the component would only have to deal with a callable.
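As a rough illustration of that conversion, here is a minimal sketch. The helper name make_span_getter is hypothetical and not part of edsnlp; it only assumes that the document exposes `.ents` and a dict-like `.spans`, and that each span has a `label_` attribute, as spaCy docs do.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, Iterable, List, Sequence, Union

# The four accepted forms: a group name, a list of group names,
# a dict with "ents" / "spans" / "labels" keys, or a callable.
SpanGetterArg = Union[
    str, Sequence[str], Dict[str, Any], Callable[[Any], Iterable[Any]]
]


def make_span_getter(config: SpanGetterArg) -> Callable[[Any], List[Any]]:
    """Hypothetical helper: turn any accepted form into a callable doc -> spans."""
    if callable(config):
        return lambda doc: list(config(doc))
    if isinstance(config, str):
        config = {"spans": [config]}
    elif not isinstance(config, dict):
        config = {"spans": list(config)}

    use_ents = config.get("ents", False)
    groups = config.get("spans", False)
    labels = config.get("labels")

    def getter(doc: Any) -> List[Any]:
        found: List[Any] = []
        if use_ents:
            found.extend(doc.ents)
        if groups is True:
            names = list(doc.spans)  # every span group of the doc
        elif groups:
            # groups missing from this doc are skipped silently
            names = [name for name in groups if name in doc.spans]
        else:
            names = []
        for name in names:
            found.extend(doc.spans[name])
        if labels is not None:
            found = [span for span in found if span.label_ in labels]
        return found

    return getter


# Usage with a stand-in document (only .ents, .spans and .label_ are needed)
date = SimpleNamespace(label_="date")
example_doc = SimpleNamespace(ents=[], spans={"dates": [date]})
dates_only = make_span_getter("dates")(example_doc)
```

With something like this, a component would call the helper once in its constructor and only ever deal with the resulting callable afterwards.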
The trainable NER components are a bit more complex, as we have to deal with both span getters (during training / evaluation) and span setters (during inference / evaluation).
Since the span setting configuration is inferred from the span getting configuration in the current implementation, it would be nice to keep this behavior. Learning from the .ents collection is not desirable, since this field is prone to overwriting, which does not mix well with evaluation and training.
I suggest that the span_getter only allow specifying the span groups to look up, and that the span setting configuration be inferred from it, as done currently.
The above span_getter configurations could be reused, with fewer options (no ents, no callable):
target_span_getter = dict(
    spans = ["first_span_key","second_span_key"], # or `True` to get all SpanGroups
    labels = ["relevant_label"], # to keep only entities with a specific `label_`
)
# or
target_span_getter = ["dates", "durations"]
# or
target_span_getter = "dates"
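To make the "infer the setter from the getter" idea concrete, here is a hedged sketch. The function infer_to_span_groups, its observed_groups argument, and the use of True to mean "all labels" are assumptions made for illustration; they do not describe the current edsnlp implementation.

```python
from typing import Dict, Iterable, List, Mapping, Sequence, Union

TargetSpanGetter = Union[str, Sequence[str], Mapping[str, object]]
LabelFilter = Union[bool, List[str]]


def infer_to_span_groups(
    target_span_getter: TargetSpanGetter,
    observed_groups: Iterable[str] = (),
) -> Dict[str, LabelFilter]:
    """Hypothetical helper: predictions go back to the groups they were read from.

    Returns {group_name: labels}, where True stands for "all labels"
    (an assumption of this sketch).
    """
    if isinstance(target_span_getter, str):
        groups, labels = [target_span_getter], None
    elif isinstance(target_span_getter, Mapping):
        groups = target_span_getter.get("spans", True)
        labels = target_span_getter.get("labels")
    else:
        groups, labels = list(target_span_getter), None

    if groups is True:
        # "all SpanGroups": the names can only come from the training data
        groups = sorted(set(observed_groups))
    elif isinstance(groups, str):
        groups = [groups]

    return {
        name: True if labels is None else list(labels)
        for name in groups
    }


inferred = infer_to_span_groups(
    dict(spans=["first_span_key", "second_span_key"], labels=["relevant_label"])
)
```

Here `inferred` maps both span keys to `["relevant_label"]`, i.e. the pipe would write its predictions to the same groups, with the same label filter, that it learned from.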
For span setting, to_span_groups and to_ents parameters could be used, and be inferred from the training data:
# to set "date" labelled matches to the "dates-ml" span group, same for durations
to_span_groups = {
    "dates-ml": "date",
    "durations-ml": ["duration"],
}
# to set all predictions to a single span group
to_span_groups = "ner-predictions"
# and to_ents to set all predictions in ents
to_ents = True
# or to filter by label
to_ents = ["date", "duration"]
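For illustration, the two parameters above could be applied to a list of predicted spans roughly as follows. This is only a sketch: set_spans is a hypothetical helper, the doc is duck-typed (it needs a dict-like `.spans` and an assignable `.ents`), and a real spaCy `Doc` would additionally reject overlapping spans in `.ents`, which this sketch does not handle.

```python
from types import SimpleNamespace
from typing import Any, Dict, List, Optional, Sequence, Set, Union

ToSpanGroups = Union[str, Dict[str, Union[str, Sequence[str]]]]
ToEnts = Union[bool, Sequence[str]]


def set_spans(
    doc: Any,
    predictions: List[Any],
    to_span_groups: Optional[ToSpanGroups] = None,
    to_ents: ToEnts = False,
) -> Any:
    """Hypothetical helper: write predictions to span groups and/or ents."""
    # Normalize into {group_name: set of labels, or None for "all labels"}
    group_filters: Dict[str, Optional[Set[str]]]
    if to_span_groups is None:
        group_filters = {}
    elif isinstance(to_span_groups, str):
        group_filters = {to_span_groups: None}
    else:
        group_filters = {
            name: {labels} if isinstance(labels, str) else set(labels)
            for name, labels in to_span_groups.items()
        }

    # Each targeted group is overwritten, not extended
    for name, labels in group_filters.items():
        doc.spans[name] = [
            span for span in predictions
            if labels is None or span.label_ in labels
        ]

    if to_ents is True:
        doc.ents = list(predictions)
    elif to_ents:
        doc.ents = [span for span in predictions if span.label_ in to_ents]
    return doc


# Usage with stand-in objects
date = SimpleNamespace(label_="date")
duration = SimpleNamespace(label_="duration")
example_doc = SimpleNamespace(ents=[], spans={})
set_spans(
    example_doc,
    [date, duration],
    to_span_groups={"dates-ml": "date", "durations-ml": ["duration"]},
    to_ents=["date"],
)
```

After this call, `example_doc.spans` holds one group per label and `example_doc.ents` only contains the date.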
A span getter can be either:
span_getter = "dates"
# or
span_getter = ["dates", "durations"]
# or
span_getter = dict(
    ents = True,
    spans = ["first_span_key","second_span_key"], # or `True` to get all SpanGroups
    labels = ["relevant_label"], # to keep only entities with a specific `label_`
)
# or
def span_getter(doc):
    # do something with the doc
    return spans
Each component using a span getter should accept some of these configurations, but not necessarily all of them, and convert them to a callable if needed.
Span setters can target .ents or .spans (or both). to_ents-like params can be a mix of:
to_ents = True
to_ents = ["date", "duration"]
And to_span_groups-like params can be a mix of:
to_span_groups = "ner-predictions"
to_span_groups = {
    "dates-ml": "date",
    "durations-ml": ["duration"],
}
Each component using span setters/getters should accept some of these configurations, but not necessarily all of them. It's important that we don't enforce strict modularity or uniformity, as it would make the API too complex and rigid.
As discussed, remember to:
DatesAndDurations: the two arguments to_date_span_group and to_duration_span_group could be replaced by a dict in span_group
Closing as this was merged in #213
Feature type
We might want to have a more uniform way of getting spans in pipelines. Currently, we have on_ents_only, on_spans, etc. An idea is to expose a span_getter key in the configuration that could look like:
If a more complex getter is needed, it could come from a span_getter factory
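One way such a factory could work is sketched below. Everything here is hypothetical (the registry dict, the decorator, the "factory" config key and the example getter name); in practice a function registry such as the one spaCy provides would presumably play this role.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, List, Mapping

SpanGetter = Callable[[Any], List[Any]]

# Hypothetical registry: factory name -> function building a span getter
SPAN_GETTER_FACTORIES: Dict[str, Callable[..., SpanGetter]] = {}


def register_span_getter(name: str):
    """Decorator registering a span getter factory under `name`."""
    def decorator(factory: Callable[..., SpanGetter]):
        SPAN_GETTER_FACTORIES[name] = factory
        return factory
    return decorator


@register_span_getter("ents-with-min-length")
def make_min_length_getter(min_length: int = 2) -> SpanGetter:
    # Example of a getter too specific for the string / list / dict forms
    def getter(doc: Any) -> List[Any]:
        return [ent for ent in doc.ents if len(ent.text) >= min_length]
    return getter


def resolve_span_getter(config: Mapping[str, Any]) -> SpanGetter:
    """Build a getter from a config such as {"factory": name, **kwargs}."""
    params = dict(config)
    factory = SPAN_GETTER_FACTORIES[params.pop("factory")]
    return factory(**params)


# Usage with a stand-in document
example_doc = SimpleNamespace(
    ents=[SimpleNamespace(text="a"), SimpleNamespace(text="fever")]
)
getter = resolve_span_getter({"factory": "ents-with-min-length", "min_length": 3})
kept = [ent.text for ent in getter(example_doc)]
```

The configuration stays declarative (a name plus keyword arguments), while the component still only receives a callable.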