aphp / edsnlp

Modular, fast NLP framework, compatible with Pytorch and spaCy, offering tailored support for French clinical notes.
https://aphp.github.io/edsnlp/
BSD 3-Clause "New" or "Revised" License
112 stars 29 forks source link

Feature request: relieve constraints on non edsnlp custom attributes #220

Closed cvinot closed 9 months ago

cvinot commented 11 months ago

Feature type

Enhance compatibility of EDSnlp custom attributes with potential external pipelines.

Description

in the BaseComponent class, in this commit you added this line :

        Span.set_extension(
            "value",
            getter=lambda span: span._.get(span.label_)
            if span._.has(span.label_)
            else None,
            force=True,
        )`

Doing this, you are enforcing (overwriting if already defined) an attribute which is non edsnlp specific. i.e not named specificaly for your use case, and could be very easily required by anyone using your package as part of a broader pipeline.(see spacy good practices regarding naming components/attributes)

Forcing a getter function means if later on a component were to try to set a value, it would be ignored.

Example of potential conflict:

import spacy
# Setting up a first pipeline
random_nlp = spacy.blank("fr")
random_nlp.add_pipe(
    "eds.terminology", # This component is enforcing the "value" custom attribute
    name="test",
    config=dict(label="Any",terms={}),
)

# Setting up another custom pipeline somewhere else in the code
nlp = spacy.blank("fr")
text = "hello this is a test"
doc = nlp(text)
my_span = doc[0:3]
my_span._.value = "CustomValue" # This raises no error.

assert my_span._.value == "CustomValue" # Error: my_span._.value is None.

WIth my modest experience I would suggest avoiding enforcing attributes in general, but if necessary, renaming the attribute to avoid conflicts. If not possible allowing renaming of the attribute, and/or make sure you let the user know what attributes you are enforcing. (I believe this should be made very clear in the doc specifically in the case of attributes such as "value").

I ran into this conflict upgrading edsnlp from 0.7.4 to 0.9 I might be missing something here and would love to hear what the reasons are if this is absolutely needed.

percevalw commented 11 months ago

Thank you for letting us know! Indeed, this is problematic. This change was made to provide a uniform, generic, access to the normalized entity value (see #47), useful for printing results, exporting as a dataframe, etc, regardless of the span label. I feel we can have the best of both worlds by querying the user_data dict first, and running the getter if the search failed.

In any case, you're right:

cvinot commented 11 months ago

Thanks for the PR link, it does make more sense understanding your approach.

However I still find dubious the use of the "value" extension you are suggesting, as a label getter. It seems you are basically erasing this extension by linking it to "label" which is basically limited to span categories, when it could simply be used to different use cases. I do understand the will to store normalized values.

Why not get directly the "label_" attribute when needed ?

In my use case I see the value extension more like a free entry value which could be set and overridden by different process, sort of how you use the value of your measurements. I run various pipelines, which store information in their respective specific attributes, defined in specifically named attributes, and "decisions" pipeline which decide for a value given the information of the spans. That means the value is not specifically normalized and could be of many shapes including custom models.

I think attributes with strong constraints should be isolated in very specific names to avoid overlapping with more general attributes if that makes sense. I'm personnaly not scared of verbose, we could imagine "span._.label_getter" with the same usecase.

Anyway thanks for explaining your view, I can understand that multiple approaches can be considered.

percevalw commented 11 months ago

In my use case I see the value extension more like a free entry value which could be set and overridden by different process

Agreed, we will soon allow this. But in the case a value has not been set by the user, we will keep the getter behavior.

Why not get directly the "label_" attribute when needed ?

It's true that just typing span._.get(span.label_) would often be sufficient. However, we have multiple use cases in which we list the extensions that we want to see exported / learned / evaluated, etc: with the value extension, just listing it in my_func(extensions=["value", "negation"]) solves this.

That means the value is not specifically normalized and could be of many shapes including custom models.

In our case, a normalized value can also be an instance of a specific class (for instance an AbsoluteDate object or a SimpleMeasurement), or a str if it is enough to semantically describe the extraction.