Closed cvinot closed 11 months ago
Thank you for letting us know! Indeed, this is problematic. This change was made to provide a uniform, generic, access to the normalized entity value (see #47), useful for printing results, exporting as a dataframe, etc, regardless of the span label.
I feel we can have the best of both worlds by querying the user_data
dict first, and running the getter if the search failed.
In any case, you're right:
force
should be FalseThanks for the PR link, it does make more sense understanding your approach.
However I still find dubious the use of the "value" extension you are suggesting, as a label getter. It seems you are basically erasing this extension by linking it to "label" which is basically limited to span categories, when it could simply be used to different use cases. I do understand the will to store normalized values.
Why not get directly the "label_" attribute when needed ?
In my use case I see the value extension more like a free entry value which could be set and overridden by different process, sort of how you use the value of your measurements. I run various pipelines, which store information in their respective specific attributes, defined in specifically named attributes, and "decisions" pipeline which decide for a value given the information of the spans. That means the value is not specifically normalized and could be of many shapes including custom models.
I think attributes with strong constraints should be isolated in very specific names to avoid overlapping with more general attributes if that makes sense. I'm personnaly not scared of verbose, we could imagine "span._.label_getter" with the same usecase.
Anyway thanks for explaining your view, I can understand that multiple approaches can be considered.
In my use case I see the value extension more like a free entry value which could be set and overridden by different process
Agreed, we will soon allow this. But in the case a value has not been set by the user, we will keep the getter behavior.
Why not get directly the "label_" attribute when needed ?
It's true that just typing span._.get(span.label_)
would often be sufficient. However, we have multiple use cases in which we list the extensions that we want to see exported / learned / evaluated, etc: with the value
extension, just listing it in my_func(extensions=["value", "negation"])
solves this.
That means the value is not specifically normalized and could be of many shapes including custom models.
In our case, a normalized value can also be an instance of a specific class (for instance an AbsoluteDate
object or a SimpleMeasurement
), or a str if it is enough to semantically describe the extraction.
Feature type
Enhance compatibility of EDSnlp custom attributes with potential external pipelines.
Description
in the BaseComponent class, in this commit you added this line :
Doing this, you are enforcing (overwriting if already defined) an attribute which is non edsnlp specific. i.e not named specificaly for your use case, and could be very easily required by anyone using your package as part of a broader pipeline.(see spacy good practices regarding naming components/attributes)
Forcing a getter function means if later on a component were to try to set a value, it would be ignored.
Example of potential conflict:
WIth my modest experience I would suggest avoiding enforcing attributes in general, but if necessary, renaming the attribute to avoid conflicts. If not possible allowing renaming of the attribute, and/or make sure you let the user know what attributes you are enforcing. (I believe this should be made very clear in the doc specifically in the case of attributes such as "value").
I ran into this conflict upgrading edsnlp from 0.7.4 to 0.9 I might be missing something here and would love to hear what the reasons are if this is absolutely needed.