Closed bdura closed 1 year ago
@percevalw, @Thomzoy, @Aremaki, I'd love to get your thoughts on this!
Great idea, I second you on this. Here are a few points from our IRL discussion.

As most end users would only use the existing attributes (`negated`, `value`, ...) and not write new ones, we should prefer a simpler notation, as is currently the case, under the base `_` getter, while limiting the number of attributes. Therefore, as suggested, we could use `span._.value` for dates, measures, scores, concepts, ..., and `span._.negation`, `span._.rspeech`, etc. for more syntax-based attributes. As the `norm` attribute is now used extensively throughout the lib as a depolluted ASCII version of the text, changing it would probably mean refactoring most of the code. To generate semantically normalized text, we can use the `__str__` and `__repr__` methods on the generic `value` attribute.
Following your `dates` revamp, the `_.value` extension could inherit from a `pydantic.BaseModel`, like:

```python
from pydantic import BaseModel

class EDSValue(BaseModel):
    def __str__(self): ...
    def __repr__(self): ...

class Date(EDSValue):
    ...

class Measure(EDSValue):
    ...

class Concept(EDSValue):
    ...
```
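A runnable sketch of what this hierarchy could look like, where the `Date` fields and the `__str__` format are illustrative assumptions, not the actual EDS-NLP schema:

```python
from datetime import date

from pydantic import BaseModel

class EDSValue(BaseModel):
    """Base class for normalized values stored under span._.value."""

    def __str__(self) -> str:
        # default: fall back to the model repr
        return repr(self)

class Date(EDSValue):
    # illustrative fields only
    year: int
    month: int
    day: int

    def __str__(self) -> str:
        # the semantically normalized text representation
        return date(self.year, self.month, self.day).isoformat()

print(str(Date(year=2023, month=5, day=1)))  # 2023-05-01
```

The point is that `str(span._.value)` then yields a clean, normalized rendering regardless of the concrete subclass.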
Sounds good! :tada:
As discussed, here is a potential solution that standardizes the current architecture for custom extensions @Thomzoy @aricohen93

Each component can create a `Span` extension named after the label of the entities it creates:

- `eds.adicap` creates entities labeled `adicap`, and adds an `ent._.adicap` extension containing the decoding information
- `eds.tnm` creates entities labeled `tnm`, and adds an `ent._.tnm` extension
- `eds.drugs` creates entities labeled `drug`, and adds an `ent._.drug` extension

A specific `._.value` extension is defined as an aggregator that retrieves the field associated with the label via a getter, such that `ent._.value == getattr(ent._, ent.label_)`. The str representation of `._.value` could be the one displayed in the demonstrator.

This way, we keep a consistent typing of each extension (`tnm` -> `TNMScore`, `adicap` -> `AdicapCode`, `date` -> `Date`, ...), while offering a unique entry point for some use cases via the `value` extension.

This does not prevent us from defining other extensions if needed, or from keeping the old entity extensions and deprecating them in future versions.
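A minimal spaCy sketch of this aggregator getter, assuming a per-label extension set by the corresponding pipe (note that spaCy's `Underscore.get` takes no default argument, so the getter guards with `has`):

```python
import spacy
from spacy.tokens import Span

# Hypothetical per-label extension, normally declared by the eds.tnm pipe
Span.set_extension("tnm", default=None)

# Aggregator: ent._.value retrieves the extension named after the entity label
Span.set_extension(
    "value",
    getter=lambda span: span._.get(span.label_) if span._.has(span.label_) else None,
)

nlp = spacy.blank("fr")
doc = nlp("pT2 N0 M0")
ent = Span(doc, 0, 3, label="tnm")
ent._.tnm = "pT2N0M0"  # what the pipe would have stored

assert ent._.value == getattr(ent._, ent.label_)  # the invariant above
```

Entities whose label has no registered extension simply get `None` back from `._.value`.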
@Vincent-Maladiere
Following the discussion with @Thomzoy, we carry on with the approach commented above:

- Each pipe stores its normalized output in an extension named after the `label_` of the entity extracted (if any)
- The `value` extension is defined as the following getter: `lambda span: span._.get(span.label_, default=None)`. Having multiple extensions and an aggregator extension allows multiple pipes to modify a single entity, and to prioritize the normalized value of the entity by setting its label (for instance, to choose between the extraction of `eds.drugs` (label = `drug`) and `eds.umls` (label = `umls`)) without losing information between pipes
- `span1._.value == span2._.value` test runs

For instance, the following (non-exhaustive) modifications should be made:

- date spans will be labeled `date`, to match the `date` extension (instead of `absolute`/`relative`, since this info is already stored in the `span._.date` object)
- measurement spans will be labeled `measurement`, and the previous label (e.g. `eds.weight`) is added to the normalized `span._.measurement` object
- `consultation_dates` spans will become `consultation_date`, and the `consultation_date` extension will be the extracted date
- table spans will be labeled `table`, and `span._.table` is a getter to `span._.to_pd_table(as_values=True)`
- UMLS spans will be labeled `umls` (this is already the case), and `span._.umls` changes to a new `UMLSConcept(id=the cui, sty=the semantic type)` object
- ...

These suggestions have been integrated in #213
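The `UMLSConcept` object mentioned above could be a small pydantic model along these lines (the field types and the example values are assumptions for illustration):

```python
from pydantic import BaseModel

class UMLSConcept(BaseModel):
    """Hypothetical normalized value for eds.umls entities."""

    id: str   # the CUI, e.g. "C0011849"
    sty: str  # the semantic type

concept = UMLSConcept(id="C0011849", sty="Disease or Syndrome")
print(concept.id)  # C0011849
```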
Description
The way we've handled spaCy extensions in EDS-NLP has been erratic at best, with each pipeline declaring its own set of new extensions, cluttering spaCy's `Underscore` object.

For instance, the pipelines `eds.dates`, `eds.measures` and `eds.emergency.priority` all include a parsing component, and each one saves its result in a different attribute.

We can clearly do better than that by adopting a more holistic and uniform approach.
Proposition
To start off the discussion, here are a few ideas/questions:
- An `Object._.edsnlp` master attribute. This would avoid cluttering the extensions, and leave more room for the end users.
- Store each pipeline's parsed output under a single `value` or `parsed` key.
- A `norm` key, containing the normalised variant for a given entity (e.g. stripped of pollution, accents, etc). A reasonable idea is to provide the text used for matching. The normalised variant should perhaps be computed on the fly, using a getter function?
- To handle the `edsnlp` extensions, we could use an `Underscore` object. Not sure if that is overkill and/or incurs significant added complexity.