aphp / edsnlp

Modular, fast NLP framework, compatible with Pytorch and spaCy, offering tailored support for French clinical notes.
https://aphp.github.io/edsnlp/
BSD 3-Clause "New" or "Revised" License

Adicap : enhancement of regex to match local spelling #158

Closed GuillaumePressiat closed 1 year ago

GuillaumePressiat commented 2 years ago

Description

In my hospital (CHU de Brest), ADICAP codes are written like this:


ADICAP :B.H.HP.A7A0

Cotations :
ZZQX217      R-AHC-100-A001 R-AHC-10-A015

Here the dots spell out the ADICAP structure, separating the parts of the code that map to the dictionaries (d1 to d8).

The regex in your ADICAP NER component does not allow for dots.

Are you ok if I propose this modified regex?

It just adds three optional dots (\.{0,1}) to d1_4:

d1_4 = r"[A-Z]\.{0,1}[A-Z]\.{0,1}[A-Z]{2}\.{0,1}"
d5_8_v1 = r"\d{4}"
d5_8_v2 = r"\d{4}|[A-Z][0-9A-Z][A-Z][0-9]"
d5_8_v3 = r"[0-9A-Z][0-9][09A-Z][0-9]"
d5_8_v4 = r"0[A-Z][0-9]{2}"

adicap_prefix = r"(?i)(codification|adicap)"
base_code = (
    r"("
    + d1_4
    + r"(?:"
    + d5_8_v1
    + r"|"
    + d5_8_v2
    + r"|"
    + d5_8_v3
    + r"|"
    + d5_8_v4
    + r"))"
)
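
As a quick sanity check, here is a minimal sketch that runs the patterns above through Python's re module on both spellings. The helper name find_codes is hypothetical and not part of edsnlp; the patterns are copied from this comment as-is.

```python
import re

# Patterns copied unchanged from the proposal above
d1_4 = r"[A-Z]\.{0,1}[A-Z]\.{0,1}[A-Z]{2}\.{0,1}"
d5_8_v1 = r"\d{4}"
d5_8_v2 = r"\d{4}|[A-Z][0-9A-Z][A-Z][0-9]"
d5_8_v3 = r"[0-9A-Z][0-9][09A-Z][0-9]"
d5_8_v4 = r"0[A-Z][0-9]{2}"

base_code = (
    r"("
    + d1_4
    + r"(?:"
    + "|".join([d5_8_v1, d5_8_v2, d5_8_v3, d5_8_v4])
    + r"))"
)


def find_codes(text):
    """Hypothetical helper: return every substring matched by base_code."""
    return [match.group(1) for match in re.finditer(base_code, text)]


print(find_codes("ADICAP :B.H.HP.A7A0"))  # dotted spelling
print(find_codes("ADICAP : BHHPA7A0"))    # spelling without dots
```

This only exercises the regex itself; it says nothing about how the rest of the pipeline (tokenization, sentence splitting) treats the dotted form.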

Test:

image

Many thanks

percevalw commented 2 years ago

Hi @GuillaumePressiat, thanks for this feedback! Of course, feel free to make a PR to improve this pattern!

GuillaumePressiat commented 2 years ago

Hi @percevalw, thanks for your answer. I will do it a bit later.

Nice module, by the way. (I was at "DIM siege AP-HP" when the nlp-segmenter and UIMA pipelines were developed (@parisni and co.), and I tried to contribute a few improvements to the section definitions by editing the sections.csv file.)

percevalw commented 2 years ago

I see! The eds.section extraction module was partly inspired by earlier work at APHP's EDS, maybe you can find some of your contributions there :)

etienneguevel commented 1 year ago

Hi @GuillaumePressiat! Thanks for the feedback :) I think the regex you mentioned should do the trick, but the newly detected ADICAP codes will not be decoded correctly. Modifying the decode function in the ADICAP class like this:

def decode(self, code):
    # Strip the optional dots so that "B.H.HP.A7A0" decodes like "BHHPA7A0"
    code = code.replace(".", "")
    exploded = list(code)
    adicap = AdicapCode(
        code=code,
        sampling_mode=self.decode_dict["D1"]["codes"].get(exploded[0]),
        technic=self.decode_dict["D2"]["codes"].get(exploded[1]),
        organ=self.decode_dict["D3"]["codes"].get("".join(exploded[2:4])),
    )

    for d in ["D4", "D5", "D6", "D7"]:
        # Try the 4-character key first, then the 6-character key
        adicap_short = self.decode_dict[d]["codes"].get("".join(exploded[4:8]))
        adicap_long = self.decode_dict[d]["codes"].get("".join(exploded[2:8]))

        if (adicap_short is not None) or (adicap_long is not None):
            adicap.pathology = self.decode_dict[d]["label"]
            adicap.behaviour_type = self.decode_dict[d]["codes"].get(exploded[5])

            if adicap_short is not None:
                adicap.pathology_type = adicap_short
            else:
                adicap.pathology_type = adicap_long

    return adicap

should solve this issue!
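
To illustrate why the dots have to go before decoding, here is a small standalone sketch of the slicing that decode relies on. AdicapParts and split_adicap are hypothetical names, not edsnlp APIs; the sketch only mirrors the index ranges used above (position 0, position 1, positions 2 to 3, positions 4 to 7) and performs no dictionary lookup.

```python
from typing import NamedTuple


class AdicapParts(NamedTuple):
    """Hypothetical container mirroring the slices used in decode."""

    sampling_mode: str   # exploded[0]
    technic: str         # exploded[1]
    organ: str           # exploded[2:4]
    pathology_key: str   # exploded[4:8]


def split_adicap(code: str) -> AdicapParts:
    """Strip dots, then slice the 8-character code into its parts."""
    normalized = code.replace(".", "")
    if len(normalized) != 8:
        raise ValueError(f"expected 8 characters after removing dots, got {normalized!r}")
    return AdicapParts(
        normalized[0], normalized[1], normalized[2:4], normalized[4:8]
    )


print(split_adicap("B.H.HP.A7A0"))
print("B.H.HP.A7A0"[2:4])  # without normalization the organ slice is off
```

Without the replace call, the fixed index ranges would land on the dots themselves, so the dictionary lookups would silently return None.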

GuillaumePressiat commented 1 year ago

Hello @etienneguevel,

thanks for the tip!

I've modified the two scripts here (patterns.py and adicap.py), then installed edsnlp as described in the docs. When I try to detect "Codification : B.H.HP.A7A0", for instance, it doesn't work yet.

image

So far I can't see where the problem is. Could somebody take a look and help me?

Many thanks

etienneguevel commented 1 year ago

Hello @GuillaumePressiat, You're welcome!

I looked into why your modifications didn't give the expected results, and found a conflict between the component used by the ADICAP pipeline (eds.contextual-matcher) and the way the edsnlp sentencizer splits ADICAP codes such as "B.H.HP.A7A0".

The model used looks like this:

import spacy

# base_code is the ADICAP code pattern built in patterns.py (see the first comment)
info = dict(
    source="adicap",
    regex=r"(?i)(codification|adicap)",
    regex_attr="TEXT",
    assign=[
        dict(
            name="code",
            regex=base_code,
            window=(-100, 100),
            replace_entity=True,
            reduce_mode=None,
        ),
    ],
)

nlp = spacy.blank("eds")
nlp.add_pipe("eds.normalizer")
nlp.add_pipe("eds.sentences")
nlp.add_pipe(
    "eds.contextual-matcher",
    name="adicap",
    config=dict(patterns=[info]),
)

# With dots, no entity is found
print(nlp("Codification : B.H.HP.A7A0").ents)
# ()

# Without dots, the code is found
print(nlp("ADICAP: B.H.HP.A7A0".replace(".", "")).ents)
# (BHHPA7A0,)

I've opened an issue describing how the sentencizer deals with codes like your ADICAP example: https://github.com/aphp/edsnlp/issues/178

The eds.sentences pipeline is currently being reworked (https://github.com/aphp/edsnlp/pull/177), and that change should stop ADICAP codes from being split across several sentences.
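
The effect can be pictured with a toy example. The sketch below is not the edsnlp sentencizer: naive_sentences is a hypothetical stand-in that cuts after every dot, and CODE is a simplified version of the pattern from this thread. It assumes the code regex is only searched inside one sentence at a time, which is how the problem is described above.

```python
import re

# Simplified ADICAP pattern (optional dots plus two of the d5-d8 variants)
CODE = re.compile(r"[A-Z]\.?[A-Z]\.?[A-Z]{2}\.?(?:\d{4}|[A-Z][0-9A-Z][A-Z][0-9])")


def naive_sentences(text):
    """Toy stand-in for a sentencizer: cut the text after every dot."""
    return re.findall(r"[^.]*\.|[^.]+$", text)


text = "Codification : B.H.HP.A7A0"
pieces = naive_sentences(text)
print(pieces)

# The full text contains the code, but no single piece does
print(CODE.search(text) is not None)
print(any(CODE.search(piece) for piece in pieces))
```

Once the code is spread over several pieces, no pattern applied piece by piece can recover it, which is why changing the regex alone was not enough.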

GuillaumePressiat commented 1 year ago

Hello @etienneguevel,

Thank you for the feedback! It's quite logical indeed.

For now I've just removed all dots in my anapath documents and the basic eds.adicap pipeline works just fine.
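
A narrower variant of this workaround would be to remove dots only inside strings that look like ADICAP codes, leaving ordinary sentence punctuation alone. The sketch below is an assumption-laden illustration rather than what was actually run: strip_code_dots is a hypothetical helper, and the pattern is the one proposed at the top of this thread, copied as-is.

```python
import re

# Dotted ADICAP pattern from this thread (\.? is equivalent to \.{0,1})
DOTTED_CODE = re.compile(
    r"[A-Z]\.?[A-Z]\.?[A-Z]{2}\.?"
    r"(?:\d{4}|[A-Z][0-9A-Z][A-Z][0-9]|[0-9A-Z][0-9][09A-Z][0-9]|0[A-Z][0-9]{2})"
)


def strip_code_dots(text):
    """Hypothetical helper: drop dots inside ADICAP-like codes only."""
    return DOTTED_CODE.sub(lambda m: m.group(0).replace(".", ""), text)


print(strip_code_dots("ADICAP :B.H.HP.A7A0. Fin du compte rendu."))
```

Two caveats: the pattern can also match other uppercase strings followed by digits, and removing characters shifts the offsets of everything after the code, which matters if entities have to be mapped back to the original text.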

Thanks for the other issue related to this (eds.sentences cutting codes in different sentences). I will follow this!

Guillaume

percevalw commented 1 year ago

Hi @GuillaumePressiat, the ADICAP matcher should now work (in the master branch) without having to modify the text upstream. Please let us know if you still have issues with this component ! :)

GuillaumePressiat commented 1 year ago

Hi @percevalw,

It's ok now!

[Screenshot 2023-03-07 at 20 40 03]

Thank you very much!

Guillaume