aphp / edsnlp

Modular, fast NLP framework, compatible with Pytorch and spaCy, offering tailored support for French clinical notes.
https://aphp.github.io/edsnlp/
BSD 3-Clause "New" or "Revised" License

identify tables #9

Closed (aricohen93 closed this issue 1 year ago)

aricohen93 commented 2 years ago

Suggestion for a new pipeline that detects tables of biological results. The example below is a first draft, meant as a starting point for discussing possible improvements.


from io import StringIO

import pandas as pd
import spacy
from edsnlp import components

nlp = spacy.blank("fr")

regex = dict(
    tables=[r"(\b.*[|¦].*\n)+",],
)

# Sentencizer component
nlp.add_pipe("sentences")
# Matcher component
nlp.add_pipe("matcher", config=dict(regex=regex))

text = """
Le patientqsfqfdf bla bla bla
Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11
Hématies ¦x10*12/L¦4.68 ¦4.53-5.79
Hémoglobine ¦g/dL ¦14.8 ¦13.4-16.7
Hématocrite ¦% ¦44.2 ¦39.2-48.6
VGM ¦fL ¦94.4 + ¦79.6-94
TCMH ¦pg ¦31.6 ¦27.3-32.8
CCMH ¦g/dL ¦33.5 ¦32.4-36.3
Plaquettes ¦x10*9/L ¦191 ¦172-398
VMP ¦fL ¦11.5 + ¦7.4-10.8

qdfsdf

"""

doc = nlp(text)

# Text of the first detected table
table_str = doc.spans["tables"][0].text
print(table_str)

table_str_io = StringIO(table_str)

# Parse the "¦"-separated text into a DataFrame (no header row)
table_pandas = pd.read_csv(table_str_io, sep="¦", engine="python", header=None)

table_pandas
   0            1         2       3
0  Leucocytes   x10*9/L   4.97    4.09-11
1  Hématies     x10*12/L  4.68    4.53-5.79
2  Hémoglobine  g/dL      14.8    13.4-16.7
3  Hématocrite  %         44.2    39.2-48.6
4  VGM          fL        94.4 +  79.6-94
5  TCMH         pg        31.6    27.3-32.8
6  CCMH         g/dL      33.5    32.4-36.3
7  Plaquettes   x10*9/L   191     172-398
8  VMP          fL        11.5 +  7.4-10.8

bdura commented 2 years ago

Great idea @aricohen93.

Do you have time to work on it and propose an actual pipeline component? I figure it could be eds.tables, under edsnlp/pipelines/misc?
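
For discussion, the detection step of such a component could be sketched with the standard library alone. This is only an illustration of the regex from the first comment, not the eventual eds.tables implementation: the helper name `find_table_offsets` is hypothetical, and the filler lines "Compte rendu" and "Commentaire libre" are made up for the example.

```python
import re

# Same pattern as in the example above: one or more consecutive lines
# that each contain a "|" or "¦" separator.
TABLE_PATTERN = re.compile(r"(\b.*[|¦].*\n)+")


def find_table_offsets(text):
    """Return (start, end) character offsets of each table-like block.

    Hypothetical helper, for illustration only.
    """
    return [(m.start(), m.end()) for m in TABLE_PATTERN.finditer(text)]


# Two table-like blocks separated by a blank line and some free text.
text = (
    "Compte rendu\n"
    "Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11\n"
    "VGM ¦fL ¦94.4 + ¦79.6-94\n"
    "\n"
    "Commentaire libre\n"
    "TCMH ¦pg ¦31.6 ¦27.3-32.8\n"
)

offsets = find_table_offsets(text)
for start, end in offsets:
    print(repr(text[start:end]))
```

Inside a spaCy component, offsets like these could presumably be turned into spans with `doc.char_span(start, end, label="tables")`, keeping in mind that `char_span` returns `None` when the offsets do not line up with token boundaries.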

bdura commented 2 years ago

@aricohen93 have you put more thought into this? I suppose we could add this quite easily, perhaps with an "experimental" status?

Aremaki commented 1 year ago

Great idea, Mr. Cohen! I am eager to implement this pipeline! 👍

ChristelDG commented 1 year ago

Very nice indeed!

I am just testing it, and it works well on examples where there is a single table per file. However, I have an issue with files that contain several tables: I only get the first one. Any ideas on how I could handle them with your pipeline?

--> Sorry, in the end I just iterated over doc.spans["tables"] and it worked perfectly.
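
For anyone hitting the same question, here is a minimal sketch of that loop. It assumes each detected table is available as a string (the `.text` of a span in doc.spans["tables"]); the strings below are stand-ins taken from the example in the first comment, and `table_to_rows` is a hypothetical helper, not part of the pipeline.

```python
def table_to_rows(table_text, sep="¦"):
    """Split one table's text into rows of whitespace-stripped cells."""
    return [
        [cell.strip() for cell in line.split(sep)]
        for line in table_text.splitlines()
        if line.strip()  # skip empty lines
    ]


# Stand-ins for `span.text` of each span in doc.spans["tables"].
table_texts = [
    "Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11\nHématies ¦x10*12/L¦4.68 ¦4.53-5.79\n",
    "TCMH ¦pg ¦31.6 ¦27.3-32.8\n",
]

# One list of rows per table, instead of only the first table.
all_tables = [table_to_rows(t) for t in table_texts]
for rows in all_tables:
    print(rows)
```

With the pipeline itself, the list comprehension would become `[table_to_rows(span.text) for span in doc.spans["tables"]]`, or the pandas call from the first comment could be applied to each `span.text` in turn.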

Many thanks for this tool!