identify tables - Githubissues

aricohen93 commented 2 years ago

Suggestion for a new pipeline to detect tables of biological results. This is a first example; let's discuss possible improvements.


from io import StringIO

import pandas as pd
import spacy
from edsnlp import components  # imported so that the edsnlp pipes get registered with spaCy

nlp = spacy.blank("fr")

# A table is one or more consecutive lines that each contain a "|" or "¦" separator
regex = dict(
    tables=[r"(\b.*[|¦].*\n)+"],
)

# Sentencizer component, needed for negation detection
nlp.add_pipe("sentences")
# Matcher component
nlp.add_pipe("matcher", config=dict(regex=regex))

text = """
Le patientqsfqfdf bla bla bla
Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11
Hématies ¦x10*12/L¦4.68 ¦4.53-5.79
Hémoglobine ¦g/dL ¦14.8 ¦13.4-16.7
Hématocrite ¦% ¦44.2 ¦39.2-48.6
VGM ¦fL ¦94.4 + ¦79.6-94
TCMH ¦pg ¦31.6 ¦27.3-32.8
CCMH ¦g/dL ¦33.5 ¦32.4-36.3
Plaquettes ¦x10*9/L ¦191 ¦172-398
VMP ¦fL ¦11.5 + ¦7.4-10.8

qdfsdf

"""

doc = nlp(text)

# Matches for the "tables" regex key are read from doc.spans["tables"]
table_str = doc.spans["tables"][0].text
print(table_str)

# Parse the matched text as a "¦"-separated table
table_str_io = StringIO(table_str)
table_pandas = pd.read_csv(table_str_io, sep="¦", engine="python", header=None)

table_pandas

	0	1	2	3
0	Leucocytes	x10*9/L	4.97	4.09-11
1	Hématies	x10*12/L	4.68	4.53-5.79
2	Hémoglobine	g/dL	14.8	13.4-16.7
3	Hématocrite	%	44.2	39.2-48.6
4	VGM	fL	94.4 +	79.6-94
5	TCMH	pg	31.6	27.3-32.8
6	CCMH	g/dL	33.5	32.4-36.3
7	Plaquettes	x10*9/L	191	172-398
8	VMP	fL	11.5 +	7.4-10.8
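
The cells in the output above are still raw strings: the value column can carry a trailing marker ("94.4 +") and the reference range is a single "low-high" string. As a possible improvement, here is a minimal sketch of turning one row into typed fields. It uses only the standard library and is not part of edsnlp; `LabResult` and `parse_row` are hypothetical names, and the four-column layout (name, unit, value with optional marker, non-negative reference range) is an assumption based on the example text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LabResult:
    name: str
    unit: str
    value: float
    flag: Optional[str]  # marker found after the value, e.g. "+", else None
    low: float
    high: float


def parse_row(line: str, sep: str = "¦") -> LabResult:
    """Parse one 'name ¦unit ¦value [marker] ¦low-high' row (assumed layout)."""
    name, unit, value, ref = (cell.strip() for cell in line.split(sep))
    # The value cell may carry a trailing marker such as "+".
    number, _, flag = value.partition(" ")
    # Assumes non-negative bounds, so the first "-" separates low from high.
    low, _, high = ref.partition("-")
    return LabResult(
        name, unit, float(number), flag.strip() or None, float(low), float(high)
    )


rows = [
    "Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11",
    "VGM ¦fL ¦94.4 + ¦79.6-94",
]
results = [parse_row(row) for row in rows]
for result in results:
    print(result)
```

In the example data the "+" marker appears on the two rows whose value lies above the reference range, but whether that is what the marker means in general is a guess.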

bdura commented 2 years ago

Great idea @aricohen93.

Do you have time to work on it and propose an actual pipeline component? I figure eds.tables, within edsnlp/pipelines/misc?

bdura commented 2 years ago

@aricohen93 have you put more thought into this? I suppose we could add this quite easily, perhaps with an "experimental" status?

Aremaki commented 1 year ago

Great idea, Mr. Cohen! I am eager to implement this pipeline! 👍

ChristelDG commented 1 year ago

Very nice indeed!

I am just testing it, and it works well on examples where I have only one table per file, but I have an issue with files that contain several tables: I only get the first one. Any ideas on how I could handle them with your pipeline?

--> Sorry, in the end I just iterated over doc.spans["tables"] and it worked perfectly.

Many thanks for this tool!
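
For readers with the same question, the idea of handling every detected table rather than only the first can be sketched as follows. This is a stand-in that uses only the standard library: it applies the regex from the first comment with `re.finditer` instead of going through the edsnlp matcher, so it can be tried without edsnlp or pandas installed. `extract_tables` is a hypothetical helper, not an edsnlp function. With the pipeline itself, the equivalent is a loop over `doc.spans["tables"]`.

```python
import re
from typing import List

# Same pattern as in the first comment of this thread.
TABLE_PATTERN = re.compile(r"(\b.*[|¦].*\n)+")
# Either separator accepted by the pattern above.
SEPARATOR = re.compile(r"[|¦]")


def extract_tables(text: str) -> List[List[List[str]]]:
    """Return one list of rows per table; each row is a list of stripped cells."""
    tables = []
    for match in TABLE_PATTERN.finditer(text):
        rows = [
            [cell.strip() for cell in SEPARATOR.split(line)]
            for line in match.group(0).splitlines()
        ]
        tables.append(rows)
    return tables


text = (
    "Compte rendu\n"
    "Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11\n"
    "Hématies ¦x10*12/L¦4.68 ¦4.53-5.79\n"
    "\n"
    "bla bla\n"
    "VGM ¦fL ¦94.4 + ¦79.6-94\n"
    "TCMH ¦pg ¦31.6 ¦27.3-32.8\n"
)

tables = extract_tables(text)
for index, table in enumerate(tables):
    print(f"table {index}: {len(table)} rows")
```

Blocks of consecutive separator lines are kept apart by any line without a separator, so two tables separated by free text come out as two entries.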

aphp / edsnlp

identify tables #9