aphp / edsnlp

Modular, fast NLP framework, compatible with Pytorch and spaCy, offering tailored support for French clinical notes.
https://aphp.github.io/edsnlp/
BSD 3-Clause "New" or "Revised" License
112 stars 29 forks source link

Feature request: efficient fuzzy matching #38

Closed StanislasDOZIAS closed 2 years ago

StanislasDOZIAS commented 2 years ago

Feature type

Some performance comparisons between different matching algorithms (SimString, FlashText and EDS.Matcher)

Description

SimString

Search with a given measure (for example Cosine Similarity) if a string match another string. It is also the base of QuickUMLS. Quite annoying to use (you have to split the text in order to have a list of query word/expressions).

FlashText

Build a Tree with all the words in the dictionary, and go through it with the query tokens. Built to find exact match. Possibilitie to allow some misspeling through the Levensthein's Distance, already implemented in the GitHub but not in the pip version, so the code is in edsnlp\u tils\flashtext.py.

Global Comparison with SimString, FlashText (with and without a spaCy pipeline) and the EDS.matcher pipeline :

FTspaCy is the FlashText algorithm included in a spaCy pipeline to include all the computation added by it.

Making the number of keywords evolves (we search them in 10k words)

EXACT COMPETITION
Count | SimString | FlashText | FTspaCy | EDSmatcher |
------------------------------------------------------
0     | 0.16901   | 0.01900   | 0.88706 | 0.78106    |
500   | 0.29502   | 0.02600   | 0.83106 | 1.32510    |
1000  | 0.28202   | 0.02800   | 0.78306 | 1.84014    |
1500  | 0.32902   | 0.03000   | 0.83706 | 2.64216    |
2000  | 0.21301   | 0.03200   | 0.81806 | 3.09923    |
2500  | 0.35903   | 0.03700   | 0.76006 | 3.44325    |
FUZZY COMPETITION
Count | SimString | FlashText | FTspaCy |
-----------------------------------------
0     | 1.16013   | 0.05101   | 0.89707 |
500   | 1.04108   | 2.24417   | 2.92714 |
1000  | 1.05508   | 2.32512   | 3.05924 |
1500  | 1.03408   | 3.09474   | 3.92729 |
2000  | 1.21609   | 2.06315   | 2.95722 |
2500  | 1.10708   | 2.16816   | 2.96522 |

Making the number of total words evolves (we search 2587 keywords)

EXACT COMPETITION
Count | SimString | FlashText | FTspaCy | EDSmatcher |
------------------------------------------------------
10000 | 0.39403   | 0.04200   | 1.17209 | 4.73035    |
20000 | 0.59956   | 0.06701   | 1.83514 | 8.09860    |
30000 | 0.71705   | 0.08101   | 2.53819 | 12.29189   |
40000 | 0.94507   | 0.14701   | 4.00055 | 16.79729   |

FUZZY COMPETITION

Count | SimString | FlashText |  FTspaCy  |
-------------------------------------------
10000 | 1.35110   | 2.88821   | 4.91637   |
20000 | 2.21816   | 6.73250   | 7.80158   |
30000 | 2.33017   | 9.09268   | 12.11979  |
40000 | 3.31178   | 11.93089  | 14.07093  |
bdura commented 2 years ago

This would be a nice addition to the EDSNLP stack! @percevalw is currently working on a re-write of the PhraseMatcher, I think we'll wait until this change is ready before tackling fuzzy matching. But this is definitely on our radar, and the comparison helps a lot :)

percevalw commented 2 years ago

Since https://github.com/aphp/edsnlp/pull/43 and https://github.com/aphp/edsnlp/pull/94 have been merged, I think we can close this one.
Future work on this topic will likely focus on speeding up simstring.