SMI / IsIdentifiable

A tool for detecting identifiable information in data sources (CSV, DICOM, Relational Database and MongoDB)
GNU General Public License v3.0

Improve spaCy performance #154

Open howff opened 2 years ago

howff commented 2 years ago

As seen in https://github.com/SMI/IsIdentifiable/pull/151#issuecomment-1210728465, the NER component of spaCy has changed from v2 to v3. With the sample test string "We are taking John to Queen Margaret Hospital today." you would expect two or three entities: John, Queen Margaret Hospital, and today. Alternatively, just Queen Margaret or Margaret Hospital would suffice. However, v3 only found "today".

It turns out that the particular test string chosen passed in v2 only by good luck, and failed in v3 only by bad luck.

A test program has been written to try "We are taking X to Y today." for various combinations of X and Y, in both spaCy v2 and v3, and with several language models. The results show that recognition is very sensitive to the particular text of X or Y (actually, X and Y in combination), which is counter-intuitive: we might have expected any X or Y to make grammatical sense, even nonsense names.
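A minimal sketch of that kind of grid test is below. The `extract_entities` function here is a naive placeholder (it just grabs runs of capitalised words) standing in for a real spaCy call such as `[ent.text for ent in nlp(text).ents]`, and the X/Y values are hypothetical; the point is only to show the shape of the harness.

```python
# Sketch of a grid test over "We are taking X to Y today." sentences.
# extract_entities is a placeholder for a real spaCy pipeline, e.g.:
#   nlp = spacy.load("en_core_web_lg")
#   entities = [ent.text for ent in nlp(text).ents]
import re
from itertools import product

PEOPLE = ["John", "Priya", "Xq"]                                  # hypothetical X values
PLACES = ["Queen Margaret Hospital", "Ninewells", "Zzz Clinic"]   # hypothetical Y values

def extract_entities(text):
    """Placeholder NER: returns runs of capitalised words.
    NOT a real model -- swap in a spaCy call here."""
    return re.findall(r"[A-Z][a-z]*(?:\s+[A-Z][a-z]*)*", text)

def run_grid():
    """Run the placeholder NER over every X/Y combination."""
    results = {}
    for x, y in product(PEOPLE, PLACES):
        text = f"We are taking {x} to {y} today."
        results[(x, y)] = extract_entities(text)
    return results

if __name__ == "__main__":
    for (x, y), ents in run_grid().items():
        # Flag combinations where the model misses X or Y entirely
        found_x = any(x in e for e in ents)
        found_y = any(y in e for e in ents)
        print(f"{x!r} / {y!r}: found_x={found_x}, found_y={found_y}")
```

With a real model in place of the placeholder, the interesting output is the combinations where `found_x` or `found_y` is False.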

The best performance now comes from spaCy v3 with the en_core_web_trf language model, although as this is transformer-based it requires additional Python modules including PyTorch, and would run much faster with a GPU. spaCy v2 with the en_core_web_lg model is also quite good, but has some unusual behaviour such as producing the single entity "John to Y Hospital" instead of two separate entities.

We should consider whether to switch to v2 with the lg model, or v3 with the trf model. Either choice would require upgrading the live pipeline in the safe haven.

tznind commented 2 years ago

Thanks for digging into this so thoroughly. Sounds like there are advantages and disadvantages to each, and performance may be heavily dependent on the data being run through it.

Fortunately IsIdentifiable can run multiple versions of NER without recompilation. And we can even run multiple at once if helpful (either in parallel or with ConsensusRule).
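As an illustration of the consensus idea only (IsIdentifiable's actual ConsensusRule is implemented in C#, and its API may differ), two NER backends can be combined so that a span is reported only when both flag it:

```python
# Illustrative sketch, not IsIdentifiable's real ConsensusRule API:
# require agreement between two NER backends before reporting a span.
def consensus(reports_a, reports_b):
    """Return only the spans flagged by both backends, sorted."""
    return sorted(set(reports_a) & set(reports_b))

# Hypothetical outputs from two models on the same text
spacy_v2_ents = {"John", "Queen Margaret Hospital", "today"}
spacy_v3_ents = {"today"}

print(consensus(spacy_v2_ents, spacy_v3_ents))  # only spans both agree on
```

A union (`|` instead of `&`) would give the opposite trade-off: fewer missed identifiers at the cost of more false positives.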

But I'd like to start by documenting how to set up each of those options. If the testing script is problematic, we can always adjust it to expect 1+ classifications, for example.

I'll start by updating my docs PR based on your feedback.

I see that @jas has added the -d option for running with a specific language file (see #153). Maybe we can beef up the script so that the daemon can more easily run with one language file or another. But we need to make sure all dependencies are clearly listed in our docs for new users of the tool.

The docs should be written from the perspective of a new user who just wants to run on some CSVs / DICOMs etc. I've been telling our data analysts how this would simplify validation (currently one analyst has to review 100 datasets extracted as part of GOFUSION / GODARTS, eyeballing the data manually 😮).

jas88 commented 2 years ago

Q: How does this compare to Stanford NER in the Java nerd we use at present? Is there a pressing need to switch to/add spaCy?

rkm commented 2 years ago

This is fairly low priority at the moment seeing as we're not actively using spaCy in production, so let's leave this on the backlog.