CogStack / MedCAT

Medical Concept Annotation Tool
Other
456 stars 105 forks source link

CU-8695ucw9b deid transformers fix #490

Closed mart-r closed 1 month ago

mart-r commented 2 months ago

Since transformers==4.42.0, the tokenizer being loaded is expected to have the split_special_tokens attribute. However, the version we've saved (and load) won't have that attribute. So the processing fails (an exceptions is raised). This failure (along with the exception) is logged. But the overall process is never halted.

What this PR does:

tomolopolis commented 2 months ago

Task linked: CU-8695ucw9b Fix issue with DeID models and transformers 4.42+

mart-r commented 2 months ago

why not just raise the exception and halt the process?

My best guess is that at some point in the past, something in the try-except block was raising an exception due to something that was unable to be fixed (perhaps some kind of undefined characters that caused spacy to fail? I don't know) and thus all the exceptions were being caught to allow the rest of the pipeline to still keep working.

The reason I left in the massive try-except block is because there's potentially users that rely on it.

Though in principle, I agree. The exception should be raised since otherwise (most likely) the DeID pipe will most likely simply not do anything. And personal information will be revealed where it shouldn't.

I will do the following:

And then I'll report back

mart-r commented 2 months ago

Went through the DeID of a document that had some characters in the middle of the target text that could potentially cause issues.

And the result was:

As such, I don't really see any reason for the try-except to exist and I removed it. So next time we support a transformers version that doesn't work with the saved model, we'll see the crash - and reason for it - immediately.