SMI / IsIdentifiable

A tool for detecting identifiable information in data sources (CSV, DICOM, Relational Database and MongoDB)
GNU General Public License v3.0

IsIdentifiable / reviewer performance #539

Open tznind opened 3 years ago

tznind commented 3 years ago

We should look into ways to improve performance. Currently there are a lot of exact-match regular expressions, e.g. ^bob$ (in field x). These should be grouped together and simplified into a hash set of strings, so that we can match a value with a single lookup rather than sequentially applying every rule/regex one after the other.
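A rough sketch of the grouping idea, in Python for brevity rather than the project's own code. `RuleIndex` and `_PLAIN_EXACT` are made-up names, and the test for "exact literal" is deliberately narrow (no escape sequences), so treat it as an illustration under those assumptions:

```python
import re
from collections import defaultdict

# Assumption: only ^...$ patterns whose body is plain letters, digits,
# underscores, spaces or hyphens count as exact literals. Escapes are not handled.
_PLAIN_EXACT = re.compile(r"\^([A-Za-z0-9_ -]+)\$")


class RuleIndex:
    """Hypothetical per-field index of ignore rules."""

    def __init__(self, rules):
        # rules: iterable of (field, pattern) pairs
        self.literals = defaultdict(set)   # field -> exact values (one hash lookup)
        self.regexes = defaultdict(list)   # field -> regexes still applied in turn
        for field, pattern in rules:
            m = _PLAIN_EXACT.fullmatch(pattern)
            if m:
                self.literals[field].add(m.group(1))
            else:
                self.regexes[field].append(re.compile(pattern))

    def matches(self, field, value):
        # Fast path first, then fall back to the remaining regexes for this field.
        if value in self.literals.get(field, ()):
            return True
        return any(r.search(value) for r in self.regexes.get(field, ()))
```

One behavioural difference to watch: in Python, `$` also matches just before a trailing newline, so the set lookup is slightly stricter than the regex it replaces.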

jas88 commented 3 years ago

There's a RegexDict package heading for NuGet soon that should help there - conceptually a sort of Dictionary<Regex,T> that knows to turn "^foo$" into a simple dictionary entry internally, plus some other tweaks. Probably a week or so from release.

tznind commented 3 years ago

Cool. I think we have a lot of patterns that are generated just with Regex.Escape(...), so it might also need to deal with "\ " (literal space), "\(", "\." etc. (if possible). I can pull a sample from the ignore rules if that would help.

jas88 commented 3 years ago

Samples would be handy. I'm starting with the simplest three cases, ^foo$, ^foo and foo$ (dictionary, prefix tree and suffix tree respectively), with and without case sensitivity; next up is probably turning alternates like (foo|bar) into two entries for foo and bar. It's easy to extend support later, so I'll probably focus on the basic cases first.
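To make the three cases concrete, here is a minimal Python sketch. It is not the RegexDict implementation: `Trie` and `AnchoredMatcher` are hypothetical names, the suffix case is handled by storing reversed literals in a second trie, case-insensitivity is approximated with `str.lower()`, and only top-level alternates like (foo|bar) are split.

```python
_META = set("\\.^$*+?()[]{}|")  # characters that make a body non-literal


class Trie:
    def __init__(self):
        self.children = {}
        self.terminal = False

    def add(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.terminal = True

    def has_prefix_of(self, text):
        """True if any stored word is a prefix of text."""
        node = self
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return False
            if node.terminal:
                return True
        return False


class AnchoredMatcher:
    def __init__(self, ignore_case=False):
        self.ignore_case = ignore_case
        self.exact = set()       # ^foo$
        self.prefixes = Trie()   # ^foo
        self.suffixes = Trie()   # foo$, stored reversed

    def _norm(self, s):
        return s.lower() if self.ignore_case else s

    def add(self, pattern):
        starts, ends = pattern.startswith("^"), pattern.endswith("$")
        if not (starts or ends):
            raise ValueError("unanchored pattern, keep it as a regex")
        body = pattern[int(starts):len(pattern) - int(ends)]
        # Split a single top-level group such as (foo|bar) into separate entries.
        if body.startswith("(") and body.endswith(")"):
            alternatives = body[1:-1].split("|")
        else:
            alternatives = [body]
        if any(not a or any(ch in _META for ch in a) for a in alternatives):
            raise ValueError("not a plain literal, keep it as a regex")
        for alt in alternatives:
            word = self._norm(alt)
            if starts and ends:
                self.exact.add(word)
            elif starts:
                self.prefixes.add(word)
            else:
                self.suffixes.add(word[::-1])

    def matches(self, value):
        v = self._norm(value)
        return (v in self.exact
                or self.prefixes.has_prefix_of(v)
                or self.suffixes.has_prefix_of(v[::-1]))
```

Patterns that raise `ValueError` here would simply stay in the ordinary regex list, so support can be widened one case at a time.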

tznind commented 3 years ago

Looks like you can use Regex.Unescape(String) to turn the escaped patterns back into plain strings.

Here are some samples:

^MR\ -\ LWS$
^PELVIS\^BG\ -\ Bony\ Pelvis$
^MRI\ BRAIN\ \ WITH\ GENERAL\ ANAESTHETIC$
^MR\ GINOCCHIO\ DX$
^CNPM$
^0000009510$
^MABDO\nMWSPN$
^MLSPC\nMAPEL$
^MHACR\nMHANR$
^MLLCR\nMLOLR$
^XZSCANCDFR$
^MCOWI\nMSKCH$
^MCOWI$
^MCVVS\nMCORV$
^MTHCR\nMTHIR$
^MSKCH\nMNECK$
^MR\ SLDI\ SPINE$
^MCERA$
^MWRCR\nMWRIR$
^MJHHIR$
^MREV$
^MRST$
^MCHEC\nMCHES$
^ZMSKSPNC$
^MCSPC\nMSCTH$
^MCSPN\nMSCTHC$
^MJHSHR$
^MRGK$
^MSHLR\nMELBR$
^ZFNMRN$
^MNECK\nMNECC$
^MAPEL\nMAPEL$
^MADRC$
^MHACL\nMHACR$
^ITSA$
^ZMSKSPN$
^MHCM$
^MRIHBMT$
^DICOM$
^MRIBR$
^MRABDOMEN$
^MRNC$
^MR\ pelvis$
^LSWO$
^IAVC$
^MRI\ KIDN$
^MRKO$
^8CH\ POST\ OP\ FOLL$
^BRAINSPINE$
^NASH_COMPLETE_PR$
^CBT_V11$
^MR_\ BEYIN$
^PELVISLOWEREXTR$
^IRM\ RETROPERITOI$
^8CH\ 72HR\ WYETH\ S$
^8CH\ 30DAY\ WYETH$
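
For patterns shaped like the samples above, the literal can be recovered by undoing the escapes and rejecting anything that still contains a real regex construct. The sketch below is Python and hand-rolled (it does not call Regex.Unescape); `literal_from_escaped` is a hypothetical helper, and the escape handling is an assumption based only on the forms seen in these samples ("\ ", "\^", "\n") plus the obvious whitespace siblings.

```python
_CONTROL = {"n": "\n", "r": "\r", "t": "\t", "f": "\f"}
_META = set(".^$*+?()[{|")  # unescaped, these mean the pattern is not a literal


def literal_from_escaped(pattern):
    """Return the plain string matched by an escaped ^...$ pattern, or None."""
    if len(pattern) < 2 or not (pattern.startswith("^") and pattern.endswith("$")):
        return None
    body = pattern[1:-1]
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            if i + 1 >= len(body):
                return None          # dangling backslash, e.g. ^foo\$ (escaped $)
            nxt = body[i + 1]
            if nxt in _CONTROL:
                out.append(_CONTROL[nxt])
            elif nxt.isalnum():
                return None          # \d, \w, \b, backreferences, etc.
            else:
                out.append(nxt)      # escaped punctuation or space
            i += 2
        elif ch in _META:
            return None
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(repr(literal_from_escaped(r"^MR\ -\ LWS$")))        # 'MR - LWS'
print(repr(literal_from_escaped(r"^PELVIS\^BG\ -\ Bony\ Pelvis$")))  # 'PELVIS^BG - Bony Pelvis'
print(repr(literal_from_escaped(r"^MABDO\nMWSPN$")))      # 'MABDO\nMWSPN'
print(repr(literal_from_escaped(r"^MR\d+$")))             # None
```

Anything that comes back as `None` stays a regex; everything else can go straight into the exact-match set.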