BeRo1985 / flre

FLRE - Fast Light Regular Expressions - A fast light regular expression library
GNU Lesser General Public License v2.1
94 stars 23 forks source link

Slow unicode classes #24

Open benibela opened 8 years ago

benibela commented 8 years ago

The following expression to check a character for some unicode classes

TFLRE.Create('^[^\p{Nd}\p{Nl}\p{No}\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}]$',[rfUTF8]);

takes 200ms on my computer.

That does not seem normal.

benibela commented 8 years ago

All I was trying to do is to check if a single character is in some Unicode class.

TFLRE.Create('^[\p{Nd}\p{Nl}\p{No}\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}]$',[rfUTF8]);

would work as well, with the check negated at the calling site. However, it is just as slow.

Next I tried '^(\p{Nd}|\p{Nl}|\p{No}|\p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo})$', but surprisingly that is even slower and takes 500ms here

BeRo1985 commented 8 years ago

Hm, it looks that I must optimize TFLREUnicodeCharClass.Invert or something TFLREUnicodeCharClass-related in that direction.

It's queued on my ToDo list after "Implement thread safe dynamic array class into PasMP" (which's nearly done) and "Port the Vulkan headers to Pascal" (a one-night-task for the next night)

benibela commented 8 years ago

But queue it after #10 and #22

pyscripter commented 6 years ago
   StopWatch := TStopWatch.StartNew;
     for I := 1 to 10 do begin
       FLREInstance:=TFLRE.Create('^(\p{Nd}|\p{Nl}|\p{No}|\p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo})$',[rfUTF8]);
       FLREInstance.Destroy;
     end;
   WriteLn(StopWatch.Elapsed.TotalSeconds:10:2);

Delphi 64 bit Fast I7 processor, 2 secs. The generated Regular Expression (by DumpRegularExpression) is 8747 characters long! It contains 221 times \x80 28 times \x85 etc. With rfONLYFASTOPTIMIZATIONS the Regular Expressions is 31024 long (!), but the time drops to 0.6 seconds. A good example of the optimization shrinking the regular expression down by 75%.

with `TFLRE.Create('^[\p{Nd}\p{Nl}\p{No}\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}]$',[rfUTF8]); 0.6 secs.

Still a bit of an issue.

pyscripter commented 6 years ago

I think the first expression is equivalent to: FLREInstance:=TFLRE.Create('^(\p{N}|\p{L})$',[rfUTF8]); which brings time down to 0.6 secs (for 10 iterations) (same size), which I think is OK,
And matching is still quite fast.