MegaIng / interegular

Allows to check regexes for overlaps. Based on greenery by @qntm.
MIT License
42 stars 6 forks source link

speed up reversed() via reverse-transition-map #5

Closed lapp0 closed 10 months ago

lapp0 commented 10 months ago

https://github.com/MegaIng/interegular/issues/4

MegaIng commented 10 months ago

I had previously considered these changes IIRC, but rejected them for some reason. Did you check how this performs on many small regex?

lapp0 commented 10 months ago

This PR:

name: ssn
map size: 12
time to reduce: 0.00013375282287597656

name: time
map size: 13
time to reduce: 0.00027179718017578125

name: email
map size: 15
time to reduce: 0.00016760826110839844

name: simple_phone
map size: 17
time to reduce: 0.0003566741943359375

name: ip
map size: 40
time to reduce: 0.0005748271942138672

name: date
map size: 80
time to reduce: 0.0013129711151123047

name: complex_phone
map size: 118
time to reduce: 0.0036478042602539062

name: url
map size: 148
time to reduce: 0.0022287368774414062

Master:

name: ssn
map size: 12
time to reduce: 0.00022172927856445312

name: time
map size: 13
time to reduce: 0.0007364749908447266

name: email
map size: 15
time to reduce: 0.0005176067352294922

name: simple_phone
map size: 17
time to reduce: 0.002007007598876953

name: ip
map size: 40
time to reduce: 0.004317283630371094

name: date
map size: 80
time to reduce: 0.024528980255126953

name: complex_phone
map size: 118
time to reduce: 0.15566778182983398

name: url
map size: 148
time to reduce: 0.13695859909057617

Code:

import interegular
import time

regex_samples = {
    "email": r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?",
    "complex_phone": "\\+?\\d{1,4}?[-.\\s]?\\(?\\d{1,3}?\\)?[-.\\s]?\\d{1,4}[-.\\s]?\\d{1,4}[-.\\s]?\\d{1,9}",
    "simple_phone": "\\+?[1-9][0-9]{7,14}",
    "date": "([1-9]|0[1-9]|1[0-9]|2[0-9]|3[0-1])(\.|-|/)([1-9]|0[1-9]|1[0-2])(\.|-|/)([0-9][0-9]|19[0-9][0-9]|20[0-9][0-9])|([0-9][0-9]|19[0-9][0-9]|20[0-9][0-9])(\.|-|/)([1-9]|0[1-9]|1[0-2])(\.|-|/)([1-9]|0[1-9]|1[0-9]|2[0-9]|3[0-1])",
    "time": r"(0?[1-9]|1[0-2]):[0-5]\d\s?(am|pm)?",
    "ip": r"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)",
    "url": r"(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?",
    "ssn": r"\d{3}-\d{2}-\d{4}",
}

def time_reduce(fsm):
    start = time.time()
    fsm.reduce()
    return time.time() - start

def main():
    fsms = {
        name: interegular.parse_pattern(pattern).to_fsm()
        for name, pattern in regex_samples.items()
    }
    for name, fsm in sorted(fsms.items(), key=lambda x: len(x[1].map)):
        print()
        print("name:", name)
        print("map size:", len(fsm.map))
        print("time to reduce:", time_reduce(fsm))
MegaIng commented 10 months ago

Thank you! You might also want to contribute this to greenery if they don't have this already. I have also been planning to switch to using the greenery FSM module and depending on that library, so this optimization might go away again. (They added a different pretty major optimization that I would like to integrate)

lapp0 commented 10 months ago

Will do. Have you considered integrating your extended parsing functionality into greenery? Or is it not possible to do so without breaking fsm-to-pattern reversibility?

MegaIng commented 10 months ago

greenery is not interested in the improvements changes I made, and yes, it definitely isn't reversible, however that by itself might not be a problem.

lapp0 commented 10 months ago

What can I do to help get this version on pypi @MegaIng ?

MegaIng commented 10 months ago

Uploaded as 0.3.3