Closed lapp0 closed 10 months ago
I had previously considered these changes IIRC, but rejected them for some reason. Did you check how this performs on many small regex?
This PR:
name: ssn
map size: 12
time to reduce: 0.00013375282287597656
name: time
map size: 13
time to reduce: 0.00027179718017578125
name: email
map size: 15
time to reduce: 0.00016760826110839844
name: simple_phone
map size: 17
time to reduce: 0.0003566741943359375
name: ip
map size: 40
time to reduce: 0.0005748271942138672
name: date
map size: 80
time to reduce: 0.0013129711151123047
name: complex_phone
map size: 118
time to reduce: 0.0036478042602539062
name: url
map size: 148
time to reduce: 0.0022287368774414062
Master:
name: ssn
map size: 12
time to reduce: 0.00022172927856445312
name: time
map size: 13
time to reduce: 0.0007364749908447266
name: email
map size: 15
time to reduce: 0.0005176067352294922
name: simple_phone
map size: 17
time to reduce: 0.002007007598876953
name: ip
map size: 40
time to reduce: 0.004317283630371094
name: date
map size: 80
time to reduce: 0.024528980255126953
name: complex_phone
map size: 118
time to reduce: 0.15566778182983398
name: url
map size: 148
time to reduce: 0.13695859909057617
Code:
import interegular
import time
regex_samples = {
"email": r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?",
"complex_phone": "\\+?\\d{1,4}?[-.\\s]?\\(?\\d{1,3}?\\)?[-.\\s]?\\d{1,4}[-.\\s]?\\d{1,4}[-.\\s]?\\d{1,9}",
"simple_phone": "\\+?[1-9][0-9]{7,14}",
"date": "([1-9]|0[1-9]|1[0-9]|2[0-9]|3[0-1])(\.|-|/)([1-9]|0[1-9]|1[0-2])(\.|-|/)([0-9][0-9]|19[0-9][0-9]|20[0-9][0-9])|([0-9][0-9]|19[0-9][0-9]|20[0-9][0-9])(\.|-|/)([1-9]|0[1-9]|1[0-2])(\.|-|/)([1-9]|0[1-9]|1[0-9]|2[0-9]|3[0-1])",
"time": r"(0?[1-9]|1[0-2]):[0-5]\d\s?(am|pm)?",
"ip": r"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)",
"url": r"(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?",
"ssn": r"\d{3}-\d{2}-\d{4}",
}
def time_reduce(fsm):
start = time.time()
fsm.reduce()
return time.time() - start
def main():
fsms = {
name: interegular.parse_pattern(pattern).to_fsm()
for name, pattern in regex_samples.items()
}
for name, fsm in sorted(fsms.items(), key=lambda x: len(x[1].map)):
print()
print("name:", name)
print("map size:", len(fsm.map))
print("time to reduce:", time_reduce(fsm))
Thank you! You might also want to contribute this to greenery if they don't have this already. I have also been planning to switch to using the greenery FSM module and depending on that library, so this optimization might go away again. (They added a different pretty major optimization that I would like to integrate)
Will do. Have you considered integrating your extended parsing functionality into greenery? Or is it not possible to do so without breaking fsm-to-pattern reversibility?
greenery is not interested in the improvements changes I made, and yes, it definitely isn't reversible, however that by itself might not be a problem.
What can I do to help get this version on pypi @MegaIng ?
Uploaded as 0.3.3
https://github.com/MegaIng/interegular/issues/4