Available for as a plugin for Visual Studio Code; and as libraries for Python, JavaScript, Typescript, and Rust.
Regular Expressions are one of the best ideas in the field of software. However, their $#%! syntax is an accident from 1968. Kleene Expressions (after mathematician Stephen Kleene who discovered regex) are a drop-in replacement syntax that compiles to languages' native regex libraries, promising full bug-for-bug API compatibility.
Now 100% less painful to migrate! (You heard that right: migration is not painful at all.)
.
and ()
.)pip install kleenexp
Now just write ke
wherever you used to write re
:
import ke
username = input('Choose a username:')
# if not re.match(r'[A-Za-z][A-Za-z\d]*$', password):
if not ke.match('[#letter [0+ [#letter | #digit]] #end_string]', username):
print("Invalid username")
else:
password = input('Enter a new password:')
# if re.match(r'\A(?=[^a-z]*[a-z])(?=[^A-Z]*[A-Z])(?=\D*(\d))(?!.*(?:123|pass|Pass))\w{6,}\Z', password):
if not ke.match('''[
#has_lower=[lookahead [0+ not #lowercase] #lowercase]
#has_upper=[lookahead [0+ not #uppercase] #uppercase]
#has_digit=[lookahead [0+ not #digit] [capture #digit]]
#no_common=[not lookahead [0+ #any] ["123" | "pass" | "Pass"]]
#start_string #has_lower #has_upper #has_digit #no_common [6+ #token_character] #end_string
]''', password):
print("Password should have at least one uppercase letter, one lowercase, one digit.")
print("And nothing obvious like '123'")
else:
...
Be sure to read the tutorial below!
Kleenexp:
Hello. My name is Inigo Montoya. You killed my Father. Prepare to die.
Regex:
Hello\. My name is Inigo Montoya\. You killed my Father\. Prepare to die\.
Kleenexp:
[1-3 'What is your ' ['name' | 'quest' | 'favourite colour'] '?' [0-1 #space]]
Regex:
(?:What is your (?:name|quest|favourite colour)\?)\s?){1,3}
Kleenexp:
Hello. My name is [capture:name #tmp ' ' #tmp #tmp=[#uppercase [1+ #lowercase]]]. You killed my ['Father' | 'Mother' | 'Hamster']. Prepare to die.
Regex:
Hello\. My name is (?<name>[A-Z][a-z]+ [A-Z][a-z]+)\. You killed my (?:Father|Mother|Hamster)\. Prepare to die\.`
Or, if you're in a hurry, you can use the shortened form:
Hello. My name is [c:name#uc[1+#lc]' '#uc[1+#lc]]. You killed my ['Father'|'Mother'|'Hamster']. Prepare to die.
(And when you're done, you can use our automatic tool -- in development -- to convert the short Kleenexp to the more readable version, and commit that instead.)
( Print )
More on the syntax, additional examples, and the design criteria that led to its design, below.
This is not a toy project meant to prove a technological point. This is a serious attempt to fix something that is broken in the software ecosystem and has been broken since before we were born. We have experience running R&D departments, we understand how technology decisions are made, and we realise success here hinges more on "growth hacking" and on having a painless and risk-free migration story than it does on technical excellence.
Step 1 is to introduce Kleenexp to the early adopter developer segment by releasing great plugins for popular text editors like Visual Studio Code, with better UX (syntax highlighting, autocompletion, good error messages, ...) and a great tutorial. Adopting a new regex syntax for your text editor is low-risk and requires no coordination between stakeholders.
Step 2 is to aim at hobbyist software projects by making our JavaScript adoption story as painless and risk-free as possible (since JavaScript has the most early-adopting and fast-moving culture). In addition to a runtime drop-in syntax adapter, we will write a Babel plugin that translates Kleenexp syntax into legacy regex syntax at compile time, to enable zero-overhead usage.
Step 3 is to aim at startups by optimizing and testing the implementations until they're ready for deployment in big-league production scenarios.
Step 4 is to make it possible for large legacy codebases to switch by releasing tools that automatically convert a codebase from legacy syntax to Kleenexp (like python's 2to3
or AirBnB's ts-migrate
)
pip install
-able, usable in Python projects with import ke as re
and in grep
with grep -P `ke "pattern"`
Babel
plugin enables JavaScript projects to use the new syntax with 0 runtime overhead2to3
-based tool to automatically migrate Python projects to use new syntax, similar tool for JavaScriptKleene Expressions are named after mathematician Stephen Kleene who invented regular expressions.
Wikipedia says:
Although his last name is commonly pronounced /ˈkliːni/ KLEE-nee or /kliːn/ KLEEN, Kleene himself pronounced it /ˈkleɪni/ KLAY-nee. His son, Ken Kleene, wrote: "As far as I am aware this pronunciation is incorrect in all known languages. I believe that this novel pronunciation was invented by my father."
However, with apologies to the late Dr. Kleene, "Kleene expressions" is pronounced "Clean expressions" and not "Klein expressions."
Removing parenthesis:
import ke
def remove_parentheses(line):
if ke.search("[#open=['('] #close=[')'] #open [0+ not #close] #open]", line):
raise ValueError()
return ke.sub("[ '(' [0+ not ')'] ')' ]", '', line)
assert remove_parentheses('a(b)c(d)e') == 'ace'
The original with regex is from a hackathon project I participated in, and looks like this:
import re
def remove_parentheses(line):
if re.search(r'\([^)]*\(', line):
raise ValueError()
return re.sub(r'\([^)]*\)', '', line)
assert remove_parentheses('a(b)c(d)e') == 'ace'
For replacement with sub()
, the syntax for the replacement is the same as for regexes.
import ke
assert ke.sub("[[capture '.' [6 #digit] ] [0+ #digit] ]", r"\1", "3.14159265359") == "3.141592"
assert ke.sub("Hi [capture:name 1+ #letter]!", r"\g<name> \g<name>!", "Hi Bobby!") == "Bobby Bobby!"
Another example, rewriting paths in Django:
import ke
from django.urls import path, re_path
from . import views
urlpatterns = [
path('articles/2003/', views.special_case_2003),
re_path(ke.re("[#start_line]articles/[capture:year 4 #digit]/[#end_line]"), views.year_archive),
re_path(ke.re("[#start_line]articles/[capture:year 4 #digit]/[capture:month 2 #digit]/[#end_line]"),views.month_archive),
re_path(ke.re(
"[#start_line]articles/[capture:year 4 #digit]/[capture:month 2 #digit]/[capture:slug 1+ [#letter | #digit | '_' | '-']]/[#end_line]"), views.article_detail),
]
The original with regex is taken from Django documentation and looks like this:
from django.urls import path, re_path
from . import views
urlpatterns = [
path('articles/2003/', views.special_case_2003),
re_path(r'^articles/(?P<year>[0-9]{4})/$', views.year_archive),
re_path(r'^articles/(?P<year>[0-9]{4})/(?P<month>[0-9]{2})/$', views.month_archive),
re_path(r'^articles/(?P<year>[0-9]{4})/(?P<month>[0-9]{2})/(?P<slug>[\w-]+)/$', views.article_detail),
]
Anything outside of brackets is a literal:
This is a (short) literal :-)
You can use macros like #digit
(short: #d
) or #any
(#a
):
This is a [#lowercase #lc #lc #lc] regex :-)
You can repeat with n
, n+
or n-m
:
This is a [1+ #lc] regex :-)
If you want either of several options, use |
:
This is a ['Happy' | 'Short' | 'readable'] regex :-)
Capture with [capture <Kleenexp>]
(short: [c <Kleenexp>]
, named capture group: [c:name <Kleenexp>]
):
This is a [capture:adjective 1+ [#letter | ' ' | ',']] regex :-)
Reverse a pattern that matches a single character with not
:
[#start_line [0+ #space] [not ['-' | #digit | #space]] [0+ not #space]]
Define your own macros with #name=[<regex>]
:
This is a [#trochee #trochee #trochee] regex :-)[
[comment 'see xkcd 856']
#trochee=['Robot' | 'Ninja' | 'Pirate' | 'Doctor' | 'Laser' | 'Monkey']
]
Lookahead and lookbehind:
[#start_string
[lookahead [0+ #any] #lowercase]
[lookahead [0+ #any] #uppercase]
[lookahead [0+ #any] #digit]
[not lookahead [0+ #any] ["123" | "pass" | "Pass"]]
[6+ #token]
#end_string
]
[")" [not lookbehind "()"]]
Add comments with the comment
operator:
[[comment "Custom macros can help document intent"]
#has_lower=[lookahead [0+ not #lowercase] #lowercase]
#has_upper=[lookahead [0+ not #uppercase] #uppercase]
#has_digit=[lookahead [0+ not #digit] [capture #digit]]
#no_common=[not lookahead [0+ #any] ["123" | "pass" | "Pass"]]
#start_string #has_lower #has_upper #has_digit #no_common [6+ #token_character] #end_string
]
( Print )
Some macros you can use:
Long Name | Short Name | Definition* | Notes |
---|---|---|---|
#any | #a | /./ |
May or may not match newlines depending on your engine and whether the Kleenexp is compiled in multiline mode, see your regex engine's documentation |
#any_at_all | #aaa | [#any \| #newline] |
|
#newline_character | #nc | /[\r\n\u2028\u2029]/ |
Any of #cr , #lf , and in unicode a couple more (explanation) |
#newline | #n | [#newline_character \| #crlf] |
Note that this may match 1 or 2 characters! |
#not_newline | #nn | [not #newline_character] |
Note that this may only match 1 character, and is not the negation of #n but of #nc ! |
#linefeed | #lf | /\n/ |
See also #n (explanation) |
#carriage_return | #cr | /\r/ |
See also #n (explanation) |
#windows_newline | #crlf | /\r\n/ |
Windows newline (explanation) |
#tab | #t | /\t/ |
|
#not_tab | #nt | [not #tab] |
|
#digit | #d | /\d/ |
|
#not_digit | #nd | [not #d] |
|
#letter | #l | /[A-Za-z]/ |
When in unicode mode, this will be translated as \p{L} in languages that support it (and throw an error elsewhere) |
#not_letter | #nl | [not #l] |
|
#lowercase | #lc | /[a-z]/ |
Unicode: \p{Ll} |
#not_lowercase | #nlc | [not #lc] |
|
#uppercase | #uc | /[A-Z]/ |
Unicode: \p{Lu} |
#not_uppercase | #nuc | [not #uc] |
|
#space | #s | /\s/ |
|
#not_space | #ns | [not #space] |
|
#token_character | #tc | [#letter \| #digit \| '_'] |
|
#not_token_character | #ntc | [not #tc] |
|
#token | [#letter \| '_'][0+ #token_character] |
||
#word_boundary | #wb | /\b/ |
|
#not_word_boundary | #nwb | [not #wb] |
|
#quote | #q | ' |
|
#double_quote | #dq | " |
|
#left_brace | #lb | [ '[' ] |
|
#right_brace | #rb | [ ']' ] |
|
#start_string | #ss | /\A/ (this is the same as #sl unless the engine is in multiline mode) |
|
#end_string | #es | /\Z/ (this is the same as #el unless the engine is in multiline mode) |
|
#start_line | #sl | /^/ (this is the same as #ss unless the engine is in multiline mode) |
|
#end_line | #el | /$/ (this is the same as #es unless the engine is in multiline mode) |
|
#\<char1>..\<char2>, e.g. #a..f , #1..9 |
[<char1>-<char2>] |
char1 and char2 must be of the same class (lowercase english, uppercase english, numbers) and char1 must be strictly below char2 , otherwise it's an error (e.g. these are errors: #a..a , #e..a , #0..f , #!..@ ) |
|
#integer | #int | [[0-1 '-'] [1+ #digit]] |
|
#digits | #ds | [1+ #digit] |
|
#decimal | [#int [0-1 '.' #digits] |
||
#float | [[0-1 '-'] [[#digits '.' [0-1 #digits] \| '.' #digits] [0-1 #exponent] \| #int #exponent] #exponent=[['e' \| 'E'] [0-1 ['+' \| '-']] #digits]] |
||
#hex_digit | #hexd | [#digit \| #a..f \| #A..F] |
|
#hex_number | #hexn | [1+ #hex_digit] |
|
#letters | [1+ #letter] |
||
#capture_0+_any | #c0 | [capture 0+ #any] |
|
#capture_1+_any | #c1 | [capture 1+ #any] |
|
#vertical_tab | /\v/ |
||
#bell | /\a/ |
||
#backspace | /[\b]/ |
||
#formfeed | /\f/ |
* Definitions /wrapped in slashes/
are in old regex syntax. This is used when the macro isn't simply a short way to express something you could express otherwise in Kleenexp.)
For example,
"[not ['a' | 'b']]"
compiles to /[^ab]/
"[#digit | [#a..f]]"
compiles to /[0-9a-f]/
Coming soon:
#ip
, #email
, #us_phone
, #intl_phone
, #link
, #url
, #timestamp
, #mmddyy
...[#0..255]
(which translates to ['25' #0..5 | '2' #0..4 #d | '1' #d #d | #1..9 #d | #d]
#dot
, #hash
, #tilde
...abc[ignore_case 'de' #lowercase]
(which translates to abc[['D' | 'd'] ['E'|'e'] [[A-Z] | [a-z]]
; Today you just wouldn't try.)[1+:fewest ...]
(for non-greedy repeat)ke.add_macro("#camelcase=[1+ [#uppercase [0+ lowercase]]], path_optional)
, [add_macro #month=['january', 'January', 'Jan', ....]]
ke.import_macros("./apache_logs_macros.ke")
, ke.export_macros("./my_macros.ke")
, and maybe arrange built-in ke macros in packages#month
, #word
, #year_month_day
or #yyyy-mm-dd
TODO.txt
.Ease of migration trumps any other design consideration. Without a clear, painless migration path, there will be no adoption.
kgrep [#start_line [1+ #letter] #end_line]
instead of ^[a-zA-Z]+$
is not fun when exploring, but keeping [#sl#ls#el]
in your codebase for posterity is not a huge improvement in readability)$ grep "`ke "[1+ #d] Reasons"`"
^[^#]\*$
/Yo, dawg!/
matches "Yo, dawg!" and no other string)"
and \
in most languages' string literals, `
or $
in bash ('
is OK because every language that allows '
strings also allows "
strings. Except for SQL. Sorry, SQL.)/\bX.X.X.X\b
where X
is (?:25[0-5]|2[0-4]\d|1\d\d|[1-9]\d|\d)/
0..255
and get (?:25[0-5]|2[0-4]\d|1\d\d|[1-9]\d|\d)
See Grammar.
PRs welcome. If it's a major change, maybe open a "feature suggestion" issue first suggesting the feature, get a blessing, and agree on a design.
. configuration and build system
├── ke/ Python package, includes transpiler and `import re` drop-in replacement API
│ ├── __init__.py Python API, chooses between Python and Rust transpilers
│ ├── pyke.py Python transpiler top-level
│ ├── parser.py Grammar and visitor-pattern transformation of parse tree to Abstract Syntax Tree (AST)
│ ├── compiler.py Translation from AST to Asm tree (regex-like Intermediate Representation), builtin macro definitions
│ └── asm.py Translation from Asm tree to regex syntax string
├── tests Test suite written in Python that can run against both implementations
├── _ke Python extension that exposes Rust transpiler to Python package
├── vscode vscode extension that invokes Kleenexp transpiler before search and replace tools, uses kleenexp-wasm
├── rust Rust crate, includes transpiler and API
│ ├── lib.py Rust crate, includes transpiler and `regex` crate drop-in replacement API
│ ├── parse.py Parser that outputs AST
│ └── compiler.py Translation from AST to Asm tree (regex-like Intermediate Representation), builtin macro definitions,
│ translation from Asm tree to regex syntax string, transpiler top level
└── kleenexp-wasm npm package that exposes Rust transpiler to JavaScript ecosystem
Before making commits make sure to run these commands:
pip install pre-commit
pre-commit install
This will run autoformatting tools like black
on all files you changed whenever you try to commit. If they make changes, you will need to git add
the changes before you can commit.
Before every commit, make sure the tests pass:
pytest
maturin develop pytest && KLEENEXP_RUST=1 pytest # optional
Before opening a PR, please review your own diff and make sure everything is well tested and has clear descriptive names and documentation wherever names are not enough (e.g. to explain why a complex approach was taken).
Regular Expressions: Very popular, occasionally reads like line noise, backslash for escape.
(?:What is your (?:name|quest|favourite colour)\?\s?){1,3}
Kleenexp (this right here): Terse, readable, almost never needs escaping. Python compiler, almost ready VSCode extension, online playground.
[1-3 'What is your ' ['name' | 'quest' | 'favourite colour'] '?' [0-1 #space]]
Melody - More verbose, supports macros, backslash escapes only for quotes. Rust compiler, babel plugin. Improves with time, getting quite impressive.
1 to 3 of match {
"What is your ";
either {
"name";
"quest";
"favorite color";
}
"?";
0 to 1 of " ";
}
Pomsky: Very similar to legacy regex syntax; supports macros and number ranges; supports unicode; amazing error messages help convert legacy to new syntax; backslash escapes only for quotes. Rust compiler: As of today no built in way to use outside Rust. (But they seem to be planning it.)
('What is your ' ('name'|'quest'|'favorite colour')'?' [s]){1,3}
Eggex: Part of a new Unix shell's syntax. Big on composition (macros in Kleenexp). Uses backslash for character classes. Production-ready within the shell, not supported elsewhere yet.
/ ('What is your ' ('name' | 'quest' | 'favorite color') '?' ' '?){1,3} /
Raku regexes: Similar to Eggex, part of Raku (the artist formerly known as Perl 6)
Verbal expressions: Embedded DSL, supports 14(!) languages (to some extent? I didn't verify), but isn't actively maintained
const tester = VerEx()
.multiple(
VerEx()
.then('What is your ')
.then(
VerEx().then('name').or('quest').or('favorite color')
)
.then('?')
.maybe(' '),
1,
3,
)
There are many more eDSLs, but I will not list them as they are less relevant in my opinion
(c) 2015-2022 Aur Saraf. Kleenexp is distrubuted under the MIT license.