SonOfLilit / kleenexp

modern regular expression syntax everywhere with a painless upgrade path
MIT License

Kleenexp: Regex for Humans

Try it online.

Available as a plugin for Visual Studio Code, and as libraries for Python, JavaScript, TypeScript, and Rust.

Regular Expressions are one of the best ideas in the field of software. However, their $#%! syntax is an accident from 1968. Kleene Expressions (after mathematician Stephen Kleene who discovered regex) are a drop-in replacement syntax that compiles to languages' native regex libraries, promising full bug-for-bug API compatibility.

Now 100% less painful to migrate! (You heard that right: migration is not painful at all.)

Try it

Installation and usage

Python Library

pip install kleenexp

Now just write ke wherever you used to write re:

import ke

username = input('Choose a username:')
# if not re.match(r'[A-Za-z][A-Za-z\d]*$', username):
if not ke.match('[#letter [0+ [#letter | #digit]] #end_string]', username):
    print("Invalid username")
else:
    password = input('Enter a new password:')
    # if not re.match(r'\A(?=[^a-z]*[a-z])(?=[^A-Z]*[A-Z])(?=\D*(\d))(?!.*(?:123|pass|Pass))\w{6,}\Z', password):
    if not ke.match('''[
      #has_lower=[lookahead [0+ not #lowercase] #lowercase]
      #has_upper=[lookahead [0+ not #uppercase] #uppercase]
      #has_digit=[lookahead [0+ not #digit] [capture #digit]]
      #no_common=[not lookahead [0+ #any] ["123" | "pass" | "Pass"]]

      #start_string #has_lower #has_upper #has_digit #no_common [6+ #token_character] #end_string
    ]''', password):
        print("Password should have at least one uppercase letter, one lowercase, one digit.")
        print("And nothing obvious like '123'")
    else:
        ...
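
To see what the legacy pattern quoted in the comment above accepts, it can be exercised with the standard `re` module alone. This is only a sketch for checking the regex side of the comparison; `is_acceptable` is a hypothetical helper name and not part of kleenexp.

```python
import re

# Legacy password regex quoted in the comment of the example above.
LEGACY_PASSWORD = re.compile(
    r'\A(?=[^a-z]*[a-z])(?=[^A-Z]*[A-Z])(?=\D*(\d))'
    r'(?!.*(?:123|pass|Pass))\w{6,}\Z'
)


def is_acceptable(password):
    """Hypothetical helper: True if the legacy regex accepts the password."""
    return LEGACY_PASSWORD.match(password) is not None


# One acceptable password, then one failure per rule:
# no uppercase, contains "Pass", contains "123", too short.
for candidate in ("Secret7x", "secret7x", "Password1", "Abc123x", "Ab1"):
    print(candidate, is_acceptable(candidate))
```

Only the first candidate passes all four lookaheads and the length check.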

Be sure to read the tutorial below!

A Taste of the Syntax

Kleenexp:

Hello. My name is Inigo Montoya. You killed my Father. Prepare to die.

Regex:

Hello\. My name is Inigo Montoya\. You killed my Father\. Prepare to die\.

Kleenexp:

[1-3 'What is your ' ['name' | 'quest' | 'favourite colour'] '?' [0-1 #space]]

Regex:

(?:What is your (?:name|quest|favourite colour)\?\s?){1,3}
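
The regex form of this pattern, written with balanced parentheses, can be tried with the standard `re` module. The sketch below only checks the regex side; the variable names are illustrative.

```python
import re

# Regex form of the "What is your ...?" pattern, allowing 1 to 3 repetitions.
QUESTIONS = re.compile(
    r"(?:What is your (?:name|quest|favourite colour)\?\s?){1,3}"
)

three = "What is your name? What is your quest? What is your favourite colour?"
four = three + " What is your name?"

print(QUESTIONS.fullmatch(three) is not None)  # three questions fit in {1,3}
print(QUESTIONS.fullmatch(four) is not None)   # a fourth question does not
```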

Kleenexp:

Hello. My name is [capture:name #tmp ' ' #tmp #tmp=[#uppercase [1+ #lowercase]]]. You killed my ['Father' | 'Mother' | 'Hamster']. Prepare to die.

Regex:

Hello\. My name is (?<name>[A-Z][a-z]+ [A-Z][a-z]+)\. You killed my (?:Father|Mother|Hamster)\. Prepare to die\.
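
One detail worth knowing when trying the regex form in Python: the standard `re` module spells named groups `(?P<name>...)` rather than `(?<name>...)`. A minimal sketch using only `re`, with illustrative variable names:

```python
import re

# Same regex as above, with the named group in Python's (?P<name>...) spelling.
INIGO = re.compile(
    r"Hello\. My name is (?P<name>[A-Z][a-z]+ [A-Z][a-z]+)\. "
    r"You killed my (?:Father|Mother|Hamster)\. Prepare to die\."
)

match = INIGO.fullmatch(
    "Hello. My name is Inigo Montoya. You killed my Hamster. Prepare to die."
)
print(match.group("name"))
```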

Or, if you're in a hurry, you can use the shortened form:

Hello. My name is [c:name#uc[1+#lc]' '#uc[1+#lc]]. You killed my ['Father'|'Mother'|'Hamster']. Prepare to die.

(And when you're done, you can use our automatic tool -- in development -- to convert the short Kleenexp to the more readable version, and commit that instead.)

Syntax Cheat Sheet

Cheat Sheet ( Print )

More on the syntax, additional examples, and the design criteria that led to its design, below.

How We're Going To Take Over The World

This is not a toy project meant to prove a technological point. This is a serious attempt to fix something that is broken in the software ecosystem and has been broken since before we were born. We have experience running R&D departments, we understand how technology decisions are made, and we realise success here hinges more on "growth hacking" and on having a painless and risk-free migration story than it does on technical excellence.

Step 1 is to introduce Kleenexp to the early adopter developer segment by releasing great plugins for popular text editors like Visual Studio Code, with better UX (syntax highlighting, autocompletion, good error messages, ...) and a great tutorial. Adopting a new regex syntax for your text editor is low-risk and requires no coordination between stakeholders.

Step 2 is to aim at hobbyist software projects by making our JavaScript adoption story as painless and risk-free as possible (since JavaScript has the most early-adopting and fast-moving culture). In addition to a runtime drop-in syntax adapter, we will write a Babel plugin that translates Kleenexp syntax into legacy regex syntax at compile time, to enable zero-overhead usage.

Step 3 is to aim at startups by optimizing and testing the implementations until they're ready for deployment in big-league production scenarios.

Step 4 is to make it possible for large legacy codebases to switch by releasing tools that automatically convert a codebase from legacy syntax to Kleenexp (like Python's 2to3 or Airbnb's ts-migrate).

Roadmap

Name

Kleene Expressions are named after mathematician Stephen Kleene who invented regular expressions.

Wikipedia says:

Although his last name is commonly pronounced /ˈkliːni/ KLEE-nee or /kliːn/ KLEEN, Kleene himself pronounced it /ˈkleɪni/ KLAY-nee. His son, Ken Kleene, wrote: "As far as I am aware this pronunciation is incorrect in all known languages. I believe that this novel pronunciation was invented by my father."

However, with apologies to the late Dr. Kleene, "Kleene expressions" is pronounced "Clean expressions" and not "Klein expressions."

Real World Examples

Removing parentheses:

import ke

def remove_parentheses(line):
    if ke.search("[#open=['('] #close=[')'] #open [0+ not #close] #open]", line):
        raise ValueError()
    return ke.sub("[ '(' [0+ not ')'] ')' ]", '', line)
assert remove_parentheses('a(b)c(d)e') == 'ace'

The original with regex is from a hackathon project I participated in, and looks like this:

import re

def remove_parentheses(line):
    if re.search(r'\([^)]*\(', line):
        raise ValueError()
    return re.sub(r'\([^)]*\)', '', line)
assert remove_parentheses('a(b)c(d)e') == 'ace'

For replacement with sub(), the syntax for the replacement is the same as for regexes.

import ke
assert ke.sub("[[capture '.' [6 #digit] ] [0+ #digit] ]", r"\1", "3.14159265359") == "3.141592"
assert ke.sub("Hi [capture:name 1+ #letter]!", r"\g<name> \g<name>!", "Hi Bobby!") == "Bobby Bobby!"
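
For comparison, the same replacement strings work unchanged with the standard `re` module. The two regexes below are hand-written equivalents of the Kleenexp patterns above and are an assumption; the strings kleenexp actually emits may differ.

```python
import re

# Assumed regex equivalent of "[[capture '.' [6 #digit] ] [0+ #digit] ]".
truncated = re.sub(r"(\.\d{6})\d*", r"\1", "3.14159265359")

# Assumed regex equivalent of "Hi [capture:name 1+ #letter]!".
doubled = re.sub(r"Hi (?P<name>[A-Za-z]+)!", r"\g<name> \g<name>!", "Hi Bobby!")

print(truncated)
print(doubled)
```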

Another example, rewriting paths in Django:

import ke
from django.urls import path, re_path

from . import views

urlpatterns = [
    path('articles/2003/', views.special_case_2003),
    re_path(ke.re("[#start_line]articles/[capture:year 4 #digit]/[#end_line]"), views.year_archive),
    re_path(ke.re("[#start_line]articles/[capture:year 4 #digit]/[capture:month 2 #digit]/[#end_line]"), views.month_archive),
    re_path(ke.re("[#start_line]articles/[capture:year 4 #digit]/[capture:month 2 #digit]/[capture:slug 1+ [#letter | #digit | '_' | '-']]/[#end_line]"), views.article_detail),
]

The original with regex is taken from Django documentation and looks like this:

from django.urls import path, re_path

from . import views

urlpatterns = [
    path('articles/2003/', views.special_case_2003),
    re_path(r'^articles/(?P<year>[0-9]{4})/$', views.year_archive),
    re_path(r'^articles/(?P<year>[0-9]{4})/(?P<month>[0-9]{2})/$', views.month_archive),
    re_path(r'^articles/(?P<year>[0-9]{4})/(?P<month>[0-9]{2})/(?P<slug>[\w-]+)/$', views.article_detail),
]
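
To see which keyword arguments these routes would hand to a view, the legacy patterns can be run through the standard `re` module without Django. This is a sketch; the sample path is made up for illustration.

```python
import re

# The article-detail pattern from the Django example above.
ARTICLE_DETAIL = re.compile(
    r'^articles/(?P<year>[0-9]{4})/(?P<month>[0-9]{2})/(?P<slug>[\w-]+)/$'
)

# Hypothetical request path, without the leading slash.
match = ARTICLE_DETAIL.match('articles/2003/03/building-a-django-site/')
print(match.groupdict())
```

The named groups become the keyword arguments `year`, `month` and `slug`, all as strings.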

Syntax

Anything outside of brackets is a literal:

This is a (short) literal :-)
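
In legacy regex syntax the same text needs escaping, because `(` and `)` are grouping operators. A small sketch with the standard `re` module shows the difference; it illustrates the regex side only and does not call kleenexp.

```python
import re

text = "This is a (short) literal :-)"

# Used directly as a regex, the parentheses become a group, not literal text.
unescaped = re.fullmatch("This is a (short) literal :-\\)", text)

# re.escape produces a regex that matches the text exactly as written.
escaped = re.fullmatch(re.escape(text), text)

print(unescaped is None, escaped is not None)
```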

You can use macros like #digit (short: #d) or #any (#a):

This is a [#lowercase #lc #lc #lc] regex :-)

You can repeat with n, n+ or n-m:

This is a [1+ #lc] regex :-)

If you want either of several options, use |:

This is a ['Happy' | 'Short' | 'readable'] regex :-)

Capture with [capture <Kleenexp>] (short: [c <Kleenexp>], named capture group: [c:name <Kleenexp>]):

This is a [capture:adjective 1+ [#letter | ' ' | ',']] regex :-)

Reverse a pattern that matches a single character with not:

[#start_line [0+ #space] [not ['-' | #digit | #space]] [0+ not #space]]
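
Read as a legacy regex, the pattern above corresponds roughly to `^\s*[^\-\d\s]\S*`. That translation is written by hand and is an assumption; the string kleenexp emits may differ. A sketch of what it accepts, using only `re`:

```python
import re

# Hand translation (assumed): optional leading whitespace, then a first
# character that is not '-', a digit or whitespace, then the rest of the word.
FIRST_WORD = re.compile(r"^\s*[^\-\d\s]\S*")

for line in ("  hello world", "  -flag", "42 apples"):
    match = FIRST_WORD.match(line)
    print(repr(line), match.group(0) if match else None)
```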

Define your own macros with #name=[<regex>]:

This is a [#trochee #trochee #trochee] regex :-)[
    [comment 'see xkcd 856']
    #trochee=['Robot' | 'Ninja' | 'Pirate' | 'Doctor' | 'Laser' | 'Monkey']
]
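
For contrast, reusing a fragment in legacy regex syntax usually means building the pattern by string interpolation. The sketch below does that by hand with the standard `re` module; it assumes that whitespace between the three `#trochee` references is not part of the match, and it does not call kleenexp.

```python
import re

# Hand-written stand-in for the #trochee macro: one non-capturing alternation.
TROCHEE = r"(?:Robot|Ninja|Pirate|Doctor|Laser|Monkey)"

# The fragment is interpolated three times; the literal ')' is escaped by hand.
PATTERN = re.compile(rf"This is a {TROCHEE}{TROCHEE}{TROCHEE} regex :-\)")

print(PATTERN.fullmatch("This is a RobotNinjaPirate regex :-)") is not None)
```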

Lookahead and lookbehind:

[#start_string
  [lookahead [0+ #any] #lowercase]
  [lookahead [0+ #any] #uppercase]
  [lookahead [0+ #any] #digit]
  [not lookahead [0+ #any] ["123" | "pass" | "Pass"]]
  [6+ #token]
  #end_string
]
[")" [not lookbehind "()"]]
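
The last pattern reads, in legacy syntax, roughly as `\)(?<!\(\))`: a closing parenthesis that does not complete an empty `()` pair. That translation is hand-written and an assumption. A sketch with the standard `re` module:

```python
import re

# Hand translation (assumed) of [")" [not lookbehind "()"]].
CLOSE_NOT_EMPTY = re.compile(r"\)(?<!\(\))")

text = "f() g(x)"
positions = [match.start() for match in CLOSE_NOT_EMPTY.finditer(text)]
print(positions)
```

Only the `)` that closes `(x)` is reported; the one that closes `()` is rejected by the negative lookbehind.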

Add comments with the comment operator:

[[comment "Custom macros can help document intent"]
  #has_lower=[lookahead [0+ not #lowercase] #lowercase]
  #has_upper=[lookahead [0+ not #uppercase] #uppercase]
  #has_digit=[lookahead [0+ not #digit] [capture #digit]]
  #no_common=[not lookahead [0+ #any] ["123" | "pass" | "Pass"]]

  #start_string #has_lower #has_upper #has_digit #no_common [6+ #token_character] #end_string
]

Cheat Sheet ( Print )

Some macros you can use:

| Long Name | Short Name | Definition* | Notes |
| --- | --- | --- | --- |
| #any | #a | /./ | May or may not match newlines depending on your engine and whether the Kleenexp is compiled in multiline mode, see your regex engine's documentation |
| #any_at_all | #aaa | [#any \| #newline] | |
| #newline_character | #nc | /[\r\n\u2028\u2029]/ | Any of #cr, #lf, and in unicode a couple more |
| #newline | #n | [#newline_character \| #crlf] | Note that this may match 1 or 2 characters! |
| #not_newline | #nn | [not #newline_character] | Note that this may only match 1 character, and is not the negation of #n but of #nc! |
| #linefeed | #lf | /\n/ | See also #n |
| #carriage_return | #cr | /\r/ | See also #n |
| #windows_newline | #crlf | /\r\n/ | Windows newline |
| #tab | #t | /\t/ | |
| #not_tab | #nt | [not #tab] | |
| #digit | #d | /\d/ | |
| #not_digit | #nd | [not #d] | |
| #letter | #l | /[A-Za-z]/ | When in unicode mode, this will be translated as \p{L} in languages that support it (and throw an error elsewhere) |
| #not_letter | #nl | [not #l] | |
| #lowercase | #lc | /[a-z]/ | Unicode: \p{Ll} |
| #not_lowercase | #nlc | [not #lc] | |
| #uppercase | #uc | /[A-Z]/ | Unicode: \p{Lu} |
| #not_uppercase | #nuc | [not #uc] | |
| #space | #s | /\s/ | |
| #not_space | #ns | [not #space] | |
| #token_character | #tc | [#letter \| #digit \| '_'] | |
| #not_token_character | #ntc | [not #tc] | |
| #token | | [#letter \| '_'][0+ #token_character] | |
| #word_boundary | #wb | /\b/ | |
| #not_word_boundary | #nwb | [not #wb] | |
| #quote | #q | ' | |
| #double_quote | #dq | " | |
| #left_brace | #lb | [ '[' ] | |
| #right_brace | #rb | [ ']' ] | |
| #start_string | #ss | /\A/ | (this is the same as #sl unless the engine is in multiline mode) |
| #end_string | #es | /\Z/ | (this is the same as #el unless the engine is in multiline mode) |
| #start_line | #sl | /^/ | (this is the same as #ss unless the engine is in multiline mode) |
| #end_line | #el | /$/ | (this is the same as #es unless the engine is in multiline mode) |
| #\<char1>..\<char2>, e.g. #a..f, #1..9 | | [<char1>-<char2>] | char1 and char2 must be of the same class (lowercase english, uppercase english, numbers) and char1 must be strictly below char2, otherwise it's an error (e.g. these are errors: #a..a, #e..a, #0..f, #!..@) |
| #integer | #int | [[0-1 '-'] [1+ #digit]] | |
| #digits | #ds | [1+ #digit] | |
| #decimal | | [#int [0-1 '.' #digits]] | |
| #float | | [[0-1 '-'] [[#digits '.' [0-1 #digits] \| '.' #digits] [0-1 #exponent] \| #int #exponent] #exponent=[['e' \| 'E'] [0-1 ['+' \| '-']] #digits]] | |
| #hex_digit | #hexd | [#digit \| #a..f \| #A..F] | |
| #hex_number | #hexn | [1+ #hex_digit] | |
| #letters | | [1+ #letter] | |
| #capture_0+_any | #c0 | [capture 0+ #any] | |
| #capture_1+_any | #c1 | [capture 1+ #any] | |
| #vertical_tab | | /\v/ | |
| #bell | | /\a/ | |
| #backspace | | /[\b]/ | |
| #formfeed | | /\f/ | |

* Definitions /wrapped in slashes/ are in old regex syntax. This is used when the macro isn't simply a short way to express something you could express otherwise in Kleenexp.
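
The rules for range macros such as #a..f (same character class on both ends, first character strictly below the second) can be sketched as a small validator. This is an illustration of the rule as stated in the table, not kleenexp's implementation; `expand_range` is a hypothetical name.

```python
import string

# The three classes a range macro may draw from, per the table above.
_CLASSES = (string.ascii_lowercase, string.ascii_uppercase, string.digits)


def expand_range(macro):
    """Hypothetical helper: turn '#a..f' into the character class '[a-f]'."""
    if len(macro) != 5 or macro[0] != "#" or macro[2:4] != "..":
        raise ValueError(f"not a range macro: {macro!r}")
    start, end = macro[1], macro[4]
    for alphabet in _CLASSES:
        if start in alphabet and end in alphabet:
            if start >= end:
                raise ValueError(f"range must be strictly increasing: {macro!r}")
            return f"[{start}-{end}]"
    raise ValueError(f"endpoints are not of the same class: {macro!r}")


print(expand_range("#a..f"), expand_range("#1..9"))
```

The four error cases listed in the table (#a..a, #e..a, #0..f, #!..@) all raise `ValueError` in this sketch.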

For example,

"[not ['a' | 'b']]" compiles to /[^ab]/

"[#digit | [#a..f]]" compiles to /[0-9a-f]/

Coming soon:

Design criteria

Migration

Ease of migration trumps any other design consideration. Without a clear, painless migration path, there will be no adoption.

Syntax

Grammar

See Grammar.

Contributing

PRs welcome. If it's a major change, consider opening a "feature suggestion" issue first to get a blessing and agree on a design.

Architecture

.                   configuration and build system
├── ke/                 Python package, includes transpiler and `import re` drop-in replacement API
│   ├── __init__.py        Python API, chooses between Python and Rust transpilers
│   ├── pyke.py            Python transpiler top-level
│   ├── parser.py          Grammar and visitor-pattern transformation of parse tree to Abstract Syntax Tree (AST)
│   ├── compiler.py        Translation from AST to Asm tree (regex-like Intermediate Representation), builtin macro definitions
│   └── asm.py             Translation from Asm tree to regex syntax string
├── tests               Test suite written in Python that can run against both implementations
├── _ke                 Python extension that exposes Rust transpiler to Python package
├── vscode              vscode extension that invokes Kleenexp transpiler before search and replace tools, uses kleenexp-wasm
├── rust                Rust crate, includes transpiler and API
│   ├── lib.rs              Rust crate, includes transpiler and `regex` crate drop-in replacement API
│   ├── parse.rs            Parser that outputs AST
│   └── compiler.rs         Translation from AST to Asm tree (regex-like Intermediate Representation), builtin macro definitions,
│                           translation from Asm tree to regex syntax string, transpiler top level
└── kleenexp-wasm       npm package that exposes Rust transpiler to JavaScript ecosystem
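
The last stage of the pipeline, turning a regex-like intermediate tree into a regex string, can be pictured with a toy emitter. Everything below is a hypothetical sketch for illustration: the node classes and their `to_regex` methods are invented here and are not the classes in `ke/asm.py`.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Literal:
    text: str

    def to_regex(self) -> str:
        return re.escape(self.text)


@dataclass
class Concat:
    parts: List[object]

    def to_regex(self) -> str:
        return "".join(part.to_regex() for part in self.parts)


@dataclass
class Either:
    options: List[object]

    def to_regex(self) -> str:
        return "(?:" + "|".join(o.to_regex() for o in self.options) + ")"


@dataclass
class Multiple:
    minimum: int
    maximum: Optional[int]  # None means no upper bound
    sub: object

    def to_regex(self) -> str:
        upper = "" if self.maximum is None else str(self.maximum)
        return "(?:%s){%d,%s}" % (self.sub.to_regex(), self.minimum, upper)


# Tree for a simplified form of the "What is your ...?" example,
# with a literal space standing in for #space.
tree = Multiple(1, 3, Concat([
    Literal("What is your "),
    Either([Literal("name"), Literal("quest"), Literal("favourite colour")]),
    Literal("?"),
    Multiple(0, 1, Literal(" ")),
]))
pattern = re.compile(tree.to_regex())
print(pattern.fullmatch("What is your name? What is your quest?") is not None)
```

Keeping the string emission in its own stage means the parser and compiler never need to know about escaping rules, which is one plausible reason for the split described in the tree above.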

PR Flow

Before making commits make sure to run these commands:

pip install pre-commit
pre-commit install

This will run autoformatting tools like black on all files you changed whenever you try to commit. If they make changes, you will need to git add the changes before you can commit.

Before every commit, make sure the tests pass:

pytest
maturin develop && KLEENEXP_RUST=1 pytest   # optional

Before opening a PR, please review your own diff and make sure everything is well tested and has clear descriptive names and documentation wherever names are not enough (e.g. to explain why a complex approach was taken).

Similar works

License

(c) 2015-2022 Aur Saraf. Kleenexp is distributed under the MIT license.