Towerism / vim

Automatically exported from code.google.com/p/vim

Feature: Extended regular expressions #99

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
(See https://code.google.com/r/mike-vim-extended-regex/ for the source code of 
this feature. Diffs are here: 
https://code.google.com/r/mike-vim-extended-regex/source/list )

I've implemented support for extended regular expressions in Vim, somewhat 
similar to Perl's extended regex feature, which allows you to make complicated 
regexes (especially in Vimscript files) easier to read, by including whitespace 
and comments in them.  (Vim already allows multiline regexes.)  I'm hoping 
that, after any changes suggested on this forum, this will be a useful addition 
to Vim.

One of the trickiest, and perhaps most contentious, parts is choosing the 
syntax to use -- how to turn on extended mode, and what comments should look 
like.  I am very open to feedback and changes on this.  Below, I present the 
reasoning behind my initial choices.

This is what I have implemented:

- To turn on extended mode, put \# at the beginning of your regex.
- A comment is enclosed in double-braces, like {{ this }}.
- To match a space rather than having it be ignored, use "\ ".

Here is a simple example.  syntax/c.vim includes this, for syntax highlighting 
of backslash-escaped sequences inside strings in C:

    " String and Character constants
    " Highlight special characters (those which have a backslash) differently
    syn match   cSpecial    display contained "\\\(x\x\+\|\o\{1,3}\|.\|$\)"

With extended regular expressions, the above could be written with whitespace 
and comments:

    " String and Character constants
    " Highlight special characters (those which have a backslash) differently
    syn match   cSpecial    display contained 
                \ "\#
                \    \\             {{ literal backslash, followed by one of... }}
                \    \(
                \        x \x\+     {{ hex, e.g. '\x2c' }}
                \      \|
                \        \o\{1,3}   {{ octal, e.g. '\755' }}
                \      \|
                \        .          {{ e.g. '\n' or '\t' etc. }}
                \      \|
                \        $          {{ end of line }}
                \    \)"
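To make the three rules concrete, here is a minimal sketch in Python (not the actual patch, which lives in Vim's C source) of how an extended pattern could be reduced to a plain one.  The function name `strip_extended` and the constant `EXTENDED` are hypothetical, and the sketch assumes that comments are not nested and that whitespace may be dropped everywhere outside a backslash sequence.

```python
def strip_extended(pattern: str) -> str:
    """Reduce an extended-mode pattern to the equivalent plain pattern."""
    if not pattern.startswith("\\#"):
        return pattern  # no \# prefix: not extended mode, leave untouched
    out = []
    i = 2  # skip the leading \#
    n = len(pattern)
    while i < n:
        if pattern.startswith("{{", i):
            # Skip a (non-nested) comment up to the closing }}.
            end = pattern.find("}}", i + 2)
            if end == -1:
                raise ValueError("unterminated {{ comment")
            i = end + 2
        elif pattern[i] == "\\" and i + 1 < n:
            # "\ " becomes a literal space; any other backslash
            # sequence (\\, \(, \|, \x, ...) is copied unchanged.
            out.append(" " if pattern[i + 1] == " " else pattern[i:i + 2])
            i += 2
        elif pattern[i].isspace():
            i += 1  # insignificant whitespace
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)


# The cSpecial pattern from above, as Vim sees it once the
# continuation lines have been joined into a single line.
EXTENDED = (
    r"\#"
    r"    \\             {{ literal backslash, followed by one of... }}"
    r"    \("
    r"        x \x\+     {{ hex, e.g. '\x2c' }}"
    r"      \|"
    r"        \o\{1,3}   {{ octal, e.g. '\755' }}"
    r"      \|"
    r"        .          {{ e.g. '\n' or '\t' etc. }}"
    r"      \|"
    r"        $          {{ end of line }}"
    r"    \)"
)

print(strip_extended(EXTENDED))
```

Under those assumptions, the stripped result is character-for-character the pattern currently in syntax/c.vim.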

I have not yet written tests or docs.  If you want, I would be happy to do so.

As for the syntax: Obviously it is best not to invent a brand new syntax unless 
there is a good reason to do so.  I would have preferred to use Perl's syntax, 
which is:

- To turn on extended mode, use "x" in the flags area after the regex, e.g. 
/foo/x
- A comment begins with (?# and ends with )

Unfortunately, neither of those fits Vim especially well.  For turning on 
extended mode, Vim makes only very light use of "flags" after regular 
expressions: it allows a few after the :s (substitute) command, but in general 
it doesn't use them.  In Vim, the same effect is usually achieved by putting 
special codes at the beginning of a regex, such as \c to ignore case.

And for comments, using (?# ... ) would work, but would be somewhat awkward.  
In Perl, both the () operator and the ? operator are "magic" by default (they do 
not need to be escaped with a backslash to give them special meaning).  But in 
Vim, the opposite is true: by default, () just matches parentheses, and ? just 
matches a question mark.  So in a Vim regex, a comment would look like \(\?# 
this \), which is just too ugly and too tricky for people to remember.

So I played around with a number of alternative syntax options.

-----

1. Syntax for turning on extended mode:

Consistent with other regex syntax in Vim, it seemed to me that the best way to 
let the user turn on extended mode would be the presence of some special 
sequence at the beginning of the regex, similar to Vim's current use of \c or 
\C for case sensitivity, \m \M \v \V to choose a "magic" mode, and so on.  Here 
is a list of all available one-character backslash sequences:

    \!   \"   \#   \$   \'   \,   \-   \:   \;   \g   \j   \q   \y   \^   \`

I would have liked to use \x or \e to indicate extended mode, but both of those 
are already used. (\x matches any hex digit; \e matches the escape character.)

Given those choices, my favorite was:

    \#

... mainly because "#" is used in many programming languages to begin a comment.

Other possibilities: Vim already uses \% and \z as prefixes for a number of 
other pattern items, so two options that seem pretty good to me are:

    \%e
or
    \zx

I sort of like \%e.  It has the advantage of being somewhat mnemonic (e for 
extended), and also it avoids using up a punctuation character (#) that might 
be better saved for other future enhancements.

-----

2. Syntax for comments:

One issue is: Should turning comments on/off require "magic" characters or not?  
At first I thought, of course it would have to include magic characters; but 
then it occurred to me that we could just use a character sequence that is 
somewhat unlikely to appear in regexes, and that is easy to represent as 
regular characters (rather than comment delimiters) in a regex if necessary.

I like {{ double braces }} because:

- They look nice and are easy to type.
- They don't conflict with any other regex syntax patterns.  Yes, braces are 
used to indicate a count, e.g. x\{1,3} for one to three x's, but that uses 
single braces.
- It is easy to represent a match for the actual characters "{{" in an extended 
regex: Just put a space between them, "{ {".

Other options:

If we use \# to turn on extended mode, I thought it might be nice to use some 
sort of comment delimiter that includes the "#" character, but I couldn't come 
up with anything that good.  The best I could come up with is ## to begin a 
comment and ## again to end a comment, but that could lead to trouble if the 
user tries to mark off a comment with "#############".  Other possibilities:
    #( )
    {# #}

We can't use "#" by itself for comments, with end-of-line indicating the end of 
the comment, because of the way Vim multiline strings work.  In Vim, when you 
write

    let x = "this is
                \ a string"

what you get is "this is a string".  There is no embedded newline in the 
result.
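The joining behaviour can be sketched in Python as follows.  This is only an illustration of the rule as described above (a line whose first non-blank character is a backslash is appended to the previous line), not Vim's actual code, and the name `join_continuations` is hypothetical.

```python
def join_continuations(lines):
    """Emulate Vim's line continuation on a list of script lines."""
    joined = []
    for line in lines:
        stripped = line.lstrip()
        if stripped.startswith("\\") and joined:
            # Drop the indentation and the backslash, then append the
            # remainder to the previous line with no newline in between.
            joined[-1] += stripped[1:]
        else:
            joined.append(line)
    return joined


script = ['let x = "this is', '            \\ a string"']
print(join_continuations(script))
# prints ['let x = "this is a string"']
```

Because no newline survives the join, a "#" comment that ran to end-of-line would swallow the rest of the regex, which is why the comment syntax needs an explicit closing delimiter.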

I also thought it might be nice to somehow use the " double-quote character to 
indicate comments, since that is Vimscript's comment character; but the 
double-quote character would be a bad choice because often, in Vimscript, the 
regex itself is double-quoted, so you would have to backslash-escape all the 
embedded double-quote characters, which would get a bit messy.

-----

A few more details about the syntax:

- Comments support nesting.  This is mainly useful while debugging your regex, 
to "comment out" part of it.

- Comments and extra whitespace are not allowed inside collections such as 
[a-z], inside repetition indicators such as \{1,3}, in the middle of special 
sequences such as "\%$", and so on.
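
Nesting can be handled with a simple depth counter.  The Python sketch below is an assumption about how that might work, not a copy of the patch; `skip_comment` is a hypothetical name.  It finds the end of a comment that may itself contain comments, and it does not cover the second point above (collections, repetition indicators, and special sequences).

```python
def skip_comment(pattern: str, start: int) -> int:
    """Return the index just past the }} closing the comment at start."""
    if not pattern.startswith("{{", start):
        raise ValueError("no comment starts at this index")
    depth = 0
    i = start
    while i < len(pattern):
        if pattern.startswith("{{", i):
            depth += 1  # entering a (possibly nested) comment
            i += 2
        elif pattern.startswith("}}", i):
            depth -= 1  # leaving one level of comment
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated {{ comment")


# "Commenting out" part of a regex that already contains a comment:
pattern = r"a {{ off: \d\+ {{ digits }} }} b"
rest = pattern[skip_comment(pattern, 2):]
print(repr(rest))
# prints ' b'
```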

Original issue reported on code.google.com by m...@morearty.com on 12 Dec 2012 at 12:55

GoogleCodeExporter commented 9 years ago

Original comment by chrisbr...@googlemail.com on 9 Jan 2015 at 12:18