aligrudi / neatvi

A small vi/ex editor for editing UTF-8 text
http://litcave.rudi.ir/
305 stars 25 forks source link

ex_line() should be able to escape "|" character #38

Closed kyx0r closed 2 years ago

kyx0r commented 2 years ago

Hello Ali,

I fixed this long time ago in my fork, but there is a problem of not being able to escape "|" in prompt.

Because you can't escape it, ex_line will treat it like a pipe for commands. The implication is that it is impossible to run substitution commands that have "|" character in them as the searchable text.

This commit fixed it ex.c https://github.com/kyx0r/neatvi/commit/c435fd1d1bfce67eeb0c7e9bd54d6c7f1717dbe5

kyx0r commented 2 years ago

Funny enough though, it seems like even if escape is implemented, you can't fix this problem entirely. Because there will still be no way to properly escape the case when | is inside the bracket expression used for substitution. But it will work for all other cases though. In my fork this isn't a problem now, since I do support escapes in brackets.

It's like one problem, chains into another one, ehh pain in the ass. But I think even partially fixing this issue might be worth a try. Or you could also do some different approach say like if it's a substitution command, parse it till the end first, that way you can avoid having to implement escapes. But parsing it won't be any simple code, it be as bad in taste as re_groupcount() currently is.

kyx0r commented 2 years ago

Or you could exclude the posibility of piping anything entirely if it is substitution command. This might be the best middle ground.

aligrudi commented 2 years ago

Hell Kyryl,

Kyryl @.***> wrote:

Or you could exclude the posibility of piping anything entirely if it is substitution command. This might be the best middle ground.

Well, I think the right solution is to do what the "Command Line Parsing in ex" section of ex(1p) says; I shall have a look.

Ali
kyx0r commented 2 years ago
           iii. If the command is an s command, characters up to the first
                non-<backslash>-escaped <newline>, or second
                non-<backslash>-escaped delimiter character, shall be
                skipped and be part of the command.

Basically it says that there has to be means to escape the delimiter. So the escape will make it into the regex

And the the paragraph "Replacement Strings in ex" states: Otherwise, any character following a shall be treated as that literal character, and the escaping shall be discarded.

Having to discard the escapes implies a super ugly implementation. One could avoid it by replacing every escape with a NULL if it's inside the bracket, then treat NULL in regex as it be an escape, having a NULL in bracket is an error but we can repurpose the NULL check and make it be useful. Error checking does not affect valid expressions, invalid ones ehh, thats on them, it will fault either way, but this way with a cool infinite loop and a buffer overflow ;) . But even with this efficient solution it won't work on libc regex because likely the will treat NULL as an error instead of continuing to parse in hopes of finding the closing ].

This is a tough one to solve without writing garbage code that will duplicate strings, shrink the expression by removing only the escapes inside [] brackets.

kyx0r commented 2 years ago

Let me rephrase this - it's not hard to write code that will shrink the expression and discard the escapes. This is perfect example of redundant code, that does bunch of crap only for the sake of satisfying one condition. The "why" of things being done this way is so weak, that likely my students would lose hope in me as their professor. No wonder they call it the poosix standard, it might produce clear designs on paper but difficult to implement in practice.

aligrudi commented 2 years ago

Kyryl @.***> wrote:

           iii. If the command is an s command, characters up to the first
                non-<backslash>-escaped <newline>, or second
                non-<backslash>-escaped delimiter character, shall be
                skipped and be part of the command.

Basically it says that there has to be means to escape the delimiter. So the escape will make it into the regex

And the the paragraph "Replacement Strings in ex" states: Otherwise, any character following a shall be treated as that literal character, and the escaping shall be discarded.

Having to discard the escapes implies a super ugly implementation. One could avoid it by replacing every escape with a NULL if it's inside the bracket, then treat NULL in regex as it be an escape, having a NULL in bracket is an error but we can repurpose the NULL check and make it be useful. Error checking does not affect valid expressions, invalid ones ehh, thats on them, it will fault either way, but this way with a cool infinite loop and a buffer overflow ;) . But even with this efficient solution it won't work on libc regex because likely the will treat NULL as an error instead of continuing to parse in hopes of finding the closing ].

This is a tough one to solve without writing garbage code that will duplicate strings, shrink the expression by removing only the escapes inside [] brackets.

Currently exline() extracts a command and its arguments from input and sends them to one of ec functions. I think these conditions are easier to implement if the input is read character by character directly in ec_ functions.

Ali
kyx0r commented 2 years ago

Well, see if you can come up with a workable patch. But I won't blame you if you decide to simply not solve this problem. Because the spec sucks for not accomodating any kind of escaping in regex brackets. It doesn't leave much of open ended options for us.

kyx0r commented 2 years ago

Though looking at it purely from regex perspective escapes do seem like unnecesary in brk. But because of that users of regex have to build those enourmous ugly workarounds. Sounds like undersight on the standard's side. It would of been way better to keep escapes everywhere so that any regex is consistent everywhere in its string. The cost of 1 if statement, ehh.

aligrudi commented 2 years ago

Kyryl @.***> wrote:

Though looking at it purely from regex perspective escapes do seem like unnecesary in brk. But because of that users of regex have to build those enourmous ugly workarounds. Sounds like undersight on the standard's side. It would of been way better to keep escapes everywhere so that any regex is consistent everywhere in its string. The cost of 1 if statement, ehh.

Pushed a fix. Please report if something is amiss.

Ali
kyx0r commented 2 years ago

Hello Ali,

That came out to be a beefy patch in terms of number of changes. So I thought of course I won't need all of them ported into nextvi, but one question buges me, what justified removing this code

s = ex_cmd(s, arg);
while (isspace((unsigned char) *s))
    s++;

from the function ex_filearg() at the beginning? Can I also do that in nextvi? Running all those paths in my head from where ex_filearg() roughly leaves me under a vague impression that from some places it is necesary to call ex_cmd() and some not because it was called previously. Was that code 100% redundant or not? There are many changes in your last commits so I might be missing something.

I just want to be %100 sure because if I remove that code I don't accidentally make regressions that would be hard to test.

Regards, Kyryl.

aligrudi commented 2 years ago

Hello Kyryl,

Kyryl @.***> wrote:

That came out to be a beefy patch in terms of number of changes. So I thought of course I won't need all of them ported into nextvi, but one question buges me, what justified removing this code

s = ex_cmd(s, arg); while (isspace((unsigned char) *s)) s++;

All ex commands (ec_* functions) expect the whole ex line as a parameter. For each ex line, ex_loc(), ex_cmd(), and ex_arg() may be called to extract command addresses, command name, and command arguments, respectively.

Before this patch, for reading arguments one of ex_arg(), ex_filearg(), or ex_argeol() was called depending on the ex command. That is no longer necessary; ex_arg() extracts the arguments based on the command. The new ex_filearg() merely replaces % and # in the string passed to it; this is useful for commands that expect a path or a command. So now you can first extract the argument using ex_arg() and then pass it to ex_filearg().

It is probably a good idea now to rename ex_filearg(). Suggestions?

from the function ex_filearg() at the beginning? Can I also do that in nextvi? Running all those paths in my head from where ex_filearg() roughly leaves me under a vague impression that from some places it is necesary to call ex_cmd() and some not because it was called previously. Was that code 100% redundant or not? There are many changes in your last commits so I might be missing something.

It is possible to call ex_loc(), ex_cmd(), and ex_arg() once in exexec() and pass three arguments to ec*(). I shall give it a try to see if it simplifies ex.c.

Thanks, Ali

aligrudi commented 2 years ago

How about the following patch?

Ali

diff --git a/ex.c b/ex.c index 7066302..9affa88 100644 --- a/ex.c +++ b/ex.c @@ -114,50 +114,6 @@ char *ex_filetype(void) return bufs[0].ft; }

-static char ex_loc(char src, char loc); -static char ex_cmd(char src, char cmd); -static char ex_arg(char src, char *arg);

-/ read ex command location / -static char ex_loc(char src, char *loc) -{

-static int ec_write(char ec); +static int ec_write(char loc, char cmd, char arg);

static int ex_modifiedbuffer(char *msg) { if (!lbuf_modified(xb)) return 0; if (xaw && ex_path()[0])

-static int ec_buffer(char ec) +static int ec_buffer(char loc, char cmd, char arg) {

-static int ec_quit(char ec) +static int ec_quit(char loc, char cmd, char arg) {

-static int ec_edit(char ec) +static int ec_edit(char loc, char cmd, char arg) { char msg[128];

-static int ec_read(char ec) +static int ec_read(char loc, char cmd, char arg) {

-static int ec_write(char ec) +static int ec_write(char loc, char cmd, char arg) {

-static int ec_insert(char ec) +static int ec_insert(char loc, char cmd, char arg) {

-static int ec_print(char ec) +static int ec_print(char loc, char cmd, char arg) {

-static int ec_null(char ec) +static int ec_null(char loc, char cmd, char arg) {

-static int ec_delete(char ec) +static int ec_delete(char loc, char cmd, char arg) {

-static int ec_yank(char ec) +static int ec_yank(char loc, char cmd, char arg) {

-static int ec_put(char ec) +static int ec_put(char loc, char cmd, char arg) {

-static int ec_lnum(char ec) +static int ec_lnum(char loc, char cmd, char arg) {

-static int ec_undo(char ec) +static int ec_undo(char loc, char cmd, char arg) { return lbuf_undo(xb); }

-static int ec_redo(char ec) +static int ec_redo(char loc, char cmd, char arg) { return lbuf_redo(xb); }

-static int ec_mark(char ec) +static int ec_mark(char loc, char cmd, char arg) {

-static int ec_substitute(char ec) +static int ec_substitute(char loc, char cmd, char arg) {

-static int ec_exec(char ec) +static int ec_exec(char loc, char cmd, char arg) {

-static int ec_make(char ec) +static int ec_make(char log, char cmd, char arg) {

-static int ec_ft(char ec) +static int ec_ft(char loc, char cmd, char arg) {

-static int ec_cmap(char ec) +static int ec_cmap(char loc, char cmd, char arg) {

-static int ec_glob(char ec) +static int ec_glob(char loc, char cmd, char arg) {

-static int ec_set(char ec) +static int ec_set(char loc, char cmd, char arg) {

+/ read ex command location / +static char ex_loc(char src, char *loc) +{

aligrudi commented 2 years ago

The updated patch.

Ali

diff --git a/ex.c b/ex.c index 7066302..ba725c1 100644 --- a/ex.c +++ b/ex.c @@ -114,50 +114,6 @@ char *ex_filetype(void) return bufs[0].ft; }

-static char ex_loc(char src, char loc); -static char ex_cmd(char src, char cmd); -static char ex_arg(char src, char *arg);

-/ read ex command location / -static char ex_loc(char src, char *loc) -{

-/ parse ex command location / +/ parse ex command addresses / static int ex_region(char loc, int beg, int end) { int naddr = 0; @@ -315,26 +271,24 @@ static int ex_region(char loc, int beg, int end) return 0; }

-static int ec_write(char ec); +static int ec_write(char loc, char cmd, char arg);

static int ex_modifiedbuffer(char *msg) { if (!lbuf_modified(xb)) return 0; if (xaw && ex_path()[0])

-static int ec_buffer(char ec) +static int ec_buffer(char loc, char cmd, char arg) {

-static int ec_quit(char ec) +static int ec_quit(char loc, char cmd, char arg) {

-static int ec_edit(char ec) +static int ec_edit(char loc, char cmd, char arg) { char msg[128];

-static int ec_read(char ec) +static int ec_read(char loc, char cmd, char arg) {

-static int ec_write(char ec) +static int ec_write(char loc, char cmd, char arg) {

-static int ec_insert(char ec) +static int ec_insert(char loc, char cmd, char arg) {

-static int ec_print(char ec) +static int ec_print(char loc, char cmd, char arg) {

-static int ec_null(char ec) +static int ec_null(char loc, char cmd, char arg) {

-static int ec_delete(char ec) +static int ec_delete(char loc, char cmd, char arg) {

-static int ec_yank(char ec) +static int ec_yank(char loc, char cmd, char arg) {

-static int ec_put(char ec) +static int ec_put(char loc, char cmd, char arg) {

-static int ec_lnum(char ec) +static int ec_lnum(char loc, char cmd, char arg) {

-static int ec_undo(char ec) +static int ec_undo(char loc, char cmd, char arg) { return lbuf_undo(xb); }

-static int ec_redo(char ec) +static int ec_redo(char loc, char cmd, char arg) { return lbuf_redo(xb); }

-static int ec_mark(char ec) +static int ec_mark(char loc, char cmd, char arg) {

-static int ec_substitute(char ec) +static int ec_substitute(char loc, char cmd, char arg) {

-static int ec_exec(char ec) +static int ec_exec(char loc, char cmd, char arg) {

-static int ec_make(char ec) +static int ec_make(char log, char cmd, char arg) {

-static int ec_ft(char ec) +static int ec_ft(char loc, char cmd, char arg) {

-static int ec_cmap(char ec) +static int ec_cmap(char loc, char cmd, char arg) {

-static int ec_glob(char ec) +static int ec_glob(char loc, char cmd, char arg) {

-static int ec_set(char ec) +static int ec_set(char loc, char cmd, char arg) {

+/ read ex command addresses / +static char ex_loc(char src, char *loc) +{

aligrudi commented 2 years ago

Another updated patch: ex_filearg() -> ex_pathexpand()

Ali

diff --git a/ex.c b/ex.c index 7066302..47c2421 100644 --- a/ex.c +++ b/ex.c @@ -114,52 +114,8 @@ char *ex_filetype(void) return bufs[0].ft; }

-static char ex_loc(char src, char loc); -static char ex_cmd(char src, char cmd); -static char ex_arg(char src, char *arg);

-/ read ex command location / -static char ex_loc(char src, char *loc) -{

-/ parse ex command location / +/ parse ex command addresses / static int ex_region(char loc, int beg, int end) { int naddr = 0; @@ -315,26 +271,24 @@ static int ex_region(char loc, int beg, int end) return 0; }

-static int ec_write(char ec); +static int ec_write(char loc, char cmd, char arg);

static int ex_modifiedbuffer(char *msg) { if (!lbuf_modified(xb)) return 0; if (xaw && ex_path()[0])

-static int ec_buffer(char ec) +static int ec_buffer(char loc, char cmd, char arg) {

-static int ec_quit(char ec) +static int ec_quit(char loc, char cmd, char arg) {

-static int ec_edit(char ec) +static int ec_edit(char loc, char cmd, char arg) { char msg[128];

-static int ec_read(char ec) +static int ec_read(char loc, char cmd, char arg) {

-static int ec_write(char ec) +static int ec_write(char loc, char cmd, char arg) {

-static int ec_insert(char ec) +static int ec_insert(char loc, char cmd, char arg) {

-static int ec_print(char ec) +static int ec_print(char loc, char cmd, char arg) {

-static int ec_null(char ec) +static int ec_null(char loc, char cmd, char arg) {

-static int ec_delete(char ec) +static int ec_delete(char loc, char cmd, char arg) {

-static int ec_yank(char ec) +static int ec_yank(char loc, char cmd, char arg) {

-static int ec_put(char ec) +static int ec_put(char loc, char cmd, char arg) {

-static int ec_lnum(char ec) +static int ec_lnum(char loc, char cmd, char arg) {

-static int ec_undo(char ec) +static int ec_undo(char loc, char cmd, char arg) { return lbuf_undo(xb); }

-static int ec_redo(char ec) +static int ec_redo(char loc, char cmd, char arg) { return lbuf_redo(xb); }

-static int ec_mark(char ec) +static int ec_mark(char loc, char cmd, char arg) {

-static int ec_substitute(char ec) +static int ec_substitute(char loc, char cmd, char arg) {

-static int ec_exec(char ec) +static int ec_exec(char loc, char cmd, char arg) {

-static int ec_make(char ec) +static int ec_make(char log, char cmd, char arg) {

-static int ec_ft(char ec) +static int ec_ft(char loc, char cmd, char arg) {

-static int ec_cmap(char ec) +static int ec_cmap(char loc, char cmd, char arg) {

-static int ec_glob(char ec) +static int ec_glob(char loc, char cmd, char arg) {

-static int ec_set(char ec) +static int ec_set(char loc, char cmd, char arg) {

-/ read ex command argument / -static char ex_arg(char src, char dst) +/ read ex command addresses / +static char ex_loc(char src, char loc) +{

-/ read an ex command and its arguments from src into dst / -static char ex_line(char src, char *dst) -{

kyx0r commented 2 years ago

Thanks for the refactor, now the patch actually stated to make sense. I have applied the changes to nextvi now. In the process I found a few memory leaks not pertaining to the patch though. See this commit: https://github.com/kyx0r/nextvi/commit/bd6dc5cf626a35d8b9743f347763a7da4b989496

The memleaks are in re_read() and ec_substitute() functions. Also take a look how I simplified your ex_arg() function. If the command is 's' I allow escapes to pass though, otherwise I check for them, a single escape will be skipped, if the bar is unescaped we stop. In your implementation I don't know why are you checking for the newline character and the quote character, led_prompt doesn't allow a new line character to be inserted. Just checking for NULL is enough. There is no reason to have this many separators, makes it ugly and complicated to understand.

aligrudi commented 2 years ago

Kyryl @.***> wrote:

Thanks for the refactor, now the patch actually stated to make sense. I have applied the changes to nextvi now. In the process I found a few memory leaks not pertaining to the patch though. See this commit: https://github.com/kyx0r/nextvi/commit/bd6dc5cf626a35d8b9743f347763a7da4b989496

The memleaks are in re_read() and ec_substitute() functions.

Thanks. Both should be fixed now.

Also take a look how I simplified your ex_arg() function. If the command is 's' I allow escapes to pass though, otherwise I check for them, a single escape will be skipped, if the bar is unescaped we stop. In your implementation I don't know why are you checking for the newline character and the quote character, led_prompt doesn't allow a new line character to be inserted. Just checking for NULL is enough. There is no reason to have this many separators, makes it ugly and complicated to understand.

Maybe now it is a good idea to support ~/.exrc?

Ali
kyx0r commented 2 years ago

Well, no I don't think .exrc is a good idea at all. The man page may say it, but generally most people will be compiling from source, so they can customize in source code. This makes it fast, efficient and secure. For anything else we already got EXINIT variable being checked, and having the bar as only separator is sufficient enough.

kyx0r commented 2 years ago

Okay, I changed my solution again, having to check for if a specific command and do some special case handling completely sucks. If someone wants to implement a new ex command they have to be aware of what the heck ex_arg() is doing (and that is not easy to understand) and continue to add into that enourmous if statement which checks for what command is going on. Nextvi has like 5 new ex commands, so obviously I would never want to have a freaking if statement with 10 ors. So instead I opt into a universal solution which will work consitently with any ex command, except the bar and slash condition. Only in that one case in the universe a special treatment is needed. But escape like "|" will remain consistant. And most of the time user will only need one escape, on other cases an | will need more escapes, but this is rather rare in practice.

My simple code:

static char *ex_arg(char *src, char *dst)
{
    while (*src == ' ' || *src == '\t')
        src++;
    while (*src && (*src != '|' || src[-1] == '\\'))
    {
        if (!strncmp(src, "\\\\|", 3))
            src += 2;
        *dst++ = *src++;
    }
    *dst = '\0';
    return src;
}

This is able to get any input/output string combination.

kyx0r commented 2 years ago

Also note that doing src[-1] is completely safe in this situation because there will always be at least 1 byte past src.