Segmentation fault or undefined behavior is possible.

kyx0r commented 3 years ago

Hello @aligrudi I have found a way to break vi by looking at files with all kinds of weird encodings and characters, and basically somehow it is possible for the function uc_code(s); in many places to read past the malloc buffer. The good news is though that it's only a off by 1 error from what I saw on valgrind. Valgrind did not find any bad writes, so it may never crash, but the screen surely does show random artifacts. Overall I don't think it's that worth the hassle ( I cound not solve this myself) but maybe you will have a quick way to place a check somewhere? Generally I think uc_chop may be the root cause but I digress.

==31038== Invalid read of size 1
==31038==    at 0x11BF4C: uc_code (in /usr/bin/vi)
==31038==    by 0x11C5C2: uc_shape (in /usr/bin/vi)
==31038==    by 0x11AD30: led_render (in /usr/bin/vi)
==31038==    by 0x11B7F6: led_print (in /usr/bin/vi)
==31038==    by 0x112A78: vi_drawrow (in /usr/bin/vi)
==31038==    by 0x112BF8: vi_drawagain (in /usr/bin/vi)
==31038==    by 0x110546: main (in /usr/bin/vi)
==31038==  Address 0x49b1268 is 0 bytes after a block of size 104 alloc'd
==31038==    at 0x48D080F: malloc (vg_replace_malloc.c:307)
==31038==    by 0x11772F: lbuf_replace (in /usr/bin/vi)
==31038==    by 0x117DFF: lbuf_rd (in /usr/bin/vi)
==31038==    by 0x116CB6: ec_edit (in /usr/bin/vi)
==31038==    by 0x117382: ex_init (in /usr/bin/vi)
==31038==    by 0x110403: main (in /usr/bin/vi)
==31038==
==31038== Invalid read of size 1
==31038==    at 0x11BF4C: uc_code (in /usr/bin/vi)
==31038==    by 0x11C5C2: uc_shape (in /usr/bin/vi)
==31038==    by 0x11AD4A: led_render (in /usr/bin/vi)
==31038==    by 0x11B7F6: led_print (in /usr/bin/vi)
==31038==    by 0x112A78: vi_drawrow (in /usr/bin/vi)
==31038==    by 0x112BF8: vi_drawagain (in /usr/bin/vi)
==31038==    by 0x110546: main (in /usr/bin/vi)
==31038==  Address 0x49b1268 is 0 bytes after a block of size 104 alloc'd
==31038==    at 0x48D080F: malloc (vg_replace_malloc.c:307)
==31038==    by 0x11772F: lbuf_replace (in /usr/bin/vi)
==31038==    by 0x117DFF: lbuf_rd (in /usr/bin/vi)
==31038==    by 0x116CB6: ec_edit (in /usr/bin/vi)

This kind of stuff but not only for uc_shape.

vi_test.zip

Here is the offending file to reproduce the bug. I had to zip because github is dum dum. Actually, this file is from mpv build source, it has some python at the top but the offending part is at the bottom of the file, which I believe is a binary blob, but still I don't think vi should just give up on that.

kyx0r commented 3 years ago

I does say something about latin-1 encoding, So I think this is because of 128 bit codepoint??? https://en.wikipedia.org/wiki/Latin-1_Supplement_%28Unicode_block%29#Number_of_symbols,_letters_and_control_codes

aligrudi commented 3 years ago

Hello,

Kyryl notifications@github.com wrote:

Hello @aligrudi I have found a way to break vi by looking at files with all kinds of weird encodings and characters, and basically somehow it is possible for the function uc_code(s); in many places to read past the malloc buffer. The good news is though that it's only a off by 1 error from what I saw on valgrind. Valgrind did not find any bad writes, so it may never crash, but the screen surely does show random artifacts. Overall I don't think it's that worth the hassle ( I cound not solve this myself) but maybe you will have a quick way to place a check somewhere? Generally I think uc_chop may be the root cause but I digress.


==31038== Invalid read of size 1
==31038==    at 0x11BF4C: uc_code (in /usr/bin/vi)
==31038==    by 0x11C5C2: uc_shape (in /usr/bin/vi)
==31038==    by 0x11AD30: led_render (in /usr/bin/vi)
==31038==    by 0x11B7F6: led_print (in /usr/bin/vi)
==31038==    by 0x112A78: vi_drawrow (in /usr/bin/vi)
==31038==    by 0x112BF8: vi_drawagain (in /usr/bin/vi)
==31038==    by 0x110546: main (in /usr/bin/vi)
==31038==  Address 0x49b1268 is 0 bytes after a block of size 104 alloc'd
==31038==    at 0x48D080F: malloc (vg_replace_malloc.c:307)
==31038==    by 0x11772F: lbuf_replace (in /usr/bin/vi)
==31038==    by 0x117DFF: lbuf_rd (in /usr/bin/vi)
==31038==    by 0x116CB6: ec_edit (in /usr/bin/vi)
==31038==    by 0x117382: ex_init (in /usr/bin/vi)
==31038==    by 0x110403: main (in /usr/bin/vi)
==31038==
==31038== Invalid read of size 1
==31038==    at 0x11BF4C: uc_code (in /usr/bin/vi)
==31038==    by 0x11C5C2: uc_shape (in /usr/bin/vi)
==31038==    by 0x11AD4A: led_render (in /usr/bin/vi)
==31038==    by 0x11B7F6: led_print (in /usr/bin/vi)
==31038==    by 0x112A78: vi_drawrow (in /usr/bin/vi)
==31038==    by 0x112BF8: vi_drawagain (in /usr/bin/vi)
==31038==    by 0x110546: main (in /usr/bin/vi)
==31038==  Address 0x49b1268 is 0 bytes after a block of size 104 alloc'd
==31038==    at 0x48D080F: malloc (vg_replace_malloc.c:307)
==31038==    by 0x11772F: lbuf_replace (in /usr/bin/vi)
==31038==    by 0x117DFF: lbuf_rd (in /usr/bin/vi)
==31038==    by 0x116CB6: ec_edit (in /usr/bin/vi)

Neatvi assumes that input files are in UTF-8. It decides how many bytes it should read for a UTF-8 character, based on its first byte (see uc_code()).

The issue may be fixed by updating uc_code as follows, but I am not sure how the performace would suffer. Also note that even if it handles badly encoded UTF-8 files, it cannot edit binary files (they may contain zero bytes).

int uc_code(char s) { int c = (unsigned char) s[0]; if (~c & 0xc0) / ASCII or invalid */ return c; if (~c & 0x20) return ((c & 0x1f) << 6) | (s[1] & 0x3f); if (~c & 0x10 && s[1]) return ((c & 0x0f) << 12) | ((s[1] & 0x3f) << 6) | (s[2] & 0x3f); if (~c & 0x08 && s[1] && s[2]) return ((c & 0x07) << 18) | ((s[1] & 0x3f) << 12) | ((s[2] & 0x3f) << 6) | (s[3] & 0x3f); return 0; }

Ali

kyx0r commented 3 years ago

Thank you. I will try running this patch in my version. I also thought about adding null checks, but I did not try it because for whatever reason I considered that this wasn't uninitialized memory read but a past the buffer memory read? Generally this would not work against those types of bugs, but now I see that it was not right in my assumption.

I don't think the performance would suffer too badly from this, especially if the code is written like this:

/* the unicode codepoint of the given utf-8 character */
int uc_code(char *s)
{
        int c = (unsigned char) s[0];
        int cnot = ~c;
        if (cnot & 0xc0)        /* ASCII or invalid */
                return c;
        if (cnot & 0x20)
                return ((c & 0x1f) << 6) | (s[1] & 0x3f);
        if (s[1])
        {
                if (cnot & 0x10)
                        return ((c & 0x0f) << 12) | ((s[1] & 0x3f) << 6) | (s[2] & 0x3f);
                if (cnot & 0x08 && s[2])
                        return ((c & 0x07) << 18) | ((s[1] & 0x3f) << 12) |
                                ((s[2] & 0x3f) << 6) | (s[3] & 0x3f);
        }
        return 0;
}

kyx0r commented 3 years ago

@aligrudi Thanks for the help again. As you can see I like to take a more streamlined approach to writing code, assuming 0 optimizations from the compiler (while imagining the assembly code in my head), so that the compiler can focus on doing the "hard" optimizations that actually matter. I would probably merge this patch for safe than sorry reasons, but I leave this one completely up to you. Closing this issue.

aligrudi / neatvi

Segmentation fault or undefined behavior is possible. #25