DerekStride / tree-sitter-sql

SQL grammar for tree-sitter
http://derek.stride.host/tree-sitter-sql/
MIT License
159 stars 54 forks source link

segfault #221

Closed kirillrdy closed 1 year ago

kirillrdy commented 1 year ago

Hi

somewhere between 44e36038f032eff2f36522642a8cc4351e133f83 ( good ) and 764c1bd5ea856dd7db5171783e442321dc3e041f (bad) segfault when opening sql file.

unfortunately I can not post sql file here ( prod code )

I am not familiar enough with tree-sitter, whats the easiest way to run parser though any sql file ?

I've tried adding offending sql file to test/highlight but that did't do anything

syntax highlighting:
  ✓ union.sql (13 assertions)
  ✓ query.sql (14 assertions)
  ✓ structure.sql (0 assertions) <--- this is the file that causes segfault in neovim
matthias-Q commented 1 year ago

Hi, thanks for reporting. I think the best way is, that you try to run it locally. Install the tree-sitter-cli (either via npm or cargo), clone this repo and run tree-sitter generate and then tree-sitter parse structure.sql

We have extended some regex patterns for capturing numbers, but I am not sure if this is the root cause. Would be nice if you could try to generate a simple SQL statement where this seg fault is triggered.

DerekStride commented 1 year ago

The scanner was added in that range, I suspect that it could be the root cause.

@kirillrdy would that file happen to use dollar quoted strings? It would be helpful if you could reproduce it with a similar SQL snippet that is shareable.

matthias-Q commented 1 year ago

The scanner was added before https://github.com/DerekStride/tree-sitter-sql/commit/44e36038f032eff2f36522642a8cc4351e133f83 But I also suspect that is has something to do with the scanner.c

kirillrdy commented 1 year ago

hi, running tree-sitter parse structure.sql doesn't cause segfault. I need to reproduce inside nixpkgs infrastructure, I believe they apply hardening flags eg fortify3

kirillrdy commented 1 year ago

using state of art binary search, here is minimal snipplet that causes segfault ( within nixpkgs), but doesn't segfault run tree-sitter parse

CREATE FUNCTION public.foo() RETURNS trigger
    LANGUAGE plpgsql
    AS $$
BEGIN
END;
$$;

the issue is as $$, when I change it to as bar it works

matthias-Q commented 1 year ago

Hmm interesting. For me your snippet does not through a segfault, but it is also not parsed correctly.

Same for the one reported in https://github.com/nvim-treesitter/nvim-treesitter/issues/5646

matthias-Q commented 1 year ago

I have little to no experience with C, but I have run scanner.c through static code analysis (cppcheck) and it reports the following:

src/scanner.c:29:3: error: Found an exit path from function with non-void return type that has missing return statement [missingReturn]
  }
  ^
src/scanner.c:63:5: error: Memory leak: text_size [memleak]
    return NULL;
    ^
Active checkers: 40/565

The second error is in the function scan_dollar_string_tag(). @DerekStride @dmfay any ideas how to fix that?

[EDIT]: maybe a free(text_size); in line 63 before the return NULL; is enough?

matthias-Q commented 1 year ago

Here is patch:

diff --git a/src/scanner.c b/src/scanner.c
index 68807eb..885b613 100644
--- a/src/scanner.c
+++ b/src/scanner.c
@@ -28,6 +28,7 @@ void *tree_sitter_sql_external_scanner_destroy(void *payload) {
     state->start_tag = NULL;
   }
   free(payload);
+  return NULL;
 }

 char* add_char(char* text, int* text_size, char c, int index) {
@@ -60,6 +61,7 @@ char* scan_dollar_string_tag(TSLexer *lexer) {
     tag = add_char(tag, text_size, '$', index);
     lexer->advance(lexer, false);
   } else {
+    free(text_size);
     return NULL;
   }

Would be nice if someone could check if this works. I cannot reproduce the segfault on my machine.

mortezadadgar commented 1 year ago

I can reproduce it locally and it seems to be related to fortify3

$ rust-gdb tree-sitter
Reading symbols from tree-sitter...
Reading symbols from /usr/lib/debug//usr/bin/tree-sitter.debug...
(gdb) run parse db_create.sql
Starting program: /usr/bin/tree-sitter parse db_create.sql
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff7d826c0 (LWP 11265)]
*** buffer overflow detected ***: terminated

Thread 1 "tree-sitter" received signal SIGABRT, Aborted.
0x00007ffff7e0e5fc in ?? () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff7e0e5fc in ?? () from /lib64/libc.so.6
#1  0x00007ffff7dbf516 in raise () from /lib64/libc.so.6
#2  0x00007ffff7da8877 in abort () from /lib64/libc.so.6
#3  0x00007ffff7da9804 in ?? () from /lib64/libc.so.6
#4  0x00007ffff7e9fc1b in __fortify_fail () from /lib64/libc.so.6
#5  0x00007ffff7e9e3f6 in __chk_fail () from /lib64/libc.so.6
#6  0x00007ffff7e9dc66 in __strcpy_chk () from /lib64/libc.so.6
#7  0x00007ffff7f9555b in ts_parser_parse () from /usr/lib64/libtree-sitter.so.0
#8  0x00005555558f1503 in tree_sitter::Parser::parse_with<&[u8], tree_sitter::{impl#2}::parse::{closure_env#0}<&alloc::vec::Vec<u8, alloc::alloc::Global>>> (self=0x7ffffffdc7a8,
    callback=0x7ffffffdbfc0, old_tree=...) at lib/binding_rust/lib.rs:537
#9  0x00005555558f17ce in tree_sitter::Parser::parse<&alloc::vec::Vec<u8, alloc::alloc::Global>> (self=0x7ffffffdc7a8, text=0x7ffffffdc7c0, old_tree=...) at lib/binding_rust/lib.rs:465
#10 0x00005555558723a5 in tree_sitter_cli::parse::parse_file_at_path (language=..., path=..., edits=0x7fffffffc400, max_path_length=13, output=tree_sitter_cli::parse::ParseOutput::Normal,
    print_time=false, timeout=0, debug=false, debug_graph=false, cancellation_flag=...) at cli/src/parse.rs:81
#11 0x000055555561cd9c in tree_sitter::run () at cli/src/main.rs:434
#12 0x000055555561711c in tree_sitter::main () at cli/src/main.rs:20

for your reference parsed this file:

DO
$$
BEGIN
  IF NOT EXISTS (
    SELECT 
    FROM   pg_catalog.pg_roles
    WHERE  rolname = 'store_user') THEN

    CREATE ROLE store_user WITH LOGIN;
  END IF;
END
$$;
matthias-Q commented 1 year ago

@mortezadadgar does my patch fix the issue?

mortezadadgar commented 1 year ago

No your patch is unrelated to this issue, according to gdb it's related to one of strcpy sadly it's doesn't tell which line is causing overflow btw do you know why make generate takes forever to be done? it's really hard to find the source of issue this way

matthias-Q commented 1 year ago

@mortezadadgar @kirillrdy I was able to reproduce it. Seems like using -D_FORTIFY_SOURCE=2 works. If am reading the nix cc wrapper scripts correctly, you can compile it with fortify instead of fortify3.

https://github.com/NixOS/nixpkgs/blob/master/pkgs/build-support/cc-wrapper/add-hardening.sh

On the flip side, I would really like to know whats causing this. My patch however has no effect.

matthias-Q commented 1 year ago

make generate takes a long time because this parser has a really high state count.

mortezadadgar commented 1 year ago

I investigated around this issue a little bit, i'm quit sure about it but it might be tree-sitter issue as it turns out returning '$$' if we add a return "$$" before this line: https://github.com/DerekStride/tree-sitter-sql/blob/b24e6759c0c05875b4929cc868bb6494ff797b83/src/scanner.c#L71 buffer overflow can be reproduced with fortify3 without involvement of any strcpy usage. and gdb references the overflow from tree-sitter itself so the chances that it's a not related to scanner is pretty high

matthias-Q commented 1 year ago

I took a look at the add_char function and there are some bounds checks missing. I have tried to add them but it does not solve the issue:

diff --git a/src/scanner.c b/src/scanner.c
index 68807eb..8237ddb 100644
--- a/src/scanner.c
+++ b/src/scanner.c
@@ -30,18 +30,17 @@ void *tree_sitter_sql_external_scanner_destroy(void *payload) {
   free(payload);
 }

-char* add_char(char* text, int* text_size, char c, int index) {
+char* add_char(char* text, size_t* text_size, char c, int index) {
   if (text == NULL) {
     text = malloc(sizeof(char) * MALLOC_STRING_SIZE);
     *text_size = MALLOC_STRING_SIZE;
-    text[index] = c;
-    text[index + 1] = '\0';
   }

-  if (index + 1 == *text_size) {
+  // will break when indexes advances more than MALLOC_STRING_SIZE
+  if (index + 1 >= *text_size) {
     *text_size += MALLOC_STRING_SIZE;
     char* tmp = malloc(*text_size * sizeof(char));
-    strcpy(tmp, text);
+    strncpy(tmp, text, *text_size);
     free(text);
     text = tmp;
   }
@@ -54,12 +53,13 @@ char* add_char(char* text, int* text_size, char c, int index) {
 char* scan_dollar_string_tag(TSLexer *lexer) {
   char* tag = NULL;
   int index = 0;
-  int* text_size = malloc(sizeof(int));
+  size_t* text_size = malloc(sizeof(size_t));
   *text_size = 0;
   if (lexer->lookahead == '$') {
     tag = add_char(tag, text_size, '$', index);
     lexer->advance(lexer, false);
   } else {
+    free(text_size);
     return NULL;
   }

I suspect that the compiler adds code around these unchecked code parts and they crash when -D_FORTIFY_SOURCE=3 is set.

matthias-Q commented 1 year ago

@kirillrdy @mortezadadgar can you check the code changes I submitted in #226 On my machine, this does not segfault anymore.