feat: rewrite the scanner in C

razzeee commented 1 year ago

What's the reason for this?

amaanq commented 1 year ago

So projects using a pure C toolchain can leverage embedding this more easily - and for ease of compilation/portability across various platforms

razzeee commented 1 year ago

For context, this will make the scanner unmaintainable for me, as even doing it in cpp, was hard enough - handling or hunting down stuff in C will be a nightmare.

So that needs to be weighted against embedding pains.

amaanq commented 1 year ago

I can take care of maintaining the scanner if you'd like - just shoot a ping if any issue arises

I can also add a CI workflow to fuzz the scanner to check for any crashes/undefined behavior

razzeee commented 1 year ago

Both would be greatly appreciated

amaanq commented 1 year ago

Forgot typeof is a gnu/gcc extension not supported in the C standard... (and why ci was failing)

razzeee commented 1 year ago

Can you maybe explain, what that fuzzer output means?

amaanq commented 1 year ago

The fuzzer generates a random dictionary with a seed of input (biasing) which is obtained by just jq'ing grammar.json for string values. It then runs thousands of tests feeding in random bytes of input with the seed biasing so that more likely bugs/errors are to be hit (for the scanner too). The fuzzer then hooks all function calls and looks for several things - leaked memory, segfaults/memory access violations, deadlocks/endless loops, buffer overflows, and more. It is very very helpful for writing c/c++ code for an external scanner.

Any error outputs would be obvious - here are some common ones

You can trigger these by double freeing scanner at the end, printing out scanner->runback.data[0] at the end (after freeing the memory), or not freeing scanner at the end

Here is a simpler script to play with:

#!/usr/bin/env sh

set -eu

ROOT_DIR="fuzzer"

# XXX: ensure shift below is consistent with number of args here
LANG=$1
TIMEOUT=$2
MAX_TOTAL_TIME=$3
CPP=$4

shift 4

# if scanner = scanner.cc then XFLAG = c++ else XFLAG = c
if [ "$CPP" = "cpp" ]; then
        SCANNER="scanner.cc"
        XFLAG="c++"
else
        SCANNER="scanner.c"
        XFLAG="c"
fi

export CFLAGS="$(pkg-config --cflags --libs tree-sitter) -O0 -g -w"

JQ_FILTER='.. | if .type? == "STRING" or (.type? == "ALIAS" and .named? == false) then .value else null end'

build_dict() {
        jq "$JQ_FILTER" <src/grammar.json |
                grep -v "\\\\" | grep -v null |
                iconv -c -f UTF-8 -t ASCII//TRANSLIT |
                awk '!/^""$/' >"$ROOT_DIR/dict"
}

build_fuzzer() {
        cat <<END | clang -fsanitize=fuzzer,address $CFLAGS -lstdc++ -g -x $XFLAG - src/$SCANNER src/parser.c $@ -o $ROOT_DIR/fuzzer
#include <stdio.h>
#include <stdlib.h>
#include <tree_sitter/api.h>

#ifdef __cplusplus
extern "C"
#endif
TSLanguage *tree_sitter_$LANG();

#ifdef __cplusplus
extern "C"
#endif
int LLVMFuzzerTestOneInput(const uint8_t * data, const size_t len) {
  // Create a parser.
  TSParser *parser = ts_parser_new();

  // Set the parser's language.
  ts_parser_set_language(parser, tree_sitter_$LANG());

  // Build a syntax tree based on source code stored in a string.
  TSTree *tree = ts_parser_parse_string(
    parser,
    NULL,
    (const char *)data,
    len
  );
  // Free all of the heap-allocated memory.
  ts_tree_delete(tree);
  ts_parser_delete(parser);
  return 0;
}
END
}

generate_fuzzer() {
        tree-sitter generate
}

makedirs() {
        mkdir -p "$ROOT_DIR"
        mkdir -p "$ROOT_DIR/out"
}

makedirs
# generate_fuzzer

build_dict
build_fuzzer $@
cd "$ROOT_DIR"
./fuzzer -dict=dict -timeout=$TIMEOUT -max_total_time=$MAX_TOTAL_TIME out/

and just run ./fuzz.sh elm 1 10 c args: language, max timeout per single test (they're very fast so 1s timeout is fair), time to fuzz for, and scanner language (c or cpp)

razzeee commented 1 year ago

So it should be failing if it actually hits something right?

Wondering if that log should have colors to be better parseable, no idea why github doesn't.

razzeee commented 1 year ago

I also don't get why it checks out twice

amaanq commented 1 year ago

yes, the ci action would fail, and erroneous output is colored.

I'm not sure what you mean by checks out twice

razzeee commented 1 year ago

Nevermind, probably just something weird about the mobile version I looked at at the time.

FYI the Test full builds are not failing and I usually compare the reported numbers with the presious run(s) on main.

This branch: Total parses: 19869; successful parses: 19796; failed parses: 73; success percentage: 99.63%

Main: Total parses: 19618; successful parses: 19545; failed parses: 73; success percentage: 99.63%

So it seems, like the parser is working as expected, nice work.

amaanq commented 1 year ago

yep I noticed they're the same when testing - thanks for merging!

razzeee commented 7 months ago

hey @amaanq hope your doing good, it was brought to my attention, that this happened in a downstream https://github.com/zed-industries/zed/pull/2794 do you have thoughts about that?

amaanq commented 7 months ago

Hey, I'm great, hope you're well as well. I checked that out, I can't really see immediately what would cause that issue - every helper function is marked static so it won't collide with other symbols. I'll try building Zed later today to see what it would be

elm-tooling / tree-sitter-elm

feat: rewrite the scanner in C #136