benbrandt / text-splitter

Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from Rust and Python.
MIT License

Segmentation Error in Python #265

Closed Goldziher closed 3 weeks ago

Goldziher commented 1 month ago

Describe the bug

Hi, I tried to split code in Python and got a segmentation fault. I tried several tree-sitter language packages, and all of them produce a segmentation fault. I identified the cause: it happens when the language is passed in without being instantiated.

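from semantic_text_splitter import CodeSplitter


def test_code_splitter() -> None:
    from tree_sitter_cpp import language

    # Bug: `language` is passed as a function object instead of being called.
    code_splitter = CodeSplitter(language, capacity=100)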

    cpp_code = """
        #include <bits/stdc++.h>
        using namespace std;

        int main()
        {
            vector <int> arr1 = {1, 2, 3, 4};
            vector <int> arr2 = {};
            vector <float> arr3 = {1.2, 3.8, 3.0, 2.7, 6.6};

            cout << "Size of arr1: " << arr1.size() << endl;
            cout << "Size of arr2: " << arr2.size() << endl;
            cout << "Size of arr3: " << arr3.size() << endl;

            return 0;
        }
    """

    chunks = list(code_splitter.chunks(cpp_code))
    assert len(chunks) == 7

This fails because language is passed as a function object rather than being called. It should be language().

This works:

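from semantic_text_splitter import CodeSplitter


def test_code_splitter() -> None:
    from tree_sitter_cpp import language

    # Fix: call language() so the splitter receives the language handle itself.
    code_splitter = CodeSplitter(language(), capacity=100)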

    cpp_code = """
        #include <bits/stdc++.h>
        using namespace std;

        int main()
        {
            vector <int> arr1 = {1, 2, 3, 4};
            vector <int> arr2 = {};
            vector <float> arr3 = {1.2, 3.8, 3.0, 2.7, 6.6};

            cout << "Size of arr1: " << arr1.size() << endl;
            cout << "Size of arr2: " << arr2.size() << endl;
            cout << "Size of arr3: " << arr3.size() << endl;

            return 0;
        }
    """

    chunks = list(code_splitter.chunks(cpp_code))
    assert len(chunks) == 7

If you are open to some adjustments to the Python bindings, I can improve the typing and make sure this case is handled, for example by allowing a Language instance to be passed in.
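Until the bindings handle this, a caller-side guard could look roughly like the sketch below. Note that `resolve_language` is a hypothetical helper, not part of semantic-text-splitter or tree-sitter. It assumes that the language handle returned by a grammar package is not itself callable, so anything callable can be treated as the uncalled factory function.

```python
from typing import Any


def resolve_language(language: Any) -> Any:
    """Return a language handle, calling `language` first if it is a factory.

    Hypothetical helper: assumes the handle itself is never callable,
    so a callable argument must be the uncalled `language` function.
    """
    if callable(language):
        # The caller passed `language` instead of `language()`; call it here.
        return language()
    # Already a handle (or something else the splitter will validate).
    return language
```

With that in place, `CodeSplitter(resolve_language(language), capacity=100)` would receive the handle whether or not the caller remembered the parentheses.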

benbrandt commented 1 month ago

Hi @Goldziher, thanks for reporting. I'll give it some thought. I think I'd rather return a better error message, because the language is supposed to be instantiated, but the current failure mode is admittedly quite severe.

benbrandt commented 3 weeks ago

@Goldziher this should now return a more helpful error (and not crash Python) if it is encountered. Thanks for reporting!