Robitx / gp.nvim

Gp.nvim (GPT prompt) Neovim AI plugin: ChatGPT sessions & Instructable text/code operations & Speech to text [OpenAI]
MIT License

feat: use tree sitter to enhance context for code operations #51

Open Robitx opened 7 months ago

Robitx commented 7 months ago

https://neovim.io/doc/user/treesitter.html#lua-treesitter
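
A minimal sketch of what this could look like, assuming Neovim 0.9+ with a treesitter parser attached to the buffer (the node type names are illustrative and differ per language grammar):

```lua
-- Walk up from the node under the cursor to the enclosing function and return its
-- text, which could be appended to the prompt context for a code operation.
local function get_enclosing_function_text()
  local node = vim.treesitter.get_node()
  while node do
    local t = node:type()
    -- node type names vary per grammar; these cover a few common languages
    if t == "function_definition" or t == "function_declaration" or t == "method_definition" then
      return vim.treesitter.get_node_text(node, 0)
    end
    node = node:parent()
  end
  return nil
end
```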

sirupsen commented 7 months ago

@Robitx even better might be to embed the code-base with a local model (or even OpenAI) and pull the most relevant pieces into the context, so it can be done cross-file
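
A rough sketch of the retrieval side of that idea, assuming chunk embeddings for the repo have already been computed (the `chunks` table and its fields are hypothetical; the embeddings could come from OpenAI's embeddings endpoint or a local model):

```lua
-- Rank pre-computed code chunks by cosine similarity to the query embedding and
-- return the top k texts to splice into the prompt context.
local function top_k_chunks(query_embedding, chunks, k)
  local function cosine(a, b)
    local dot, na, nb = 0, 0, 0
    for i = 1, #a do
      dot = dot + a[i] * b[i]
      na = na + a[i] * a[i]
      nb = nb + b[i] * b[i]
    end
    return dot / (math.sqrt(na) * math.sqrt(nb) + 1e-8)
  end

  local scored = {}
  for _, chunk in ipairs(chunks) do
    table.insert(scored, { text = chunk.text, score = cosine(query_embedding, chunk.embedding) })
  end
  table.sort(scored, function(a, b) return a.score > b.score end)

  local out = {}
  for i = 1, math.min(k, #scored) do
    out[i] = scored[i].text
  end
  return out
end
```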

teocns commented 7 months ago

I recommend brainstorming with very large projects in mind, such as Chromium (38 million LoC).

Disclaimer: I am not a seasoned vim user, hence my knowledge might be limited

Per my understanding, Treesitter doesn't understand the semantics or the context beyond the structure of the code. In my typical development routine, most of the time the practical challenge is studying the workflow and lifecycle of components (symbols) across the [large] codebase.

Other tools (such as EasyCodeAI) have tackled this by implementing their own self-hosted codebase indexer, which comes with its own limitations.

I think focusing on the LSP's symbol-referencing mechanism could be key; however, with large codebases there are additional challenges, such as overloading the context with files that are not actually of interest. What sounds plausible to me at first glance is a multi-stage prompt operation: in the first stage, GPT is fed context built from the referenced symbols and discards those that are not of interest (based on a score); only then is it fed the cleaner context. The latest GitHub Copilot Chat VSCode extension works this way.

To set expectations, we're also limited by the context window current GPT models support; the largest tops out at 128k tokens.
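
A minimal sketch of the gathering step of that multi-stage idea, assuming an LSP server is attached to the buffer: collect the reference locations for the symbol under the cursor, which a later scoring/filtering prompt stage could then prune before building the final context:

```lua
-- Request all references to the symbol under the cursor and hand the raw list of
-- LSP Location objects (uri + range) to a callback for later scoring/filtering.
local function collect_references(callback)
  local params = vim.lsp.util.make_position_params()
  params.context = { includeDeclaration = true }
  vim.lsp.buf_request(0, "textDocument/references", params, function(err, result)
    if err or not result then
      callback({})
      return
    end
    callback(result)
  end)
end
```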

teocns commented 7 months ago

To add more thoughts on the topic, I'd also like to highlight the indexing challenge. So far I have worked with two kinds of indexing mechanisms: static and dynamic.

Looking at clangd as an example:

Static indexing

It pre-indexes an entire codebase based on compile_commands.json (in the case of clangd), stores the index on disk, and gives you instantaneous symbol resolution across the whole codebase.

The pro and con are, respectively: instantaneous project-wide symbol resolution, at the expense of having to pre-index the entire codebase before it can be consumed (around ~3 hours on a project like Chromium, even on the most powerful Mac).
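
For reference, a minimal sketch of enabling clangd's static (background) index from Neovim, assuming nvim-lspconfig is used; the compile_commands.json directory is a hypothetical path:

```lua
-- Start clangd with background indexing so the whole project is pre-indexed to disk,
-- giving instantaneous project-wide symbol resolution once indexing finishes.
require("lspconfig").clangd.setup({
  cmd = {
    "clangd",
    "--background-index",            -- build and persist the project-wide index
    "--compile-commands-dir=build",  -- hypothetical location of compile_commands.json
  },
})
```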

Dynamic indexing

Symbols are indexed dynamically based on the current buffer (by analyzing imports/includes). This is fast and handy, especially for work scoped to a module or sub-module, but it does not reach outer contexts: a significant drawback given how important out-of-scope context is when GPT serves as a "codebase-wide assistant".