blarApp / code-base-agent

Code agents for LLMs
https://blar.io
MIT License
34 stars, 3 forks

Explore whether stack graphs may be useful in this tool #69

Open 0xdevalias opened 3 months ago

0xdevalias commented 3 months ago

Not really a feature request per se, but came across this project after it was mentioned in another issue (Ref), and figured I would create an issue to share the same info here in case it's useful.

I notice that you're already using tree-sitter, so it may not be a lot of extra effort to make use of tree-sitter-graph and/or the stack-graphs project (eg. tree-sitter-stack-graphs-javascript, etc)

From watching the demo video on linkedin + briefly skimming this repo, it looks like there might be some useful crossovers, and it may allow you to add support for a whole bunch more languages without needing to re-invent the wheel to do so.

A few notes/links/references I recently collated RE: stack graphs + related libs:

Stack Graphs (an evolution of Scope Graphs) sound like they could be really interesting/useful with regards to code navigation, symbol mapping, etc. Perhaps we could use them for module identification, or variable/function identifier naming stabilisation or similar?

  • https://github.blog/changelog/2024-03-14-precise-code-navigation-for-typescript-projects/
    • Precise code navigation is now available for all TypeScript repositories. Precise code navigation gives more accurate results by only considering the set of classes, functions, and imported definitions that are visible at a given point in your code.

      Precise code navigation is powered by the stack graphs framework. You can read about how we use stack graphs for code navigation and visit the stack graphs definition for TypeScript to learn more.

    • https://github.blog/2021-12-09-introducing-stack-graphs/
      • Introducing stack graphs

      • Precise code navigation is powered by stack graphs, a new open source framework we’ve created that lets you define the name binding rules for a programming language using a declarative, domain-specific language (DSL). With stack graphs, we can generate code navigation data for a repository without requiring any configuration from the repository owner, and without tapping into a build process or other CI job.

      • LOTS of interesting stuff in this post..
      • As part of developing stack graphs, we’ve added a new graph construction language to Tree-sitter, which lets you construct arbitrary graph structures (including but not limited to stack graphs) from parsed CSTs. You use stanzas to define the gadget of graph nodes and edges that should be created for each occurrence of a Tree-sitter query, and how the newly created nodes and edges should connect to graph content that you’ve already created elsewhere.

      • https://github.com/tree-sitter/tree-sitter-graph
      • Why aren’t we using the Language Server Protocol (LSP) or Language Server Index Format (LSIF)?

        To dig even deeper and learn more, I encourage you to check out my Strange Loop talk and the stack-graphs crate: our open source Rust implementation of these ideas.

      • https://github.com/github/stack-graphs
        • Stack graphs: The crates in this repository provide a Rust implementation of stack graphs, which allow you to define the name resolution rules for an arbitrary programming language in a way that is efficient, incremental, and does not need to tap into existing build or program analysis tools.

        • https://docs.rs/stack-graphs/latest/stack_graphs/
        • https://github.com/github/stack-graphs/tree/main/languages
        • This directory contains stack graphs definitions for specific languages.

        • https://github.com/github/stack-graphs/tree/main/languages/tree-sitter-stack-graphs-javascript
          • tree-sitter-stack-graphs definition for JavaScript: This project defines tree-sitter-stack-graphs rules for JavaScript using the tree-sitter-javascript grammar.

          • The command-line program for tree-sitter-stack-graphs-javascript lets you do stack graph based analysis and lookup from the command line.

          • cargo install --features cli tree-sitter-stack-graphs-javascript
          • tree-sitter-stack-graphs-javascript index SOURCE_DIR
          • tree-sitter-stack-graphs-javascript status SOURCE_DIR
          • tree-sitter-stack-graphs-javascript query definition SOURCE_PATH:LINE:COLUMN
        • https://github.com/github/stack-graphs/tree/main/languages/tree-sitter-stack-graphs-typescript
          • tree-sitter-stack-graphs definition for TypeScript: This project defines tree-sitter-stack-graphs rules for TypeScript using the tree-sitter-typescript grammar.

          • The command-line program for tree-sitter-stack-graphs-typescript lets you do stack graph based analysis and lookup from the command line.

      • https://dcreager.net/talks/2021-strange-loop/
  • https://docs.github.com/en/repositories/working-with-files/using-files/navigating-code-on-github
    • GitHub has developed two code navigation approaches based on the open source tree-sitter and stack-graphs library:

      • Search-based - searches all definitions and references across a repository to find entities with a given name
      • Precise - resolves definitions and references based on the set of classes, functions, and imported definitions at a given point in your code

      To learn more about these approaches, see "Precise and search-based navigation."

    • https://docs.github.com/en/repositories/working-with-files/using-files/navigating-code-on-github#precise-and-search-based-navigation
      • Precise and search-based navigation: Certain languages supported by GitHub have access to precise code navigation, which uses an algorithm (based on the open source stack-graphs library) that resolves definitions and references based on the set of classes, functions, and imported definitions that are visible at any given point in your code. Other languages use search-based code navigation, which searches all definitions and references across a repository to find entities with a given name. Both strategies are effective at finding results and both make sure to avoid inappropriate results such as comments, but precise code navigation can give more accurate results, especially when a repository contains multiple methods or functions with the same name.

  • https://pl.ewi.tudelft.nl/research/projects/scope-graphs/
    • Scope Graphs | A Theory of Name Resolution

    • Scope graphs provide a new approach to defining the name binding rules of programming languages. A scope graph represents the name binding facts of a program using the basic concepts of declarations and reference associated with scopes that are connected by edges. Name resolution is defined by searching for paths from references to declarations in a scope graph. Scope graph diagrams provide an illuminating visual notation for explaining the bindings in programs.
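To make the "search for a path from a reference to a declaration" idea concrete, here is a minimal Python sketch. It is a heavily simplified toy that only models lexical parent edges; it is not the scope graph calculus from the TU Delft work nor the stack-graphs algorithm (no imports, no symbol stack). The names `Scope` and `resolve`, and the file locations, are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Scope:
    """A scope holding declarations, with an optional edge to its parent scope."""

    name: str
    parent: Optional[Scope] = None
    declarations: dict[str, str] = field(default_factory=dict)

    def declare(self, identifier: str, location: str) -> None:
        """Record that `identifier` is declared at `location` in this scope."""
        self.declarations[identifier] = location


def resolve(reference: str, scope: Optional[Scope]) -> Optional[tuple[str, list[str]]]:
    """Follow parent edges from the reference's scope until a declaration matches.

    Returns (declaration location, names of the scopes visited), or None
    when no enclosing scope declares the name.
    """
    path: list[str] = []
    while scope is not None:
        path.append(scope.name)
        if reference in scope.declarations:
            return scope.declarations[reference], path
        scope = scope.parent
    return None


# Hypothetical program: a module with two declarations and a function
# whose body re-declares `x`, shadowing the module-level one.
module = Scope("module")
module.declare("helper", "utils.py:1")
module.declare("x", "utils.py:3")
function = Scope("function body", parent=module)
function.declare("x", "utils.py:6")
```

Resolving `x` from inside the function stops at the nearest declaration, while `helper` is only found after following the edge to the module scope. Real scope graphs and stack graphs generalise this with labelled edges and, in the stack-graphs case, a symbol stack for qualified and imported names.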

Originally posted by @0xdevalias in https://github.com/0xdevalias/chatgpt-source-watch/issues/11

berrazuriz1 commented 3 months ago

Hi, thanks for the tips and the reading list. I had already read most of them; only LSP and LSIF are still pending.

We are definitely looking into the option of using tree-sitter-graph. I'm currently weighing two options:

  1. Integrating 'tree-sitter-graph'. This has several pros like performance enhancement, not reinventing the wheel, and meeting industry standards. The problem is that I'm not sure about the complexity it will introduce to the code, especially handling a CLI integration with the tool.

  2. Creating my own implementation of a stack-graph in Python. Keeping everything in-house will simplify the architecture, make maintenance easier, and is a fun challenge 😃 . The downside is obvious: it will be complex and time-consuming to replicate stack-graph logic, and Python doesn't perform as well as Rust.

I think I'm going to explore the integration option, try to make it run, and if it's not too difficult, that's the route I'll go.
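For a sense of what the CLI-integration option might look like from Python, here is a minimal sketch that shells out to the `tree-sitter-stack-graphs-javascript` binary using the subcommands listed earlier in this thread (`index` and `query definition`). The helper names (`index_command`, `definition_command`, `run`) are hypothetical, the sketch assumes the binary is installed and on `PATH`, and it returns the raw stdout unparsed because the output format is not described here.

```python
import subprocess
from typing import Callable, Sequence

# Binary name as given in the install instructions quoted in this thread.
CLI = "tree-sitter-stack-graphs-javascript"


def index_command(source_dir: str) -> list[str]:
    """Build the argv for indexing a source directory."""
    return [CLI, "index", source_dir]


def definition_command(source_path: str, line: int, column: int) -> list[str]:
    """Build the argv for a definition lookup at SOURCE_PATH:LINE:COLUMN."""
    return [CLI, "query", "definition", f"{source_path}:{line}:{column}"]


def run(
    command: Sequence[str],
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Run the CLI and return its raw stdout.

    With the default runner, a nonzero exit status raises
    subprocess.CalledProcessError. The `runner` parameter exists only so
    the call can be replaced with a fake in tests.
    """
    result = runner(list(command), capture_output=True, text=True, check=True)
    return result.stdout
```

A typical flow would be to call `run(index_command("path/to/repo"))` once and then `run(definition_command("path/to/repo/app.js", 10, 4))` for each lookup. Passing arguments as a list rather than a shell string avoids quoting problems with unusual file paths.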

berrazuriz1 commented 3 months ago

I was checking LSP and it seems to be a really powerful tool. But it does a lot more than what I really need at this stage, because we are focused on giving LLMs a tool to navigate and understand code repositories in an easy way. For that, I think the best approach is to implement 'tree-sitter-graph' first and later see where LSP could help.

0xdevalias commented 3 months ago

We are definitely looking into the option of using tree-sitter-graph

I think I'm going to explore the integration option, try to make it run, and if it's not too difficult, that's the route I'll go.

@berrazuriz1 I would make sure to look not only at tree-sitter-graph (the more generic underlying lib), but also at the more specific stack-graphs project (which is built on top of tree-sitter-graph):

eg.


The problem is that I'm not sure about the complexity it will introduce to the code, especially handling a CLI integration with the tool.

@berrazuriz1 Yeah, that's fair enough. I believe there are also ways to call/embed Rust within Python, but that may end up being even more complex than shelling out to the CLI.

e.g. Some random related links:


I was checking LSP and it seems to be a really powerful tool. But it does a lot more than what I really need at this stage

@berrazuriz1 Yeah, I think LSP is likely not relevant at this stage. I really only included the links there as part of the "why we aren't using" context.

berrazuriz1 commented 3 months ago

I think calling Rust from Python is a good solution. I'm going to give it a try.