Exafunction / codeium-parse

A command line tool for parsing code syntax
MIT License


codeium-parse

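This repository contains a binary built with tree-sitter that lets you print the syntax tree of a source file, run tree-sitter queries against it, and emit the results as JSON.

In particular, this repo provides a binary prepackaged with a set of tree-sitter grammars (see Grammars) and query predicates (see Query predicates).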

Contributions are welcome and we encourage using this tool for any applications that involve code syntax analysis. For example, these queries are used by Codeium Search to index code locally for repo-wide semantic search. If you use Codeium Search, adding queries for your language here will enable it to work better on your own code!

Example

(Requires fd and jq.)

# Print all names and arguments from function definitions.
fd -e js \
  | xargs -I '{}' ./parse -quiet -use_tags_query -json -json_include_path -file '{}' \
  | jq -r '.
    | select(.captures."definition.function" != null)
    | .file + ":" + .captures.name[0].text + .captures."codeium.parameters"[0].text'
# Output:
# examples/example.js:add(a, b)
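
For applications that consume the JSON directly rather than through jq, the same filter can be sketched in Python. This is a minimal sketch, not part of the tool: it assumes the output is a stream of JSON objects shaped the way the jq filter above reads them (a `file` string, plus a `captures` object mapping capture names to lists of entries with a `text` field). The `SAMPLE` data is hypothetical and only mirrors that shape.

```python
import json


def iter_json_values(text):
    """Yield each JSON value from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        value, pos = decoder.raw_decode(text, pos)
        yield value


def function_signatures(text):
    """Return "file:name(params)" for every match with a definition.function capture."""
    signatures = []
    for match in iter_json_values(text):
        captures = match.get("captures", {})
        if captures.get("definition.function") is None:
            continue  # not a function definition, same as the jq select()
        name = captures["name"][0]["text"]
        params = captures["codeium.parameters"][0]["text"]
        signatures.append(f'{match["file"]}:{name}{params}')
    return signatures


# Hypothetical sample input; field names are inferred from the jq filter above.
SAMPLE = "\n".join(
    json.dumps(m)
    for m in [
        {
            "file": "examples/example.js",
            "captures": {
                "definition.function": [{"text": "function add(a, b) {\n  return a + b;\n}"}],
                "name": [{"text": "add"}],
                "codeium.parameters": [{"text": "(a, b)"}],
                "doc": [{"text": "// Adds two numbers."}],
            },
        },
        {
            "file": "examples/example.js",
            "captures": {"reference.call": [{"text": "add(1, 2)"}], "name": [{"text": "add"}]},
        },
    ]
)

for line in function_signatures(SAMPLE):
    print(line)
# Output:
# examples/example.js:add(a, b)
```

In practice the `text` argument would be the captured standard output of `./parse`; the stream decoder accepts both one-object-per-line and pretty-printed output.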

Getting started

$ ./download_parse.sh
$ ./parse -file examples/example.js -named_only
program [0, 0] - [4, 0] "// Adds two numbers.\n…"
  comment [0, 0] - [0, 20] "// Adds two numbers."
  function_declaration [1, 0] - [3, 1] "function add(a, b) {\n…"
    name: identifier [1, 9] - [1, 12] "add"
    parameters: formal_parameters [1, 12] - [1, 18] "(a, b)"
      identifier [1, 13] - [1, 14] "a"
      identifier [1, 16] - [1, 17] "b"
    body: statement_block [1, 19] - [3, 1] "{\n…"
      return_statement [2, 4] - [2, 17] "return a + b;"
        binary_expression [2, 11] - [2, 16] "a + b"
          left: identifier [2, 11] - [2, 12] "a"
          right: identifier [2, 15] - [2, 16] "b"
$ ./parse -file examples/example.js -use_tags_query -json | jq ".captures.doc[0].text"
"// Adds two numbers."

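The tree dump shown above prints one node per line: an optional field name, the node type, the start and end positions as `[row, column]`, and a quoted text preview, with nesting indicated by indentation. If you need that output in structured form and cannot use `-json`, a line parser along these lines may help. It is a sketch based only on the sample output above: the two-space indent per level and the `Node` record are assumptions, and it only handles node types made of word characters, as in the `-named_only` output.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Pattern inferred from the sample output, e.g.
#     name: identifier [1, 9] - [1, 12] "add"
LINE_RE = re.compile(
    r'^(?P<indent> *)'
    r'(?:(?P<field>\w+): )?'
    r'(?P<type>\w+) '
    r'\[(?P<srow>\d+), (?P<scol>\d+)\] - \[(?P<erow>\d+), (?P<ecol>\d+)\] '
    r'"(?P<preview>.*)"$'
)


class Node(NamedTuple):
    """Hypothetical record for one line of the tree dump."""
    depth: int
    field: Optional[str]
    type: str
    start: Tuple[int, int]
    end: Tuple[int, int]
    preview: str  # kept raw, including any escapes or truncation marker


def parse_tree_line(line):
    """Parse one line of the tree dump, or return None if it is not a node line."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return Node(
        depth=len(m["indent"]) // 2,  # assumes two spaces per nesting level
        field=m["field"],
        type=m["type"],
        start=(int(m["srow"]), int(m["scol"])),
        end=(int(m["erow"]), int(m["ecol"])),
        preview=m["preview"],
    )


node = parse_tree_line('    name: identifier [1, 9] - [1, 12] "add"')
print(node.depth, node.field, node.type, node.start, node.end, node.preview)
```
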
Support status

Queries

Queries try to follow the conventions established by tree-sitter.

Most captures also include documentation as @doc; @definition.function and @definition.method additionally capture @codeium.parameters.

Top-level capture Python TypeScript JavaScript Go Java C++ PHP Ruby C#
@definition.class
@definition.function ✓[^3] N/A N/A N/A
@definition.method ✓[^1] ✓[^3] ✓[^1]
@definition.constructor N/A
@definition.interface N/A N/A N/A
@definition.namespace N/A N/A N/A N/A N/A
@definition.module N/A N/A N/A N/A N/A N/A
@definition.type N/A N/A N/A N/A N/A
@definition.constant
@definition.enum N/A
@definition.import N/A
@definition.include N/A N/A N/A N/A N/A N/A N/A
@definition.package N/A N/A N/A N/A N/A N/A N/A
@reference.call
@reference.class ✓[^2]

Language    Supported injections
Vue         JavaScript, TypeScript
HTML        JavaScript

[^1]: Currently functions and methods are not distinguished.
[^2]: Function calls and class instantiation are indistinguishable in Python.
[^3]: Function and method signatures are captured individually in TypeScript. Therefore, the @doc capture may not exist on all nodes.

Want to write a query for a new language? tags.scm and other queries in each language's tree-sitter repository, like tree-sitter-javascript, are a good place to start.

Query predicates

$ ./parse -supported_predicates
#eq?/#not-eq?
    (#eq? <@capture|"literal"> <@capture|"literal">)
    Checks if two values are equal.

#has-parent?/#not-has-parent?
    (#has-parent? @capture node_type...)
    Checks if @capture has a parent node of any of the given types.

#has-type?/#not-has-type?
    (#has-type? @capture node_type...)
    Checks if @capture has a node of any of the given types.

#lineage-from-name!
    (#lineage-from-name! "literal")
    If the name captures scopes, split by "literal" and retain the last element
    as the name. The other elements are appended to the lineage.

#match?/#not-match?
    (#match? @capture "regex")
    Checks if the text for @capture matches the given regular expression.

#select-adjacent!
    (#select-adjacent! @capture @anchor)
    Selects @capture nodes contiguous with @anchor (all starting and ending on
    adjacent lines).

#set!
    (#set! key <@capture|"literal">)
    Store metadata as a side effect of a match.

#strip!
    (#strip! @capture "regex")
    Removes all matching text from all @capture nodes.

Need a predicate which hasn't been implemented? File an issue! We try to use predicates from nvim-treesitter.
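
To make the two text-transforming predicates above more concrete, here is a rough Python model of what their descriptions say. It is an illustration only, not the tool's implementation: the function names are hypothetical, Python's `re` module stands in for whatever regex dialect the binary uses, and the lineage is modeled as a plain list of strings.

```python
import re


def strip_capture(text, regex):
    """Model of #strip!: remove all text matching regex from a capture's text."""
    return re.sub(regex, "", text)


def lineage_from_name(name, separator, lineage=()):
    """Model of #lineage-from-name!.

    Split a scoped name by the separator, keep the last element as the name,
    and append the other elements to the lineage.
    """
    parts = name.split(separator)
    return parts[-1], list(lineage) + parts[:-1]


# Strip a leading comment marker from a doc capture.
print(strip_capture("// Adds two numbers.", r"^//\s*"))
# Output: Adds two numbers.

# A scoped name contributes its qualifiers to the lineage.
print(lineage_from_name("Foo::Bar::baz", "::"))
# Output: ('baz', ['Foo', 'Bar'])
```

A name without the separator is returned unchanged, along with the lineage it was given.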

Grammars

$ ./parse -supported_languages
ada
c
cpp
csharp
css
dart
go
hcl
html
java
javascript
json
julia
kotlin
latex
markdown
ocaml
ocaml_interface
perl
php
protobuf
python
ruby
rust
shell
svelte
swift
toml
tree_sitter_query
tsx
typescript
vue
yaml

Looking for support for another language? File an issue with a link to the repo that contains the grammar.

Contributing

Pull requests are welcome. For non-issue discussions about codeium-parse, join our Discord.

Adding and testing queries