codefuse-ai / CodeFuse-Query

Query-Based Code Analysis Engine
Apache License 2.0

CodeFuse-Query: A Data-Centric Static Code Analysis System

[中文](README_cn.md) | **English**

What is CodeFuse-Query?

In large-scale software development, the demand for dynamic and multifaceted static code analysis exceeds the capabilities of traditional tools. To bridge this gap, we present CodeFuse-Query, a system that redefines static code analysis by combining Domain Optimized System Design with Logic Oriented Computation Design. CodeFuse-Query treats code analysis as a data computation task and supports scanning over 10 billion lines of code daily across more than 300 different tasks. Its domain-optimized design optimizes resource utilization, prioritizes data reusability, applies incremental code extraction, and introduces task types designed specifically for code changes. Its logic-oriented design builds on Datalog: a two-tiered schema, COREF, converts source code into data facts, and a dedicated language, Gödel, lets users formulate complex analysis tasks as logical expressions, drawing on Datalog's declarative style.

Overall, the CodeFuse-Query platform is divided into three main parts: code data model, code query DSL, and platform productization services.

Code Data Model: COREF

We have defined a standardized code data model, COREF. All code must be converted to this model by language-specific extractors. COREF mainly contains the following information:

COREF = AST (Abstract Syntax Tree) + ASG (Abstract Semantic Graph) + CFG (Control Flow Graph) + PDG (Program Dependency Graph) + Call Graph + Class Hierarchy + Documentation (documentation and comments)

Note: Because each type of information differs in how hard it is to compute, not every language's COREF includes all of the above. The basic information consists mainly of AST, ASG, Call Graph, Class Hierarchy, and Documentation. The remaining information (CFG and PDG) is still under construction and will be supported gradually.
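To make "converting code into data" concrete, here is a minimal sketch in Python that uses the standard `ast` module to turn a small source snippet into two flat fact tables: function definitions (with their documentation) and caller/callee edges. The table layout, the `extract_facts` helper, and the sample source are assumptions made for illustration only; they are not the actual COREF schema or a CodeFuse-Query extractor.

```python
import ast

# Hypothetical input program; any Python source string would do.
SOURCE = '''
def helper(x):
    return x + 1

def main():
    """Entry point."""
    return helper(41)
'''

def extract_facts(source: str):
    """Turn Python source into two illustrative fact tables.

    functions: rows of (name, line, docstring)   -- AST + documentation facts
    calls:     rows of (caller, callee)          -- a very rough call graph
    """
    tree = ast.parse(source)
    functions = []
    calls = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            functions.append((node.name, node.lineno, ast.get_docstring(node)))
            # Record direct calls to plain names made inside this function.
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    calls.append((node.name, inner.func.id))
    return functions, calls

functions, calls = extract_facts(SOURCE)
for row in functions:
    print("function", row)
for row in calls:
    print("call", row)
```

Once code is in this tabular form, an analysis becomes a query over rows rather than a traversal written against a compiler's internal data structures.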

Code Query DSL

CodeFuse-Query queries the generated COREF data with a custom DSL called Gödel. Gödel is a logic language based on Datalog, which derives new facts from existing "facts" and "rules". Gödel is also declarative: compared to imperative programming, it focuses on describing "what is needed" and leaves the implementation to the computation engine.

Since the code has already been transformed into relational data (COREF data is stored as relational tables), one might wonder why users should learn a new DSL instead of using SQL directly or an SDK. The reason is that Datalog evaluation is monotonic and guaranteed to terminate. Datalog gives up some expressive power to obtain these properties, and Gödel inherits this trade-off.
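The "facts plus rules" idea can be sketched in a few lines of Python. The example below is not Gödel syntax and not the CodeFuse-Query engine; it is a naive fixpoint evaluation of two Datalog-style rules over a hypothetical `calls` relation, written out only to show why evaluation is monotonic (facts are only ever added) and why it terminates (the set of possible facts is finite).

```python
def transitive_calls(calls):
    """Naive evaluation of two Datalog-style rules:

        reaches(a, b) :- calls(a, b).
        reaches(a, c) :- reaches(a, b), calls(b, c).

    `calls` is a set of (caller, callee) facts. The loop only ever adds
    facts and stops when a round derives nothing new.
    """
    reaches = set(calls)  # first rule: every direct call is reachable
    while True:
        # second rule: join reaches(a, b) with calls(b, c)
        derived = {(a, c) for (a, b) in reaches for (b2, c) in calls if b == b2}
        new = derived - reaches
        if not new:
            return reaches  # fixpoint reached
        reaches |= new

# Hypothetical call-graph facts.
calls = {("main", "parse"), ("parse", "lex"), ("lex", "read_char")}
result = transitive_calls(calls)
print(sorted(result))

# Recursion in the input does not cause an infinite loop.
cyclic = transitive_calls({("a", "b"), ("b", "a")})
print(sorted(cyclic))
```

In a declarative language the user writes only the two rules in the docstring; the loop, the join strategy, and any optimizations belong to the engine.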

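| Language | Status | COREF Model Node Count |
| --- | --- | --- |
| Java | Mature | 162 |
| XML | Mature | 12 |
| TS/JS | Mature | 392 |
| Go | Mature | 40 |
| OC/C++ | Beta | 53/397 |
| Python3 | Beta | 93 |
| Swift | Beta | 248 |
| SQL | Beta | 750 |
| Properties | Beta | 9 |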

Note: The maturity level in the Status column is determined by the types of information contained in COREF and by the actual implementation. Except for OC/C++, all languages support complete AST information and Documentation. COREF for Java also supports ASG, Call Graph, Class Hierarchy, and some CFG information.

Quick Start

Installation, Configuration, and Running

Documentation

Tutorial

Directory Structure Description

Some Notes on the Scope of Open Source

As of now, an executable program cannot be built from the source code, because not all modules are open source in this release. The missing modules will be released over the next year. To ensure a complete experience in the meantime, we have published complete installation packages for download; please see the Release page. The table below shows the open-source status of each language:

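| Language | Data Modeling Open Source | Data Core Open Source | Maturity |
| --- | --- | --- | --- |
| Python | Y | Y | RELEASE |
| Java | Y | Y | RELEASE |
| JavaScript | Y | Y | RELEASE |
| Go | Y | Y | RELEASE |
| XML | Y | Y | RELEASE |
| Cfamily | Y | Y | BETA |
| SQL | Y | Y | BETA |
| Swift | N | N | BETA |
| Properties | Y | Y | BETA |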

Contact Us

WeChat User Group Image
