for-just-we / CppCodeAnalyzer

A Python-based tool that parses C/C++ code into a code property graph

About CppCodeAnalyzer

CppCodeAnalyzer is a Python-based parsing tool that constructs a code property graph for C/C++ code. It is the Python version of CppCodeAnalyzerJava. Most of its functionality is similar to Joern; the differences are:

Graph 1

graph LR
EmptyCondition --> A[auto p: vec]
A --> B[xxx]
B --> EmptyCondition
EmptyCondition --> Exit

The pipeline of CppCodeAnalyzer is similar to Joern's and can be illustrated as:

graph LR
AntlrAST  --Transform --> AST -- control flow analysis --> CFG 
CFG -- dominate analysis --> CDG
CFG -- symbol def use analysis --> UDG
UDG -- data dependence analysis --> DDG

For more details, refer to "Joern tool workflow analysis" (Joern工具工作流程分析).
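The stage order in the pipeline diagram can also be written down as data, which makes it easy to see which intermediate graphs must exist before a given one can be built. This is only a sketch of the diagram: the `PIPELINE` table and `stages_for` helper are hypothetical names and not part of the CppCodeAnalyzer API.

```python
from typing import Dict, List, Tuple

# Each artifact maps to (input artifact, analysis that produces it).
# Labels are copied from the diagram; none of this is CppCodeAnalyzer code.
PIPELINE: Dict[str, Tuple[str, str]] = {
    "AST": ("AntlrAST", "Transform"),
    "CFG": ("AST", "control flow analysis"),
    "CDG": ("CFG", "dominate analysis"),
    "UDG": ("CFG", "symbol def use analysis"),
    "DDG": ("UDG", "data dependence analysis"),
}


def stages_for(target: str) -> List[str]:
    """Return the chain of artifacts built on the way to `target`."""
    chain = [target]
    # Walk backwards through the inputs until we reach the raw Antlr AST.
    while chain[-1] in PIPELINE:
        chain.append(PIPELINE[chain[-1]][0])
    return list(reversed(chain))


print(stages_for("DDG"))  # ['AntlrAST', 'AST', 'CFG', 'UDG', 'DDG']
```

Both the CDG and the UDG branch off the CFG, so control dependence and data dependence can be computed independently once the CFG exists.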

Usage

The test files in the directory test/mainToolTests illustrate how each module works; refer to those test cases to learn how to use the CppCodeAnalyzer API.

Environment:

Used as python package:

Our motivations

Challenges

s1: memset(source, 'A', 100);
s2: source[99] = '\0';
s3: memcpy(data, source, 100);

Also, our tool is much slower than Joern: parsing a file in the SARD dataset normally takes 20 to 30 seconds, so we recommend dumping the output CPG to JSON first if you need to train a model. The Java version, CppCodeAnalyzerJava, is much faster; use it if you prefer fast analysis.
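One way to follow the "dump to JSON first" advice is a small on-disk cache, so that each source file is parsed only once. The sketch below is an assumption-laden illustration: `load_or_parse` and its `parse_to_dict` argument are hypothetical, and `parse_to_dict` stands in for whatever code of yours runs the analyzer and turns the resulting CPG into JSON-serializable data.

```python
import json
from pathlib import Path
from typing import Callable


def load_or_parse(source: Path, cache_dir: Path,
                  parse_to_dict: Callable[[Path], dict]) -> dict:
    """Return the JSON form of a parsed file, parsing only on a cache miss.

    `parse_to_dict` is a hypothetical callable supplied by the caller; it is
    not a CppCodeAnalyzer function.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / (source.name + ".json")
    if cache_file.exists():
        # Cache hit: skip the slow parse entirely.
        with open(cache_file, "r", encoding="utf-8") as f:
            return json.load(f)
    cpg = parse_to_dict(source)  # the slow step
    with open(cache_file, "w", encoding="utf-8") as f:
        json.dump(cpg, f)
    return cpg
```

The cache key here is only the file name, so two source files with the same name in different directories would collide; a real setup would probably key on the relative path or a content hash.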

Configuration

calleeInfos.json stores the APIs that define or use variables of pointer type. You can load these callee infos with the json package and set ASTDefUseAnalyzer.calleeInfos according to your own preferences when analysing the use-def information of each code line.

Note that calleeInfos.json is important for computing data dependence: without it, you would lose the data dependence of pointer variables defined through API calls (such as memcpy). You can load it like this:

import json
from CppCodeAnalyzer.mainTool.CPG import initialCalleeInfos, CFGToUDGConverter, ASTDefUseAnalyzer

# Load the raw callee descriptions; the context manager closes the file.
with open("path to calleeInfos.json", 'r', encoding='utf-8') as f:
    calleeInfs = json.load(f)
calleeInfos = initialCalleeInfos(calleeInfs)

# Attach the callee infos to the analyzer, then the analyzer to the converter.
converter: CFGToUDGConverter = CFGToUDGConverter()
astAnalyzer: ASTDefUseAnalyzer = ASTDefUseAnalyzer()
astAnalyzer.calleeInfos = calleeInfos
converter.astAnalyzer = astAnalyzer

Remember to set both astAnalyzer.calleeInfos = calleeInfos and converter.astAnalyzer = astAnalyzer so that the callee infos are actually used.
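To see why callee information matters, consider the calls `memset(source, ...)` and `memcpy(data, source, ...)` from statements s1 and s3 in the Challenges example. The toy sketch below is not the tool's algorithm: it assumes a simplified, hypothetical callee table (`CALLEE_INFOS`, listing which argument positions a call defines or uses through a pointer; the real calleeInfos.json layout may differ) and links each use to the most recent definition in straight-line code. The array write in s2 is left out to keep the model small.

```python
# Hypothetical callee table: argument positions defined / used via pointer.
CALLEE_INFOS = {
    "memset": {"def": [0], "use": []},
    "memcpy": {"def": [0], "use": [1]},
}


def def_use(callee, args, callee_infos):
    """Return (defined symbols, used symbols) for one call statement."""
    info = callee_infos.get(callee, {"def": [], "use": []})
    defs = {args[i] for i in info["def"]}
    uses = {args[i] for i in info["use"]}
    return defs, uses


def data_dependences(stmts, callee_infos):
    """Link every use to the latest earlier definition (straight-line code)."""
    last_def = {}
    edges = []
    for label, callee, args in stmts:
        defs, uses = def_use(callee, args, callee_infos)
        for sym in sorted(uses):
            if sym in last_def:
                edges.append((last_def[sym], label, sym))
        for sym in defs:
            last_def[sym] = label
    return edges


# (label, callee, pointer arguments) for s1 and s3.
stmts = [
    ("s1", "memset", ["source"]),
    ("s3", "memcpy", ["data", "source"]),
]

print(data_dependences(stmts, CALLEE_INFOS))  # [('s1', 's3', 'source')]
print(data_dependences(stmts, {}))            # []
```

With the table, s3 depends on s1 through `source`; with an empty table the edge disappears, which is the loss of pointer data dependence described above.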

Extra Tools

The extraTools package contains preprocessing code for the vulnerability detectors IVDetect, SySeVR and DeepWuKong. For usage, refer to the files in test/extraToolTests.

References

Yamaguchi, F., Golde, N., Arp, D., & Rieck, K. (2014). Modeling and Discovering Vulnerabilities with Code Property Graphs. IEEE Symposium on Security and Privacy. IEEE.

Li, Y., Wang, S., & Nguyen, T. N. (2021). Vulnerability Detection with Fine-grained Interpretations.

SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities. IEEE Transactions on Dependable and Secure Computing, 2021, PP(99):1-1.

Cheng, X., Wang, H., Hua, J., et al. (2021). DeepWukong. ACM Transactions on Software Engineering and Methodology (TOSEM).