Closed esohel30 closed 2 months ago
Cool, this is a good start.
A few comments:
str
and the return value of most functions is also str
generate_complex_expression
is not truly recursive. For instance, generate_if_statement
uses generate_assignment
which itself uses generate_variable_expression
. It will never generate a if
condition that contains a function call, for instance. Do you have ideas on how to improve that? I can help if needed.indent
, just use textwrap.indent
from the standard librarysink(source())
(where source
and sink
are one of the 6 functions). Pysa will likely find the flow 99.99% of the time since the whole code before that doesn't really matter. What we need is the source to actually flow through the randomly generated code. Then, the hard part is to either: only generate code where the flow is valid, or somehow have a way to execute the code to verify if the flow is valid (to see if Pysa agrees with it).Enhanced Pysa fuzzer by adding type annotations, ensuring variables are defined before use, making expression generation truly recursive, and using textwrap.indent for better code formatting. Added a defined_variables set to track declared variables, improving code validity. Despite these improvements, the fuzzer is still far from perfect and requires further refinement to enhance code generation diversity and flow validation. Will continue to work on it extensively! Might even approach it in a different manner now that I have been playing around with a it for a bit.
Cool, another round of feedback:
sink_var
is always the result of calling get_next_variable
called after generating the expression, so that variable is never assigned. To fix that, you need the whole code generation to know the source_var
and sink_var
and have an assignment at some point.x = x
since to_var
is created before the right hand side expression is generatedget_next_variable
runs out of variable? We should probably use variable names such as a
, .., z
, aa
, .., az
, etc.defined_variables
and current_var_index
)? This is a problem if we want to generate several tests by calling generate_source_to_sink
multiple times. A solution would be to use a class with attributes. All the functions to generate expressions could be methods of that class.generate_arithmetic_expression
is never called, and generate_assignment
only uses generate_variable_expression
. This will never generate x = foo()
or x = y + z + 1
. Here is an idea: you should first differentiate between expressions (return a value) and statements. The current generate_complex_expression
generates statements, not expressions. Then, the actual generate_expression
function should be recursive and accept a maximum depth. Pick a number, then if it's 0: generate a constant, if it's 1, generate an addition, by calling generate_expression
recursively, if it's 2, generate a substraction, etc. if it's some number N, generate a function call, etc.I believe @alexkassil also has a different idea to generate code that always has a valid flow, feel free to ask him if you are interested.
My feedback:
'a' + 1
is not valid python code.generate_xxx
(e.g, generate_while_loop
, generate_for_loop
, etc.) should be recursive so it can have nested statemenets. Right now we only generate while loops that have {curr_var} += {prev_var}\ncounter += 1
. It would be more interesting to have nested while loops, while loops with if conditions, etc.Hey @esohel30 , so the idea for this project is to automatically find false negatives - ie flows that should be security issues, but for whatever reason pysa doesn't find it.
The way to do this is to generate increasingly complex valid flows for pysa to find. Everything generated should be a valid security issue.
https://github.com/facebook/pyre-check/tree/main/source/interprocedural_analyses/taint/test/integration take a look at the tests here (in the .py files).
Here's an example: https://github.com/facebook/pyre-check/blob/main/source/interprocedural_analyses/taint/test/integration/source_sink_flow.py
from builtins import _test_sink, _test_source
def bar():
return _test_source()
def qux(arg):
_test_sink(arg)
def bad(ok, arg):
qux(arg)
def some_source():
return bar()
def match_flows():
x = some_source()
bad(5, x)
One issue in this file is the flow in match_flows() -> bad()
One way to always generate valid issues is to start with _test_sink(_test_source())
, and then pick operations that make it so the flow is one more hop away.
For example, let's say you add 3 functionalities of mutations to the fuzzer:
And now you randomly pick from those 3 elements 4 times to get [1, 2, 3, 2].
Applying those mutations step by step gets you:
f1():
x = test_source()
_test_sink(x)
def f2():
x = test_source()
f2(x)
def f1(x):
_test_sink(x)
def f2(cond):
x = test_source()
if cond:
pass
else:
f2(x)
def f1(x):
_test_sink(x)
def f3(x, cond):
x = test_source()
f2(x, cond)
def f2(x, cond):
if cond:
pass
else:
f2(x)
def f1(x):
_test_sink(x)
Continuing adding more and more single hop transformations that preserve the flow will make the fuzzer be able to generate all the valid flows present in https://github.com/facebook/pyre-check/tree/main/source/interprocedural_analyses/taint/test/integration - I think for simplicity do not worry about any modelling other than _test_source
and _test_sink
@arthaud has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@arthaud merged this pull request in facebook/pyre-check@99a07a24e31679d69aab008e66da27fe6077ba4d.
Congrats and well done @esohel30 !
Thanks for the hard work! We have finally merged this and found a few false negatives (see upcommit commits).
Draft PR for the pysa fuzzer project 🚀