Open sternj opened 1 month ago
This was a deliberate design decision, because I think the cost of writing less elegant code (e.g. inserting `deepcopy(...)`) by default is larger than the benefit of supporting the relatively rare and discouraged functions which mutate their arguments yet still make sense to compare by return value.

That said, I'd still be interested in adding an argument called e.g. `copy_args: bool = False` to the `ghostwriter.equivalent()` function.
Instead of exposing this through the CLI, we can try some heuristics for when to turn it on - e.g. do the usual `ast.parse(inspect.getsource(fn))` trick and then look for `argname.method(...)` calls which are not part of an assignment or call, or `argname[...] = ...`. We can probably get the false-alarm rate low enough to default `copy_args=None` and include this logic inside the function rather than in the CLI... something like the sketch below.
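A minimal sketch of that heuristic (the helper name and exact checks here are illustrative only, not an existing API; a real version would need to handle nested functions, aliasing, and so on):

```python
import ast
import inspect
import textwrap


def seems_to_mutate_args(fn) -> bool:
    """Guess whether ``fn`` mutates one of its parameters. Sketch only."""
    try:
        tree = ast.parse(textwrap.dedent(inspect.getsource(fn)))
    except (OSError, TypeError, SyntaxError):
        return False  # no source available; assume no mutation
    funcdef = tree.body[0]
    params = {a.arg for a in funcdef.args.args + funcdef.args.kwonlyargs}
    for node in ast.walk(funcdef):
        # `argname[...] = ...`, `argname.attr = ...`, and augmented forms
        if isinstance(node, (ast.Assign, ast.AugAssign)):
            targets = node.targets if isinstance(node, ast.Assign) else [node.target]
            for target in targets:
                if (
                    isinstance(target, (ast.Subscript, ast.Attribute))
                    and isinstance(target.value, ast.Name)
                    and target.value.id in params
                ):
                    return True
        # bare `argname.method(...)` calls whose result is discarded
        elif (
            isinstance(node, ast.Expr)
            and isinstance(node.value, ast.Call)
            and isinstance(node.value.func, ast.Attribute)
            and isinstance(node.value.func.value, ast.Name)
            and node.value.func.value.id in params
        ):
            return True
    return False
```

With `copy_args=None` meaning "auto", `equivalent()` could call something like this and only insert the deep-copies when it returns `True`.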
If you're interested in making a PR, that would be awesome! Err on the side of posting unfinished work and asking for help and feedback when that'd be useful; I and other maintainers are happy to help (to the extent that our free time lines up at least) 🙂
This seems like a principled way of doing things. I would guess that, while comparing functions that re-return one of their parameters is rare, the case where a parameter is mutated inside a function is more common than you'd expect. A version of one of those functions might, for instance, do something like:
```python
from typing import List


def first_function(buf: List[int], n: int) -> int:
    accum = 0
    for i in range(len(buf)):
        accum += n
        buf[i] = accum  # mutates the caller's list in place
    return sum(buf)
```
It is admittedly a tad contrived, but I've seen patterns like this before. Even ignoring the fact that the return values are objects here, I suspect that mutation is a more widespread practice than we'd hope.
I think the idea of looking for a parameter on the LHS of an `ast.Assign` or anywhere in an `ast.Call` is the best heuristic (perhaps only flagging an `ast.Assign` where the LHS is an `ast.Subscript` or an `ast.Attribute`, also including `ast.AugAssign` and all of the other fun stuff).
I'll open a PR shortly, when I have the chance to write it out!
So, equivalence test generation at the moment doesn't really account for side effects on the parameters.
For instance, consider `example.py` (a sketch of its shape is below). I run `hypothesis write --style pytest --equivalent example.first_function example.second_function`, which gives me a test comparing the two return values. When I run this test, it passes, even though the two functions are clearly different. This is because the same `List` instance, `a`, is passed into both functions and mutated inside of them.

I've been thinking about this problem for a while, and I don't think there's necessarily a good solution, but I think that people will want to write property-based tests for functions with side effects often enough that it's worth thinking about. I also know that these tests are just meant to be a skeleton of what good tests could look like, but I think the frequency of this sort of situation warrants some automation.
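To make the failure mode concrete, here is a hypothetical `example.py` of the shape described (illustrative, not the original file): both functions mutate the caller's list in place and then return it.

```python
# example.py (hypothetical reconstruction of the shape described above)
from typing import List


def first_function(a: List[int]) -> List[int]:
    a.sort()  # sorts the caller's list in place, ascending
    return a


def second_function(a: List[int]) -> List[int]:
    a.sort(reverse=True)  # clearly different behaviour, same object returned
    return a
```

The ghostwritten test is then roughly of this shape (paraphrased, not Ghostwriter's verbatim output):

```python
from hypothesis import given, strategies as st

import example


@given(a=st.lists(st.integers()))
def test_equivalent_first_function_second_function(a):
    result_first_function = example.first_function(a=a)
    result_second_function = example.second_function(a=a)
    assert result_first_function == result_second_function
```

Both calls receive, mutate, and return the very same list object, so the final assertion compares that object with itself and always passes.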
Some thoughts on how to deal with this
First, I think it's a reasonable assumption that anything that one would need to compare would be a return value of the function. This is basically (I think) what Ghostwriter assumes in writing the tests.
Second, I'd lean towards assuming that no two different parameters to a given function reference the exact same underlying object. I can see this happening in a pretty contrived case, but I can't come up with a compelling reason why someone would actually do that.
The most obvious answer I can come up with is just generating a set of calls to `copy.deepcopy` for each parameter, and passing the originals and copies into the two separate compared functions. (I know you can compare more than two, but I think this approach works when repeated `n` times as well.)

Any thoughts on this? I'm definitely inclined to try and write up a proof of concept, but if I'm missing something obvious I'd definitely like to know!
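Concretely, the generated test body could deep-copy each argument before each call; with the hypothetical `example.py` above, that might look like:

```python
import copy

from hypothesis import given, strategies as st

import example


@given(a=st.lists(st.integers()))
def test_equivalent_first_function_second_function(a):
    # Each function gets its own deep copy, so in-place mutation in one
    # call can no longer leak into the other.
    result_first_function = example.first_function(a=copy.deepcopy(a))
    result_second_function = example.second_function(a=copy.deepcopy(a))
    assert result_first_function == result_second_function
```

With the sorting pair above, this version fails on e.g. `a=[0, 1]`, exactly as it should.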