STAMP-project / dspot

Automatically detect and generate missing assertions for JUnit test cases (also known as test amplification)
https://dspot-demo.stamp-project.eu/
GNU Lesser General Public License v3.0

add support for minimization of amplified tests #54

Open monperrus opened 7 years ago

monperrus commented 7 years ago

Motivation: During amplification, some neutral test evolution happens. This results in very long and unreadable tests, and many of the changes in the amplified test are not required. The goal of minimization is to reduce the size and increase the readability of amplified test cases.

What: Implement a minimization algorithm (such as delta-debugging) to remove useless statements in amplified test cases.

Hints: For instance, useless statements include local variables that are set and never modified, such as Object myObject = null; such a variable should be inlined. For tests that expect an exception, every statement after the one that throws it can be removed.
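
For illustration, here is a hypothetical before/after (Parser, parse and reset are made-up names, not taken from any real project):

```java
import static org.junit.Assert.assertNotNull;

import org.junit.Test;

// "Parser" is a hypothetical class under test, used only to illustrate the two hints above.
public class MinimizationExampleTest {

    // Amplified test before minimization: a local that is set and never modified,
    // and statements after the call that throws the expected exception.
    @Test(expected = IllegalArgumentException.class)
    public void parseAmplified() {
        Object myObject = null;            // set and never modified -> inline / drop
        String input = "not-a-number";     // used once -> can be inlined
        Parser parser = new Parser();
        parser.parse(input);               // throws the expected exception here
        parser.reset();                    // never reached -> remove
        assertNotNull(parser);             // never reached -> remove
    }

    // The same test after minimization.
    @Test(expected = IllegalArgumentException.class)
    public void parseMinimized() {
        new Parser().parse("not-a-number");
    }
}
```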

monperrus commented 6 years ago

initial attempt in #154

sbihel commented 6 years ago

I think it would make sense to remove all amplifications that have no impact on the increase of mutation score.

Simple instrumentation could be used to detect useless generated assertions.

As for input amplification, I think we have to define a limit:

Because if we apply general unit-test or even source-code minimisation, it might become harder for the developer to identify the original test. And they can apply general-purpose minimisation on their own anyway.

monperrus commented 6 years ago

See also the idea of "delta debugging" to minimize.

danglotb commented 6 years ago

> I think it would make sense to remove all amplifications that have no impact on the increase of mutation score.

Yes, that's the idea.

> See also the idea of "delta debugging" to minimize.

The major drawback of this approach is the time consumption: it requires a lot of PIT executions, and therefore a lot of time.

> Simple instrumentation could be used to detect useless generated assertions.

What do you suggest?

In addition to this, we introduce comments in amplified tests, and I think they create a lot of noise. Maybe we could remove them first when we aim at presenting amplified tests to developers.

Do you think that this minimization should be done automatically and enabled by default, or should we provide it as an "external service tool" of DSpot?

monperrus commented 6 years ago

> Do you think that this minimization should be done automatically and enabled by default?

Yes, I think so, in order to maximize the prettiness of the generated tests, so that people like them, also by their look'n'feel. (In DSpot, we generate tests for humans, not for machines.)

sbihel commented 6 years ago

> Simple instrumentation could be used to detect useless generated assertions.

> What do you suggest?

I was thinking of adding a call to a counter after each added assertion. The test would be executed on the newly detected mutants, and if an assertion never lowers the counter, that means it never fails and is thus useless.
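
For instance, something along these lines (AssertionCounter is a made-up helper, and this sketch counts failures rather than decrementing a counter):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Made-up instrumentation helper: each generated assertion is wrapped in check(...),
// which records a failure for its id before rethrowing. After running the amplified
// test on the new mutants, any assertion id with no recorded failure never kills a
// mutant and can be removed.
public final class AssertionCounter {

    private static final Map<String, AtomicInteger> FAILURES = new ConcurrentHashMap<>();

    private AssertionCounter() {}

    public static void check(String assertionId, Runnable assertion) {
        try {
            assertion.run();
        } catch (AssertionError e) {
            FAILURES.computeIfAbsent(assertionId, id -> new AtomicInteger()).incrementAndGet();
            throw e;
        }
    }

    public static boolean neverFailed(String assertionId) {
        return !FAILURES.containsKey(assertionId);
    }
}

// Instrumented call site in an amplified test (illustrative):
//   AssertionCounter.check("testFoo#assert3", () -> assertEquals(42, result));
```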


> In addition to this, we introduce comments in amplified tests, and I think they create a lot of noise. Maybe we could remove them first when we aim at presenting amplified tests to developers.

If comments were removed, we (DSpot or the developer) would have to rely on a diff to identify the amplifications, right? Would that be a good solution? By that I mean, does the pretty printer of Spoon generate source code with the same style as the test given as input?


> Do you think that this minimization should be done automatically and enabled by default?

> Yes, I think so, in order to maximize the prettiness of the generated tests, so that people like them, also by their look'n'feel. (In DSpot, we generate tests for humans, not for machines.)

It would also make it easier to interact with the main amplification process, and to have a more powerful interface.

danglotb commented 6 years ago

> I was thinking of adding a call to a counter after each added assertion. The test would be executed on the newly detected mutants, and if an assertion never lowers the counter, that means it never fails and is thus useless.

The problem is that we execute the mutation analysis through Maven goals. So it runs in a new JVM, and we would need serialization to obtain information about the runs, which is kind of tricky, right?

> By that I mean, does the pretty printer of Spoon generate source code with the same style as the test given as input?

I think you can rely on Spoon's pretty printer.

> It would also make it easier to interact with the main amplification process, and to have a more powerful interface.

We only need to minimize tests that have been selected.

On the one hand, if there is a selection, it means that the minimization is tied to the selection, right?

On the other hand, some minimization can be done regardless of any test criterion, such as the inlining of local variables.

I set up some classes and a test about that: #338. I'm going to implement at least this general minimization, using static analysis of the program.

WDYT?

sbihel commented 6 years ago

> The problem is that we execute the mutation analysis through Maven goals. So it runs in a new JVM, and we would need serialization to obtain information about the runs, which is kind of tricky, right?

What if each test wrote a report in a file?
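
For instance, something as simple as this could be enough (all names below are only illustrative, not existing DSpot classes): the instrumented tests append one line per assertion outcome, and the DSpot process parses the file once the Maven goal has finished.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Minimal cross-JVM "report": the test JVM appends one line per assertion outcome,
// the DSpot JVM reads the file back after the Maven goal has finished.
public final class AssertionReport {

    private static final Path REPORT = Paths.get("target", "dspot", "assertion-report.txt");

    private AssertionReport() {}

    public static synchronized void record(String assertionId, boolean failed) {
        try {
            Files.createDirectories(REPORT.getParent());
            Files.write(REPORT,
                    (assertionId + ";" + failed + System.lineSeparator()).getBytes(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```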


> On the other hand, some minimization can be done regardless of any test criterion, such as the inlining of local variables.

Yes, but what I don't really like is that it will modify the original test. What if the author of the test thought it was clearer to use a variable?

danglotb commented 6 years ago

> What if each test wrote a report in a file?

It would be the same as serialization / deserialization. I see some issues here.

During the mutation analysis: in the case where an assertion never fails (I am not sure this happens, but whatever), we can remove it. In the case where an assertion fails, there are two cases:

  1. it detects a mutant that is already detected by the original test suite;
  2. it detects a new mutant.

In addition to this, we have another dimension: what do we do with the amplified test?

  1. The amplified test is an improved version of an existing test; in this case, assertions of kind 1 (those that detect already-detected mutants) should be kept, since the amplified test is meant to replace the original test.
  2. The amplified test has new semantics, derived from an existing test; in this case, assertions of kind 1 should be removed, since we will keep both the original test and the new test.

I'll think about it.

> Yes, but what I don't really like is that it will modify the original test. What if the author of the test thought it was clearer to use a variable?

You have a point here. Maybe we should only minimize what DSpot added. We could rely on the naming convention for local variables: DSpot names them something like __DSPOT_XX. We could also only inline local variables initialized with literals.
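
A rough sketch of that conservative rule with Spoon could look like this (the traversal below is only an illustration, not the actual DSpot implementation):

```java
import spoon.reflect.code.CtLiteral;
import spoon.reflect.code.CtLocalVariable;
import spoon.reflect.code.CtVariableRead;
import spoon.reflect.declaration.CtMethod;
import spoon.reflect.visitor.filter.TypeFilter;

public class DspotLiteralInliner {

    // Inline only locals introduced by DSpot (name prefix __DSPOT_) whose initializer
    // is a literal, then drop their declarations. Variables written by the original
    // test author are left untouched.
    public void inlineDspotLiterals(CtMethod<?> amplifiedTest) {
        for (CtLocalVariable<?> local :
                amplifiedTest.getElements(new TypeFilter<>(CtLocalVariable.class))) {
            if (!local.getSimpleName().startsWith("__DSPOT_")
                    || !(local.getDefaultExpression() instanceof CtLiteral)) {
                continue;
            }
            // Replace every read of the variable by a clone of the literal.
            for (CtVariableRead<?> read :
                    amplifiedTest.getElements(new TypeFilter<>(CtVariableRead.class))) {
                if (read.getVariable().getSimpleName().equals(local.getSimpleName())) {
                    read.replace(local.getDefaultExpression().clone());
                }
            }
            // The declaration is now dead and can be removed.
            local.delete();
        }
    }
}
```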

In any case, we won't be able to satisfy everybody, and need to make choices.

sbihel commented 6 years ago

> In addition to this, we have another dimension: what do we do with the amplified test?
>
> 1. The amplified test is an improved version of an existing test; in this case, assertions of kind 1 (those that detect already-detected mutants) should be kept, since the amplified test is meant to replace the original test.
> 2. The amplified test has new semantics, derived from an existing test; in this case, assertions of kind 1 should be removed, since we will keep both the original test and the new test.

I agree. In the second case, would we still want new mutants to be located in the same method?

sbihel commented 6 years ago

> It would be the same as serialization / deserialization. I see some issues here.

Would https://github.com/INRIA/spoon/pull/1874 be useful?

danglotb commented 6 years ago

Hi @sbihel

Would you mind having a look at #354?

I propose a minimizer for the ChangeDetectorSelector.

The ChangeDetectorSelector runs amplified tests against a modified version of the same program and keeps only the amplified tests that fail.

The goal is to have amplified tests that encode a change, e.g. a new feature or a regression bug.

My idea is to perform a delta-diff on the assertions, i.e. remove assertions one by one and check whether the amplified test still fails.
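
Roughly something like the following (TestRunner is a hypothetical helper that compiles and runs one amplified test against the changed version; it is not an existing DSpot API):

```java
import java.util.List;
import java.util.stream.Collectors;

import spoon.reflect.code.CtInvocation;
import spoon.reflect.declaration.CtMethod;
import spoon.reflect.visitor.filter.TypeFilter;

public class AssertionDeltaMinimizer {

    // Hypothetical helper: compiles and runs the given test against the changed
    // version of the program, returns true if the test still fails.
    public interface TestRunner {
        boolean stillFails(CtMethod<?> testMethod);
    }

    // Greedy delta-diff: try to drop each assertion; keep a removal only if the
    // amplified test still fails on the changed version.
    public CtMethod<?> minimize(CtMethod<?> amplifiedTest, TestRunner runner) {
        CtMethod<?> current = amplifiedTest.clone();
        boolean shrunk = true;
        while (shrunk) {
            shrunk = false;
            int nbAssertions = assertionsOf(current).size();
            for (int i = 0; i < nbAssertions; i++) {
                CtMethod<?> candidate = current.clone();
                // Same traversal order on the clone, so index i points to the same assertion.
                assertionsOf(candidate).get(i).delete();
                if (runner.stillFails(candidate)) {
                    current = candidate;
                    shrunk = true;
                    break; // restart on the smaller test
                }
            }
        }
        return current;
    }

    private List<CtInvocation<?>> assertionsOf(CtMethod<?> test) {
        return test.getElements(new TypeFilter<CtInvocation<?>>(CtInvocation.class)).stream()
                .filter(inv -> inv.getExecutable().getSimpleName().startsWith("assert"))
                .collect(Collectors.toList());
    }
}
```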

WDYT?

sbihel commented 6 years ago

Hi @danglotb,

Wouldn't we need a list of input programs to have all mutants detected by the test case?

Thanks for your efforts 👍

danglotb commented 6 years ago

As I said, some minimizations are related to the test criterion used.

For instance, if I use the mutation score as a test criterion, the minimization must keep the mutation score obtained after the amplification.

Here, I am talking about another test criterion: encoding a behavioral change.

The point of this selector is that we obtain amplified tests that pass on a given version and fail on the other one. Such amplified tests encode the behavioral changes, desired or undesired.

On the one hand, when I say desired, it means that the developer wanted the behavior of the program to change, i.e. to add a new feature or fix something. On the other hand, when I say undesired, it might be a regression bug: something that was working before but does not work anymore on the changed version. It means that the amplification is able to capture something that was not captured before.

In both cases, we win, because we can enhance the test suite.

Back to the minimization for such a test criterion: do you think that we should only keep assertions that make the amplified test fail? If yes, should the failure be the same?

sbihel commented 6 years ago

If a behavioural change is detected, that means we keep both versions in the test suite, and thus we can apply general minimisation on the amplified version, using the improved criterion for the combined tests.

I was thinking that a generated assertion could be a duplicate of an existing one; in that case the new assertion would falsely appear useful. But if we focus on amplified assertions, the delta-diff would detect them.

And I think we should only keep amplified assertions that make the test fail, because it enforces clarity in the generated test. If we wanted to keep the exact same failures as before, would it not greatly reduce the range of acceptable amplifications?

monperrus commented 6 years ago

there are two kinds of minimization

monperrus commented 6 years ago

See also: Arash Vahabzadeh, Andrea Stocco, Ali Mesbah. Fine-grained test minimization. ICSE 2018. https://dblp.org/rec/conf/icse/VahabzadehS018

monperrus commented 5 years ago

RW: