
Add software regression tests #149

Open alashworth opened 5 years ago

alashworth commented 5 years ago

Issue by syclik Saturday Jun 17, 2017 at 02:54 GMT Originally opened as https://github.com/stan-dev/stan/issues/2331


Summary:

We don't have software regression tests in place. We should introduce them and test them.

Description:

More accurately, we could use some characterization tests.

Here are a few things that we should test:

It's ok that things change over time. We just don't ever want the changes to happen unintentionally.

(There's no reason we can't characterize Stan's behavior before we agree on how to test correctness in light of stochastic algorithms. Once a random seed is set [with a particular set of hardware, environment, etc], it should run the same.)

Current Version:

v2.15.0

alashworth commented 5 years ago

Comment by seantalts Saturday Jun 17, 2017 at 15:00 GMT


I've worked with these kinds of tests before (we called them "golden tests" or something like that) and they were a huge pain until we developed automated tools to update the golden records.

alashworth commented 5 years ago

Comment by syclik Saturday Jun 17, 2017 at 15:16 GMT


Have any suggestions on how to update golden records?

I'd rather have difficult tests than break things inadvertently. This comes from having things broken multiple times over the history of our project by something that wasn't caught in unit tests.


alashworth commented 5 years ago

Comment by seantalts Saturday Jun 17, 2017 at 15:23 GMT


We had a folder full of the golden master records for API JSON output, pretty-printed and sorted so it was deterministic. Each test file had a corresponding output file, I think. When running a test, we'd record its output to a temporary file, check it against the golden record, and if there was a diff, fail the test and display the diff. We added a flag to the test binaries (ours could be in runTests.py instead) that would not do asserts or checks but would just replace the golden records with whatever the tests were currently outputting. This was kind of nice because these records were stored in git with the same code that produced them, so when you updated code you could see in the same commit how the output had changed, or if it hadn't.
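
As a rough illustration, here is a minimal sketch of that workflow in Python; the directory layout, file names, and the update flag are hypothetical and not part of Stan's actual runTests.py:

```python
import difflib
import shutil
import subprocess
from pathlib import Path

GOLD_DIR = Path("test/golden")  # golden records checked into git alongside the code
TMP_DIR = Path("test/tmp")      # scratch space for the current run's output


def run_golden_test(name, cmd, update=False):
    """Run cmd, capture stdout, and diff it against the stored golden record.

    With update=True, overwrite the golden record instead of checking it.
    Returns True when the output matches (or was just regenerated).
    """
    TMP_DIR.mkdir(parents=True, exist_ok=True)
    actual_path = TMP_DIR / (name + ".out")
    gold_path = GOLD_DIR / (name + ".gold")

    # A real harness would also normalize the output (pretty-print, sort keys)
    # so the comparison is deterministic.
    result = subprocess.run(cmd, capture_output=True, text=True)
    actual_path.write_text(result.stdout)

    if update:
        GOLD_DIR.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(actual_path, gold_path)  # regenerate the record
        return True

    diff = list(difflib.unified_diff(
        gold_path.read_text().splitlines(keepends=True),
        actual_path.read_text().splitlines(keepends=True),
        fromfile=str(gold_path), tofile=str(actual_path)))
    if diff:
        print("".join(diff))  # show the developer exactly what changed
        return False
    return True
```

The update path is what makes intentional changes cheap: rerun with the flag, review the resulting git diff of the golden files, and commit them with the code change.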

alashworth commented 5 years ago

Comment by syclik Saturday Jun 17, 2017 at 16:30 GMT


That sounds great! We just need to deal with the platform dependence, but maybe we can work through that by storing different outputs.
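
One way the platform dependence could be handled, sketched here under the assumption that records get a per-platform suffix with a shared fallback (the naming scheme is made up):

```python
import platform
from pathlib import Path


def golden_path(gold_dir, name):
    """Prefer a platform-specific golden record; fall back to the shared one."""
    tag = "%s-%s" % (platform.system().lower(), platform.machine())  # e.g. "darwin-arm64"
    candidate = Path(gold_dir) / ("%s.%s.gold" % (name, tag))
    return candidate if candidate.exists() else Path(gold_dir) / (name + ".gold")
```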


alashworth commented 5 years ago

Comment by bob-carpenter Saturday Jun 17, 2017 at 19:08 GMT


When sampling algorithms broke in the past, it was because algorithms changed. Those changes intentionally break any regressions that are trying to measure reproducibility at the floating point/seed level. So I don't see how this kind of test could have caught the sampler bugs. I think what we want to focus on for this is the kind of testing Michael has up in the testing repo.

We may also want regression tests for efficiency. Again, those are challenging in the face of stochastic algorithms with intentionally changing behavior.

Regressions would be able to catch the bugs introduced into interfaces, such as losing warning messages during initialization or in the parser when a model becomes unparseable (though the latter is easy to test within Stan).

alashworth commented 5 years ago

Comment by syclik Saturday Jun 17, 2017 at 19:33 GMT


This would be in addition to the statistical correctness tests. If we are serious about having more contributors, we have to make sure things don't inadvertently change.

For example, if we changed a warning message, I would not expect any draws to be different. When we added file includes, I also expected nothing to change for the draws. If they do, then at the very least we should know about it. These are things that are going to be harder to catch at the unit level.

Most changes shouldn't cause different behavior. I see things like algorithmic changes, some updates to libraries, and bug fixes causing differences relative to a run with the previous hash. Refactoring, changes to autodiff memory handling, new functions, and most language updates shouldn't. When they do, it'd be nice to know, and the developer should think about whether it makes sense that they do.


alashworth commented 5 years ago

Comment by syclik Saturday Jun 17, 2017 at 19:41 GMT


My main concern is with the stability of our code base. If I can't trust that code runs the same after merging with develop, then it gets harder and harder to have distributed development.

alashworth commented 5 years ago

Comment by syclik Saturday Jun 17, 2017 at 19:44 GMT


If there's any other / easier way of accomplishing it, please suggest it. I don't think golden tests are ideal, but if there isn't a viable alternative, I'd rather have it than not have it.

alashworth commented 5 years ago

Comment by bob-carpenter Saturday Jun 17, 2017 at 23:50 GMT


@syclik I'm not arguing that these "golden tests" aren't necessary; I'm arguing they aren't sufficient. They wouldn't have caught the problem with output disappearing. They won't catch performance regressions for time or memory.

I do see two problems with these golden tests. The first is that it becomes massively painful to make minor updates like fixing typos in messages. When those are recorded all over the place, all the gold tests need to be regenerated. I think that's what @seantalts was getting at. The second problem is that if it's easy to generate new tests, it'll be easy to miss things by regenerating the tests too frequently. If you and he manage to get them going, we'll have to see how that goes.

Testing is a delicate balance when it comes to distributed development. I don't think more is always better, at least insofar as recruiting more developers is concerned. There are many obstacles in the way of contributors: the complexity of the C++, the relative lack of documentation, the constantly changing underlying organization of the code base, our rather scattershot and inconsistent approach to code review (particularly getting conflicting feedback), the fact that code to do one thing can be spread over a half dozen files, and the extensive testing we require before merging.

And no, I don't have good alternatives to propose.

alashworth commented 5 years ago

Comment by seantalts Sunday Jun 18, 2017 at 16:28 GMT


I think if @syclik wants to add tooling such that regenerating golds is just an extra flag to runTests.py, it could become minimally invasive? We also probably wouldn't just gold all existing tests; rather, it seems like we'd make some new tests specifically for this purpose. But Bob is also right in that each thing like this adds additional hurdles to contributions. It'd be great if we could replace some of our existing tests or process with this new kind as a tradeoff to try to keep developer burden low.

alashworth commented 5 years ago

Comment by betanalpha Monday Jun 19, 2017 at 03:53 GMT


Having golden tests or regression tests is all well and good, but as Bob and Sean have noted they can be a massive burden on new developers. Not only do they go stale easily and require constant updates, it's not even clear to the developer which changes might cause those tests to break. So not only would we need code to autogenerate new tests (I ended up writing a bunch of scripts to do that for the stan_read_csv function, which has such tests), we would need some kind of hook that checks whether any changed files correspond to a given golden test and warns the submitter that the test might need to be updated (ideally as easily as running an existing script and, if the golden test changed, adding it to the PR).
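
For what such a hook could look like, here is a rough sketch; the source-to-golden-suite mapping and the paths are invented for illustration:

```python
import subprocess

# Which source prefixes are covered by which golden-test suites (hypothetical layout).
GOLDEN_COVERAGE = {
    "src/stan/services/": "test/golden/services",
    "src/stan/mcmc/": "test/golden/samplers",
}


def changed_files(base="develop"):
    """List files that differ from the base branch, via `git diff --name-only`."""
    out = subprocess.run(["git", "diff", "--name-only", base],
                         capture_output=True, text=True, check=True)
    return [line for line in out.stdout.splitlines() if line]


def golden_tests_to_review(base="develop"):
    """Golden-test suites whose covered source files were touched by this change."""
    hits = set()
    for path in changed_files(base):
        for prefix, suite in GOLDEN_COVERAGE.items():
            if path.startswith(prefix):
                hits.add(suite)
    return hits


if __name__ == "__main__":
    suites = golden_tests_to_review()
    if suites:
        print("Changed files touch code covered by golden tests; consider "
              "regenerating and reviewing: " + ", ".join(sorted(suites)))
```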


alashworth commented 5 years ago

Comment by syclik Monday Jun 19, 2017 at 04:36 GMT


I hear the arguments for it being a burden. But I think we're framing it the wrong way.

It should be seen as something that makes sure the rest of Stan isn't broken. It gives a more objective standard for what code is and isn't included. And it keeps the code base moving forward. For new developers, it should be a relief that the first thing they contribute isn't going to tank the code base.

Yes, we need to design it so it's not a burden. We can take that as a requirement. We should also cap the run time somehow.

The old tests weren't designed with this in mind. It was just something quick that I threw up to make sure we didn't introduce things that tanked the speed. Then, after behavior changed and we introduced major bugs that had us scrambling and backtracking on releases, I tacked on checking consistency across different runs. I'm still glad it's been checked pretty regularly. It's stopped at least a handful of subtle bugs from going out since it's been in. Most of the time, this hasn't been a burden on the developer because it runs with our continuous integration and we didn't ask devs to run it locally. For most pull requests, it isn't even something we have to think about. It's really helped to identify pull requests that needed further review beyond just inspecting code. Once again, intentionally changing behavior is ok with me. Accidentally changing how Stan works and letting that slide isn't. (Michael, sorry it's slowed down some of your work. That wasn't the intent.)


alashworth commented 5 years ago

Comment by betanalpha Monday Jun 19, 2017 at 12:43 GMT


Your description is far too subjective and illustrates the asymmetric nature of the burden of these tests. When I was tweaking the samplers these tests would cause no end of pain, and even when I had started putting scripts together it made the process a pain in the ass. And I was expecting it! Especially bad: at one point the tests were being run on a machine that gave different bitwise behavior than a Mac, and I had to wait for the test to run on Jenkins to see which new values I needed to put into the tests for CI to pass.

The point is that our testing is littered with fragile tests like this that were put in as overreactions to bugs. It was one thing when you were around full time to escort each PR through those tests, but they evolve from annoyance to dangerous burden when you cannot commit your full time to watching them.

So, again, while some golden/regression tests are welcome the burden is on you to put together a manageable system with hooks to identify when tests might need to be updated and scripts to automatically do that updating before anything new is added to the test framework.


alashworth commented 5 years ago

Comment by syclik Monday Jun 19, 2017 at 13:50 GMT


On Jun 19, 2017, at 8:43 AM, Michael Betancourt notifications@github.com wrote:

> Your description is far too subjective and illustrates the asymmetric nature of the burden of these tests. When I was tweaking the samplers these tests would cause no end of pain, and even when I had started putting scripts together it made the process a pain in the ass. And I was expecting it! Especially bad was at one point the tests were being run on a machine that gave different bitwise behavior than a Mac and I had to wait for the test to run on Jenkins to see which new values I needed to put into the tests for the CI to pass.

Those tests could have been replaced by better ones, and if you had asked for help, I would have gladly helped you get that off the ground. Once again, sorry it was a pain, but the usefulness of those tests in preventing gross errors outweighs your particular pain.

So, instead of just complaining, want to help describe what would be useful and what would be unwieldy?

> The point is that our testing is littered with fragile tests like this that were put in as overreactions to bugs. It was one thing when you were around full time to escort each PR through those tests, but they evolve from annoyance to dangerous burden when you cannot commit your full time to watching them.

I don't know where to start here. My job was never to escort bad pull requests. I did it as someone who cares about the community and the project. It was necessary at the time. Since I'm no longer at Columbia full time, it's even more important that the devs (the collection of all of us) maintain confidence in the project. This isn't and hasn't been a sandbox for Columbia / Andrew's research group for quite some time now.

> So, again, while some golden/regression tests are welcome, the burden is on you to put together a manageable system with hooks to identify when tests might need to be updated and scripts to automatically do that updating before anything new is added to the test framework.

Yup. That's what I agreed to a few comments ago.


alashworth commented 5 years ago

Comment by bob-carpenter Monday Jun 19, 2017 at 16:37 GMT


I think we can all agree that we released broken code that might have been caught with the right tests. Just to remind everyone, two relevant cases are:

- breaking changes to the samplers
- losing error output when the sampler catches an exception

Gold tests would catch the latter, but not the former.

I also think we can all agree that the current brittle tests are a pain point for developers.

The decision on how to move forward spans repos. We are not being consistent about testing practices across repos.

alashworth commented 5 years ago

Comment by syclik Monday Jun 19, 2017 at 17:24 GMT


There's one more relevant test case:

- changes to the code base that pass unit tests, but change behavior in the samplers due to some side-effects

Gold tests would also catch that even in light of weak unit tests. We've already stopped a handful of those from accidentally getting into the code base.


alashworth commented 5 years ago

Comment by bob-carpenter Tuesday Jun 20, 2017 at 00:05 GMT



Thanks---that's a good case that the gold tests would catch: changes to something other than the samplers that aren't intended to change sampling behavior, but in fact break sampling behavior.

I don't recall that happening, but I can see how it could.

alashworth commented 5 years ago

Comment by bob-carpenter Tuesday Jun 20, 2017 at 00:23 GMT


On Jun 19, 2017, at 6:46 PM, breckbaldwin notifications@github.com wrote:

> In reading this I see a few meta-issues floating to the top and have some questions and observations.

> • Have you all mind-melded with the proposal? Everyone that has commented so far is totally up to date on the characterization tests that were helpfully linked, right?

I think we're all on the same page. The discussion is around what it would look like in practice, specifically:

• how easily it could be automated
• whether it would catch any errors

> • The back and forth got a little snippy there. Fuck that shit.

Agree with the sentiment if not its expression.

> Overall we need a process that makes decisions in some sane fashion. In open source it seems like the preferred ordering is consensus > consensus with individual grumbling > vote > fork. I don't see any benevolent dictators at this point. A separate discussion but one we need to have.

I think we may be at the consensus with grumbling stage here, but I'm not sure. I'm willing to give it a try.

I also agree that broken software is worse than grumbling developers, but it's a matter of degree. If it gets too hard to work on Stan, people will just quit. I've felt near that point myself on several occasions, because it's gotten very time consuming to produce even the simplest code change. Major changes, like the refactor Daniel took on, took years. The reason it took so long is that he insisted on having these tests (though apparently, they weren't thorough enough, because we still broke the software).

This isn't magic. Unless we have the right programs to run that tickle the right output, they won't be testing everything. Now we know to include cases where initialization can't get off the ground. But what else are we missing? Do we include a test for all N^K possible combinations of arguments to each interface function?

The other issue is where to test these things. We're in the Stan repo now, but Stan itself doesn't do any of this. We need an interface to plug into the callbacks before we even have output to test against.

> PS What do I think makes the most sense? Do dumb, automated characterization testing on verbose system output just to catch what unit tests don't, and make Sean do it since he is too nice to say no. We can't release broken software, folks.

The question isn't whether we'll release broken software. Bugs happen. The question is what's the cost of bugs vs. the cost of slowing down development. I don't think there's a simple answer and I think I'm between Michael and Daniel on this, closer to Daniel's opinion.

alashworth commented 5 years ago

Comment by sakrejda Tuesday Jun 20, 2017 at 13:30 GMT


If we go down this golden test route, it would be helpful to also put future math tests in that sort of format. So far, changing special functions, I've had to update a lot of very specific calculations that should just live in a table with documentation on the source. Weirdly enough, it seems like a lot of the old implementations wouldn't pass tests with the new test values, even though I got the new values from outside sources (mostly Mathematica). It's a little disconcerting, but with no sourcing on the old values I can't figure out why the reference implementations might have different answers.
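
A sketch of the kind of table-driven setup this suggests, with the reference source recorded next to each value; the file format and the checking helper are hypothetical:

```python
import csv
import math


def check_against_table(fn, table_path, rel_tol=1e-12):
    """Compare fn(x) against tabulated reference values; return failure messages.

    Each row records the input, the expected value, and where that value came from.
    """
    failures = []
    with open(table_path, newline="") as f:
        for row in csv.DictReader(f):  # columns: x, expected, source
            x = float(row["x"])
            expected = float(row["expected"])
            actual = fn(x)
            if not math.isclose(actual, expected, rel_tol=rel_tol):
                failures.append("%s(%s) = %r, expected %r (source: %s)"
                                % (fn.__name__, x, actual, expected, row["source"]))
    return failures


# Example table (values_lgamma.csv), one documented source per row:
#   x,expected,source
#   0.5,0.5723649429247001,"Mathematica, LogGamma[1/2]"
```

Running check_against_table(math.lgamma, "values_lgamma.csv") against such a table would flag any implementation that drifts from the documented references.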


alashworth commented 5 years ago

Comment by breckbaldwin Tuesday Jun 20, 2017 at 15:32 GMT


Sorry for the tone of my post; the cold medicine was not helping, but seriously, sorry.

From what I can see, characterization tests should be low effort on the contributor's part. If they are not, it is a nightmare, and no one is advocating for anything but an automatic process. It would most likely run from a test script that sets up git commits, etc., if the contributor has usefully changed system output.

I think there is consensus that the tests will catch useful errors.

I think there is consensus that similar previous testing systems had problems.

I suggest that we keep the interfaces and more difficult regression tests out of the discussion for now.

A new developer facing a failed characterization test is most likely in deep water, since the failing code will be remote from what they just changed.

The issue is cost of implementation. If Stan could have characterization tests for free and no hassle then we would do it right?

There is a Java JUnit characterization test system that may offer useful guidance if we roll our own.

A potentially lower-overhead characterization test system would be to just use the existing unit testing framework, but with developer-added values that are known to come from successful runs rather than known-to-be-correct values. Could turn nightmarish quickly if tests are dependent on Stan program values.
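
A minimal sketch of that idea in an ordinary unit-test framework, where the expected numbers are whatever a previously trusted run produced rather than known-correct values; the recorded file and the run_sampler hook are hypothetical:

```python
import json
import unittest

from my_stan_wrapper import run_sampler  # hypothetical hook that runs the model under test

RECORDED = "test/characterization/normal_model_seed1234.json"


class NormalModelCharacterization(unittest.TestCase):
    def test_first_draws_unchanged(self):
        # The record stores the seed and the draws a trusted run produced with it,
        # e.g. {"seed": 1234, "first_draws": [0.12, -1.03, ...]}.
        with open(RECORDED) as f:
            recorded = json.load(f)
        draws = run_sampler(seed=recorded["seed"],
                            num_draws=len(recorded["first_draws"]))
        for i, (old, new) in enumerate(zip(recorded["first_draws"], draws)):
            self.assertAlmostEqual(old, new, places=12,
                                   msg="draw %d changed from the recorded run" % i)
```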

I'll look into some possible implementations and report back.

alashworth commented 5 years ago

Comment by bob-carpenter Tuesday Jun 20, 2017 at 16:06 GMT


I'd rather not literally annotate the source, but I am going to switch all the tests so that they basically look like this:

https://github.com/stan-dev/math/blob/fca2a772f75e2bbc10c76380b0c5e33bf31e6fcc/test/unit/math/mix/core/operator_addition_test.cpp

alashworth commented 5 years ago

Comment by bob-carpenter Tuesday Jun 20, 2017 at 16:12 GMT


W.r.t. https://github.com/stan-dev/stan/issues/2331#issuecomment-309797184, the problem with not having interface tests is that we need them for the characterization tests that @syclik is suggesting, because stan-dev/stan doesn't have any output without implementing all the callbacks. Or is there some way to do these that I'm not seeing?

Upstream tests for stan-dev/stan should get kicked off, so we could plumb in interface tests that way.

The piece of the puzzle we're missing is autogenerating new characterization tests when things change on purpose. I don't see how to test that only the intended changes are made. So we'll still need unit-like tests for individual parts of the output.
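
One way to get partway there is to compare the output section by section rather than as one blob, so a reviewer at least sees which parts moved; the section markers and layout here are invented for illustration:

```python
def split_sections(text):
    """Split output on marker lines like '## adapt' into {section_name: body}."""
    sections, name, lines = {}, "header", []
    for line in text.splitlines():
        if line.startswith("## "):
            sections[name] = "\n".join(lines)
            name, lines = line[3:].strip(), []
        else:
            lines.append(line)
    sections[name] = "\n".join(lines)
    return sections


def changed_sections(gold_text, actual_text):
    """Names of sections whose contents differ between the golden and current output."""
    old, new = split_sections(gold_text), split_sections(actual_text)
    return sorted(k for k in set(old) | set(new) if old.get(k) != new.get(k))
```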

alashworth commented 5 years ago

Comment by syclik Thursday Jun 29, 2017 at 08:28 GMT


@bob-carpenter: good points.

Fortunately, we should be able to test with just the Stan library using callbacks. We could also test with the interfaces.

Agreed that it's going to be hard to make sure only parts of the output change. But I think whenever that happens, we should walk through those cases by hand anyway; if a change to the code base changes the behavior for everything, it's probably a good thing that we look at the output everywhere it changed.

I'll need a little more time to come up with a spec. I'll try to have it done next week.