
Compiling to Categories
http://conal.net/papers/compiling-to-categories
BSD 3-Clause "New" or "Revised" License

Gold Tests for successful examples #26

Closed isovector closed 6 years ago

isovector commented 6 years ago

This PR uses tasty-golden to generate gold tests for every passing example from Examples.hs (as marked in #25). It also pulls some of the testing utils out into their own Miscellany module, so that Examples.hs and BasicTests.hs can share them. The new tests are exposed in their own cabal stanza, so they shouldn't affect any existing workflows.

It takes about 10 minutes on my machine to build this module. I suspect we could pull better compilation performance out of it if we separated it into several smaller modules, and compiled with -j, but that's a job for another PR.
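For illustration, the split-and-parallelize idea might look something like the stanza below. This is a hypothetical sketch, not the PR's actual cabal file; the module names and paths are made up, and GHC's -jN flag enables parallel compilation of modules within the test-suite.

```cabal
-- Hypothetical sketch: the one big test module split into smaller ones,
-- compiled in parallel with GHC's -j flag. Names are illustrative only.
test-suite gold-tests
  type:             exitcode-stdio-1.0
  hs-source-dirs:   examples/test
  main-is:          GoldTests.hs
  other-modules:    Gold.Basic
                  , Gold.Circuits
  ghc-options:      -j4
  build-depends:    base, tasty, tasty-golden
```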

Gold tests are written to examples/test/gold/, and can be automatically regenerated via stack test :gold-tests --test-arguments="--accept" should they become out of date.
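For readers unfamiliar with tasty-golden, a minimal sketch of how such a test is typically wired up follows. The renderExample action and the test name are hypothetical stand-ins, not code from this PR; the library call used is goldenVsString from Test.Tasty.Golden.

```haskell
-- Minimal tasty-golden sketch. `renderExample` is a hypothetical stand-in
-- for whatever produces an example's textual output.
import qualified Data.ByteString.Lazy.Char8 as BS
import           Test.Tasty        (TestTree, defaultMain, testGroup)
import           Test.Tasty.Golden (goldenVsString)

renderExample :: IO String
renderExample = pure "example output"

-- Compares the action's output against examples/test/gold/<name>.golden;
-- running with --accept rewrites the gold file instead of failing.
goldTest :: String -> IO String -> TestTree
goldTest name act =
  goldenVsString name ("examples/test/gold/" ++ name ++ ".golden")
                 (BS.pack <$> act)

main :: IO ()
main = defaultMain $ testGroup "gold-tests"
  [ goldTest "example" renderExample ]
```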

This PR doesn't merge cleanly yet; I wanted to get some feedback on its mergeability before I go through the effort of resolving the conflicts.

conal commented 6 years ago

Thanks for this work! Some discussion points:

isovector commented 6 years ago

tl;dr: What would we do with Examples in an ideal world? What should we do with Examples today?


Curious about the failed tests. Some of them failed for missing imports, so I wonder if there's been an environmental change that fixed them. The timeout ones are likely just tests that were too slow for me -- I set a timeout of 30 seconds for them to compile (some seem to take exponential time, which really adds up when you're compiling a couple hundred of these things). The good news is that the tooling in #25 should allow me to update the statuses of these tests quickly.

My plan here is indeed to move the tests out of Examples. I'm OK leaving them there as well, but there's going to be a cognitive burden in terms of which set of tests is canonical. We could also just say Examples is Conal's playground and leave it unmolested while working on systematizing the testing in other modules. IMO the real solution here is to figure out why the plugin doesn't work in GHCi and fix that, rather than doing REPL-like things with modules.

I agree strongly with your last two points, but would prefer to work towards them incrementally. The amount of work involved in sorting out Examples is already nigh overwhelming :)

conal commented 6 years ago

Thanks for the thoughts.

IMO the real solution here is to figure out why the plugin doesn't work in GHCi and fix that, rather than doing REPL-like things with modules.

Amen to getting the plugin working with GHCi! It'd still be problematic to byte-compile a module that contains many examples if they're not mostly disabled, since the time-consuming compilation would happen when loading (byte-compiling) a module rather than performing a test. For Emacs, I guess I'd write some elisp code that zaps #if-hidden examples to the repl.

conal commented 6 years ago

@capn-freako Any thoughts on this PR and issues of testing?

capn-freako commented 6 years ago

Hi all,

Full disclosure: I don't really know much about proper software testing methodologies, at all. :( (Hoping you guys can educate me.)

My experiences/hopes/frustrations/opinions are all largely in line w/ what I read above, specifically (and, so that you can check my understanding):

  1. I would REALLY love to be able to run plug-in dependent code from the GHCi command prompt!!!

  2. I agree with the need to maintain a testing "play" area that is allowed to live outside the rules we adopt to govern the process of changing our developing canonical regression suite. The situation is still much too dynamic to get rid of that, now. That said, I'd love an alternative to the commenting/uncommenting of code we've been doing. That has always seemed fraught with peril to me.

  3. I love the idea of extracting the more stable tests from the play area, above, to form our "golden", or "canonical", regression suite. I think having such a suite for, for instance, gating source code pushes will prove valuable, as more contributors join the effort.

Sorry I really can't do much other than reaffirm what you guys have already said. If something new, or contrary, occurs to me, I'll post here so we can discuss.

Thanks for driving this forward, Sandy!

-db

isovector commented 6 years ago

If nobody has any concrete solutions, I'll suggest the following:

1. We merge this PR with the explicit goal of uncommenting as many lines of code as possible. Code that runs is >>>> code that can rot because it's a comment.
2. We continue pushing to remove all comments from Examples, but continue to use Examples as a place for interactively working on concat. Working on improving specific, old examples is done by yanking the relevant example from the tests.
3. Instead of commenting out new examples in Examples, they are instead moved into the gold testing framework.
4. We set up CI so that no human needs to actually run the sloooow tests by hand, and instead you'll get a notification on a PR if it breaks anything (or an email, if you have direct commit access).
5. I continue splitting out tests, sorting them by which categories they test (or by whatever criteria y'all would prefer).
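The CI point could be as simple as a config along these lines. This is a hypothetical Travis-style sketch, not the project's actual setup; the caching and install steps would need tuning, though the stack installer URL is the documented one.

```yaml
# Hypothetical CI sketch for running the slow gold tests on every PR.
language: generic
cache:
  directories:
    - $HOME/.stack     # cache compiled dependencies between builds
install:
  - curl -sSL https://get.haskellstack.org/ | sh
script:
  - stack test :gold-tests
```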

If you are happy with (2) and (3), I don't mind continuing to fight with the tests and setting up CI.

Thoughts?

conal commented 6 years ago

Thanks for the comments, @capn-freako. I don't have much experience or insight about testing either, so I'm glad for the perspectives.

conal commented 6 years ago

Thanks for the articulation, @isovector! I'm happy with this plan. Please do continue and expect me to approve the pull request. I'm not very good at git merging, so I'll appreciate whatever you can do to help ease that process.