abella-prover / abella

An interactive theorem prover based on lambda-tree syntax
https://abella-prover.org/
GNU General Public License v3.0

Feature request: tracing search failures #89

Open lambdacalculator opened 7 years ago

lambdacalculator commented 7 years ago

When I requested, in version 1.3, that Andrew introduce an option to print search witnesses, he did a wonderful job that exceeded my expectations. Since then, it has been a very useful feature for generations of my students learning Abella. (And I'll add that the more recent instantiations flag is also quite useful.)

In that vein, I have another request: it would be nice, whenever a search fails, to be able to print out a trace of what was tried leading up to the failure. I know it can be notoriously difficult to get output from deep inside the search and unification processes that makes sense to the user, but if there were a way to do it that was not too onerous, it would be very useful, not only for beginners like my students but for experts as well: I usually have to start applying goal tactics like witness, split, left, and right to narrow down where the search failure might be, but a trace printed out after a failed search could make it much easier to focus on where the problem is.

One idea to keep the output relevant to the user would be to report a unification failure in the trace only if some threshold of matched subexpressions is met: for example, if the heads match and more than 30% of the subterms match, or something like that. That way, if the problem was just that one variable in an expression was raised while the other was not, the unification failure would still be reported. This would avoid cluttering the output with obvious unification failures. There is room here for some interesting heuristics that could be improved over time.
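To make the idea concrete, here is a minimal sketch of such a filter in OCaml (the language Abella is implemented in), written over a toy first-order term type; the `term` type and both functions are invented for illustration and bear no relation to Abella's actual term representation or unification code.

```ocaml
(* Toy first-order terms: a variable, or a head applied to arguments.
   Purely illustrative, not Abella's term representation. *)
type term =
  | Var of string
  | App of string * term list

(* Structural equality on terms, used to count agreeing subterms. *)
let rec matches t1 t2 =
  match t1, t2 with
  | Var x, Var y -> x = y
  | App (h1, args1), App (h2, args2) ->
      h1 = h2
      && List.length args1 = List.length args2
      && List.for_all2 matches args1 args2
  | _ -> false

(* Given two terms whose unification just failed, report the failure
   only when the heads agree and more than [threshold] of the argument
   subterms match, so "near misses" surface while obvious mismatches
   stay silent. *)
let worth_reporting ?(threshold = 0.3) t1 t2 =
  match t1, t2 with
  | App (h1, args1), App (h2, args2)
    when h1 = h2 && List.length args1 = List.length args2 ->
      let matched =
        List.fold_left2
          (fun n a b -> if matches a b then n + 1 else n)
          0 args1 args2
      in
      float_of_int matched /. float_of_int (List.length args1) > threshold
  | _ -> false
```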

robblanco commented 7 years ago

I have been playing with complex uses of search and unification lately and sympathize deeply with this predicament, although I do not immediately see a natural way to accomplish it. The search tactic looks for a path to success on behalf of the user, and making sense of an extended failure trace is not trivial, for the same reasons that tracing a logic program can be challenging. Knowing roughly what you are looking for can go a long way, and maybe it is possible to give good hints to Abella.

I do not know how this would carry over to unification problems. Here it is easy to become flooded with information as well, and in isolation, without knowing how it relates to the search, I have not found it very useful in my recent experience. Maybe for relatively shallow attempts like the default search it is manageable, but I suspect it would quickly get out of hand. Do you have an intuition about what information would be useful, which paths are interesting and which can be discarded, and what context is necessary to interpret the information?

In this context, it may be worthwhile to consider the difference between definite failure and lack of success. By the latter I mean an unfinished search cut short by the depth threshold of the tactic.
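To illustrate the distinction, the tactic's outcome could be a three-way value rather than a boolean; this is only a hypothetical sketch, and none of the constructor names below are part of Abella.

```ocaml
(* Hypothetical outcome type for the search tactic, distinguishing a
   definite failure from a search merely cut short by the depth bound. *)
type search_outcome =
  | Proved                   (* a complete proof was found *)
  | Failed                   (* every branch ended in definite failure *)
  | Depth_exhausted of int   (* the bound was hit; carries the depth used *)
```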

chaudhuri commented 7 years ago

I agree with Rob. One thing to note is that when search succeeds, it succeeds for a single reason, whereas when it fails, it can fail for a dazzlingly large variety of reasons. It would not be possible to select any one of these and point to it as the reason the search failed.

What we can implement is something like Isabelle's tracing of the rewrite engine that just prints out everything attempted. (Indeed, this is what we use internally to debug search, but it is commented out.) My experience with such traces is that they are only useful when one has identified a clear bug and one just wants to pinpoint its location. They rarely yield insights about the theorem itself.

lambdacalculator commented 7 years ago

In my initial comment I anticipated the difficulty of extracting useful information from the potentially overwhelming output of a failed search, but my intuition is still that good heuristics can narrow down the problem significantly. Yes, searches fail because of a large number of dead ends that don't pan out, but I'm interested specifically in the "near misses", where the search would have been able to continue if not for a unification that "almost" succeeded.

We can assume that if a search is issued, the expectation is that it will succeed. It is probably the case, therefore, that large parts of the search will be "routine": unfoldings will proceed along unique paths, and witnesses will be found that prove three out of four conjuncts in the goal and supply most of the unique unfolding of the last conjunct, except for a permuted nominal, an eigenvariable clash, or a clearly identified but missing hypothesis that stands in the way of success. Information about these near misses and what would have made them successes could be quite useful and not at all overwhelming.

Perhaps what I am proposing is that the unification algorithm, instead of just returning failure, return some kind of measure of how close it came to succeeding. This metric could then be fed to the heuristic that fine-tunes the reporting mechanism, keeping the information relevant.
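As a rough illustration, the result type of unification could carry such a measure on failure; the types and names below are hypothetical and do not reflect Abella's actual unification interface.

```ocaml
(* Hypothetical unification result: success with a substitution, or a
   failure graded by how much of the problem matched before the clash. *)
type 'sub unify_result =
  | Unified of 'sub
  | Mismatch of {
      score : float;   (* fraction of subterms that matched *)
      at : string;     (* description of the clash site *)
    }

(* The reporting heuristic can then filter on the score. *)
let report_if_near ?(threshold = 0.3) result =
  match result with
  | Mismatch { score; at } when score > threshold ->
      Printf.eprintf
        "near-miss unification failure at %s (%.0f%% of subterms matched)\n"
        at (100. *. score)
  | Mismatch _ | Unified _ -> ()
```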

As for search bounds being exceeded, I think this is a different issue. Did the search fail because of a loop of some kind, where the search space is relatively small but keeps producing similar terms with new variables? Maybe an induction could be suggested. Or did the search fail because of a huge search space that couldn't be explored? There's probably not much useful information that could be given in this case, except maybe some summary statistics about the search that might help the user narrow it down.

chaudhuri commented 7 years ago

Oops! Got the number wrong in my commit message.

robblanco commented 7 years ago

I considered something along the lines of @lambdacalculator's last comment. A simple compression of a "failure trace" could report only the "nodes" of the tree, i.e., bifurcation points and dead ends, while compacting the work done linearly between them, along the lines of the sketch below.
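For concreteness, here is one hypothetical shape such a compression could take, over an invented trace type (none of this corresponds to Abella internals):

```ocaml
(* An invented trace of a search: dead ends, deterministic steps, and
   bifurcation points where several alternatives were tried. *)
type trace =
  | Dead_end of string       (* why this branch failed *)
  | Step of string * trace   (* one deterministic step *)
  | Branch of trace list     (* a bifurcation point *)

(* Collapse each linear run of steps into a single summarized step,
   keeping only branch points and dead ends visible. *)
let rec compress = function
  | Dead_end _ as t -> t
  | Step (_, rest) ->
      let rec skip n = function
        | Step (_, r) -> skip (n + 1) r
        | t -> (n, t)
      in
      let n, tail = skip 1 rest in
      Step (Printf.sprintf "%d linear step(s)" n, compress tail)
  | Branch ts -> Branch (List.map compress ts)
```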

I also think I see where you are going with this notion of how close to a successful unification we can get, and maybe there is something interesting there.

More routinely, as Kaustuv noted, having a tracing flag à la Isabelle, instead of compiling Abella for debugging, seems interesting, although the only use I have had for such traces has been hunting bugs, not inspecting proofs.