Kattis / problemtools

Tools to manage problem packages using the Kattis problem package format.
MIT License

default validator's floating-point tolerance behavior #102

Open - evouga opened this issue 5 years ago

evouga commented 5 years ago

I've had some questions about Kattis's floating-point validator, and I figure this is the best place to start the discussion:

    // The check in the default validator: the team value tval is compared
    // against the judge value jval, which is implicitly assumed to be exact.
    if (!(fabs(jval - tval) <= float_abs_tol) &&
        !(fabs(jval - tval) <= float_rel_tol * fabs(jval))) {
        wrong_answer("Too large difference.\n Judge: %s\n Team: %s\n Difference: %le\n (abs tol %le rel tol %le)",
                     judge.c_str(), team.c_str(), jval - tval, float_abs_tol, float_rel_tol);
    }

For the sake of a concrete example, let's say that a problem asks the user to output pi, to absolute error 1e-3. The above code assumes that the judge output, jval, is exact: that it is specified to infinite precision. If this is not the case, for instance if the judge answer is 3.141 (which, note, is correct to absolute error 1e-3), then a contestant's answer of 3.1425 would be judged as incorrect, despite also being within the absolute error tolerance of the exact solution.

Now, one might say, "just add more digits to the judge solution." This alleviates but does not fully resolve the problem: if the judge solution is 3.141592653589 then a contestant's answer of 3.1425926535891 will be judged incorrect, despite technically meeting the judging criterion. Moreover, although we can provably generate arbitrarily many correct digits of pi, for many floating-point problems, the exact solution is unknown and unknowable, and even a carefully crafted judge solution will only print a finite number of accurate digits, before printing just numerical noise.
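To make the numbers in that second example concrete, here is a small standalone check (purely illustrative, not part of problemtools): the team value is within 1e-3 of pi, but not within 1e-3 of the truncated judge value, so the current comparison rejects it.

    #include <cmath>
    #include <cstdio>

    int main() {
        double exact = 3.14159265358979323846; // (very close to) the exact answer
        double jval  = 3.141592653589;         // truncated judge answer
        double tval  = 3.1425926535891;        // team answer, within 1e-3 of pi
        double tol   = 1e-3;
        // |tval - exact| is about 9.99e-04, so the answer meets the stated tolerance...
        std::printf("vs exact: %.3e accepted=%d\n", std::fabs(tval - exact), std::fabs(tval - exact) <= tol);
        // ...but |tval - jval| is about 1.0000000001e-03, so the current check rejects it.
        std::printf("vs judge: %.3e accepted=%d\n", std::fabs(tval - jval), std::fabs(tval - jval) <= tol);
        return 0;
    }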

So I think Kattis needs to take into account both the error in the contestant's solution and the error in the judge data. One possibility is to require judges to guarantee that their solutions are at least as precise as the tolerances in the problem statement; another is to allow them to assert their own, possibly tighter, tolerances for their solutions. Let's look at the latter situation, since the former is just a special case:

==Absolute Error==

This one's easy: if the problem statement tolerance is float_abs_tol and the judge tolerance is float_abs_jtol, then Kattis should check fabs(jval - tval) <= (float_abs_tol + float_abs_jtol); by the triangle inequality, any team answer within float_abs_tol of the exact value also satisfies this check.

==Relative Error==

This one's trickier; we know that | (jval - x) / x | < float_rel_jtol where x is the (unknown) exact value. It follows that | jval / x | > 1 - float_rel_jtol and | x / jval | < 1 / (1 - float_rel_jtol).

Now we want the judge to check that | (tval - x) / x | < float_rel_tol, but we don't know the value of x. However, since (tval - x) / x = (tval - jval) / x + (jval - x) / x, the triangle inequality gives | (tval - jval) / x | <= | (tval - x) / x | + | (jval - x) / x | < float_rel_tol + float_rel_jtol, and, using the bound on | x / jval | above, | (tval - jval) / jval | < (float_rel_tol + float_rel_jtol) / (1 - float_rel_jtol), which is an inequality that Kattis can verify.
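Concretely, a minimal sketch of what the adjusted check could look like, assuming a hypothetical pair of judge-tolerance settings float_abs_jtol and float_rel_jtol (neither of which exists in problemtools today):

    #include <cmath>

    // Sketch only: accept tval whenever it could be within the stated tolerances
    // of the unknown exact value, given that jval is within the judge tolerances.
    bool within_adjusted_tolerance(double jval, double tval,
                                   double float_abs_tol, double float_rel_tol,
                                   double float_abs_jtol, double float_rel_jtol) {
        double diff = std::fabs(jval - tval);
        // absolute: |jval - tval| <= float_abs_tol + float_abs_jtol
        bool abs_ok = diff <= float_abs_tol + float_abs_jtol;
        // relative: |jval - tval| <= |jval| * (float_rel_tol + float_rel_jtol) / (1 - float_rel_jtol)
        bool rel_ok = diff <= std::fabs(jval) * (float_rel_tol + float_rel_jtol)
                                              / (1.0 - float_rel_jtol);
        return abs_ok || rel_ok;
    }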

==TL;DR==

simonlindholm commented 5 years ago

In contests I've helped prepare, we've typically used long doubles or similar for the judge, to get very high-precision answers. For the very few problems (like https://finals.codingcup.se/problems/kubb) where we haven't been able to determine the answer with high precision, we've set the problem.yaml float_tolerance to a higher value than listed in the statement. That's not ideal, though, because it means we lie to the contestants (assuming the judge answer is in fact more precise than we know it might be in the worst case).

I don't think we should change the behavior here: it breaks backwards compatibility, and it would mean lying to contestants on every problem, with an advantage to contestants who know this... If problem authors have low-precision answers, it's on them to deal with that, e.g. by raising float_tolerance.

Automatically inflating tolerances by, say, a factor of 1.01 (rather than 2) would be more palatable, but I don't think it solves any real problem; it's very artificial and confusing, and simplicity should win out.

As a contestant, you should expect to have similar problems with slightly incorrect judge answers as you would have with limited floating point precision in general. Outputting exact_answer * (1 + precision) is a bad idea no matter what, and I don't think it's something people should expect to work (as compared to say exact_answer * (1 + 0.7 * precision), which is fine).

evouga commented 5 years ago

@simonlindholm There is no free lunch. We are lying to the contestants either way.

We can choose between:

My philosophy (that perhaps you disagree with) is that false positives are a necessary evil, and false negatives are never OK. If a contestant submits an answer that provably satisfies the problem statement, Kattis must judge that answer as AC. The contestant should never be put into a situation where they have to guess whether their WA is truly a bug in their submission, or a bug in Kattis/the judge data.

For example, if I am solving a problem using binary search (like https://open.kattis.com/problems/sheep), with a strong guarantee of no false negatives, I can confidently set the binary search tolerance to 0.005 and know my solution is correct. With the status quo, there is no guarantee, and I need to worry about sticking fudge factors into my own correct code to protect myself from the judge.

By that philosophy, if current Kattis problems judge correct answers (according to the problem statement) as WA, they are currently broken. I don't see a problem with backwards compatibility.

If you can guarantee for some problems that the judge is correct to within, say, 1e-10, then you can use that number in the above formulas. I agree there may also be better ways to set defaults for old problems (by assuming that all digits in the judge data are significant, for instance).

austrin commented 5 years ago

Some comments.

In general I completely agree that false positives are preferable to false negatives.

My baseline view is that it is the problem creators' responsibility to correctly configure the judging of their problems. I also don't really have an issue with false negatives in situations where an answer reaches 99.9999% of the allowed error tolerance, because of the following:

> For example, if I am solving a problem using binary search (like https://open.kattis.com/problems/sheep), with a strong guarantee of no false negatives, I can confidently set the binary search tolerance to 0.005 and know my solution is correct.

That's not quite true though, is it? In most cases you will have small rounding errors due to floating point operations in your code, and unless you always print with 50+ digits you are going to incur additional rounding errors when you print your answer, so in order to be confident in your answer you need a little slack regardless.

I do like the general idea of making the tools operate in a way that make it harder to misconfigure problems, but like @simonlindholm I see issues with backwards compatibility and/or transparency with changing the default behavior of the default validator.

However, another option could be to attempt to detect poorly configured real-value tolerances and issue a warning in those cases. For instance, if the validator's float_tolerance is set to 1e-6 and no answer files have more than, say, 7 digits, it would probably be advisable to adjust the float tolerance to 1.1e-6.
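As a sketch of what such a heuristic could look like (hypothetical helper, not existing problemtools code; the 1% threshold is arbitrary):

    #include <cmath>
    #include <cstdio>
    #include <string>
    #include <vector>

    // Warn when the rounding error implied by the number of printed decimals is
    // significant compared to float_tolerance.
    void warn_if_low_precision(const std::vector<std::string>& answer_tokens,
                               double float_tolerance) {
        for (const std::string& tok : answer_tokens) {
            std::size_t dot = tok.find('.');
            std::size_t decimals = (dot == std::string::npos) ? 0 : tok.size() - dot - 1;
            double rounding_error = 0.5 * std::pow(10.0, -static_cast<double>(decimals));
            if (rounding_error > 0.01 * float_tolerance) {
                std::printf("WARNING: '%s' has only %zu decimals (rounding error %.1e); "
                            "consider raising float_tolerance %.1e accordingly\n",
                            tok.c_str(), decimals, rounding_error, float_tolerance);
            }
        }
    }

    int main() {
        // 7 decimals against a 1e-6 tolerance, as in the example above: this warns.
        warn_if_low_precision({"3.1415927"}, 1e-6);
        return 0;
    }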

Somewhat related, I think it would also be a good idea to always fudge the comparisons in the default validator by something like a 1+1e-10 factor just to get on the right side of all floating point errors within the validator itself (e.g., currently with an absolute error tolerance of 1e-2 and a judge answer of 0.30, the answer 0.29 would be rejected).
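To make that last example concrete, a quick standalone demonstration (illustrative only, not validator code) of how a 1 + 1e-10 fudge factor puts the 0.30 vs 0.29 case back on the right side of the comparison:

    #include <cmath>
    #include <cstdio>

    int main() {
        double jval = 0.30, tval = 0.29, tol = 1e-2;
        double diff = std::fabs(jval - tval);
        // In double precision the difference comes out slightly above 1e-2.
        std::printf("diff - tol = %.3e\n", diff - tol);               // small positive value
        std::printf("plain check:  %d\n", diff <= tol);               // 0: rejected
        std::printf("fudged check: %d\n", diff <= tol * (1 + 1e-10)); // 1: accepted
        return 0;
    }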

evouga commented 5 years ago

> That's not quite true though, is it? In most cases you will have small rounding errors due to floating point operations in your code, and unless you always print with 50+ digits you are going to incur additional rounding errors when you print your answer, so in order to be confident in your answer you need a little slack regardless.

Yes, I agree. Although even in those cases you can point to the user's program and say, "here is the problem: it is in your code (handling of intermediate values, conversion of the final output to decimal, etc.), and Kattis is not to blame." That is not so when the imprecision is a result of imprecision in the judge data.

Talking to Greg about this yesterday, I believe the current philosophy is that Kattis treats the judge data as the correct answer, and of course Kattis does currently correctly enforce correctness with respect to that judge data. This position is not unreasonable, but IMO not fully satisfying. When I as a contestant read "your answer will be judged correct if it has relative or absolute error 1e-4" I interpret this error as being with respect to the exact solution of the problem in the problem statement, and not with respect to some specific judge reference implementation.

In my ideal universe Kattis would require problem-setters to provide the judge data uncertainty float_jtol in addition to the float_tol, which Kattis then uses during grading as described in my first post above. It is the responsibility of the problem writer to correctly configure this value (which would also account for the fudge factor you describe), with Kattis warning if the value is too high (relative to the float_tol) or if the judge data does not seem to conform to this tolerance. Past problems can be grandfathered in to have the float_jtol set to zero (and, in my ideal universe, would be at some point upgraded with tolerances set according to the accuracy of their current judge data.)

I'm not so convinced by arguments about transparency since, in my opinion, most readers would interpret current problem wording like "your answer will be judged correct if it has relative or absolute error 1e-4" as describing the proposed system where the comparison is to a platonic, exact solution, rather than the current system comparing to a noisy judge solution.

austrin commented 5 years ago

> I believe the current philosophy is that Kattis treats the judge data as the correct answer

FTR (and to be somewhat nitpicky) I would not exactly agree with this. My view is that the judge data is a correct answer, and that correct answers are whatever the output validator for a problem accepts (which, again, brings it back to the problem developer to define what correct answers are).

> When I as a contestant read "your answer will be judged correct if it has relative or absolute error 1e-4" I interpret this error as being with respect to the exact solution of the problem in the problem statement,

I agree that if a problem has that exact wording then it sounds like the problem writer is making the kind of promise you refer to (and IMO with that wording, the problem writer should make sure that the claim is true by setting the tolerance to something like 1.01e-4 and having sufficient precision in the answer files). But these precision blurbs come in many different forms:

> I'm not so convinced by arguments about transparency

Just to clarify, the transparency issue I was referring to was not wrt problem solvers, but wrt problem developers, and mostly for the potential solution where some pessimistic estimate on the judge rounding error (e.g. 2x the tolerance) was automatically used under the hood without warning.

I'm not completely opposed to adding a flag like the one you suggest and at least issuing a warning if it isn't provided, but I am somewhat skeptical that it will be effective (my guess is that it would be (a) too easy to misconfigure and (b) too annoying). I'm going to poll around a bit to see how other people would feel about this, stay tuned.

godmar commented 5 years ago

Interesting thread. Philosophically I agree with Etienne that the default interpretation of tolerance should be with respect to the "true" answer, not the judge computed answer.

As a first step, I really like the idea of strengthening or implementing heuristics to check whether the problem setter's judge answer data and the given tolerance setting are consistent/meaningful, based on the provided judge answer data. Here's an example where I felt this would have alerted the problem setters to other issues with their problem. For instance, in ECNA/2017/Craters, the package specified float_tolerance 1e-6 and the judge data was:

1715.91229929
6346.01716025
111478.75838897
74162.07487298
52985.52337647
31478.75838897
6360.93987273
6494.11270793
6694.13628354
80069.11503838
380.99111843
384.99111843
80069.11503838
62894.81984987
95438.75036108
92812.62277694
98076.41744549
96747.59251904
94790.98126736
97663.45606102
97216.39893377
96925.75523324
97546.81798702
94061.31791121
99849.18129575
4976.28276329

As a result, solutions that cheesed the problem by treating circles as polygons of as few as 1,400 points, such as this one, passed within the time bounds. None of the judge solutions used this approach to discretization (though no wrong_answers did, either), so it's probably not what they intended. I think problemtools could have warned about the discrepancy between the precision/magnitude of the judge data and the requested tolerance type.

austrin commented 5 years ago

@godmar your example seems to be of a different type of issue and I don't understand what kind of check you are hoping for that would catch it. Without knowing in details what the problem is about (which problemtools obviously can't), the given tolerance and data look pretty OK -- one can argue that the tolerance should be 1.0001e-6 to account for the judge rounding errors as discussed in this thread, but one can't agnostically say that the specified tolerance is too lax.

godmar commented 5 years ago

Isn't part of the underlying problem that we do not know what the likely error in the judge answer data is? @evouga wants the problem setter to specify this value explicitly; I'm wondering whether it could be inferred from how many decimal digits the judge answer has.

First, this would allow for the check to be adjusted using the interval-style arithmetic @evouga suggests, without requiring a separate float_jtol parameter.

Second, a warning would be output if there's a discrepancy between the implied precision of the judge answer data and the asked-for precision, such as in the ECNA 2017 problem, where the judge data has between 11 and 13 decimal digits, 5 more than the requested tolerance. My guess is that if they had seen such a warning, given that their judge solution computes to machine precision, they might have thought of tightening the tolerance. I myself have set problems that I intended to be solved with binary search, but students came back with a discrete/quantizing approach, and I realized that it passed because I all too readily specified a floating-point tolerance of 1e-4 and the answer data had several digits before the decimal point. Similarly, in problems like suspension bridges I ended up using float_absolute_tolerance just to break 32-bit float solutions (which then earned the problem a "be careful" warning on Steven Halim's problem list page, meaning it's not usually done).

Under this line of thinking, having a judge answer of 3.141592653589 while asking for only 1e-3 would also be flagged. Such a warning would force problem setters to think more when they create the judge data and, to the extent possible, minimize how many garbage decimal digits they add. (Conversely, it would require outputting 1.000000 instead of just 1 to express that the judge answer is given with an error of at most 1e-6.)
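A rough sketch of how the implied judge tolerance could be inferred from the printed decimals (hypothetical helpers, not current problemtools behavior; the flagging threshold is arbitrary):

    #include <cmath>
    #include <string>

    // Implied absolute uncertainty of a printed answer: half a unit in the last
    // decimal place, e.g. "3.141" -> 5e-4 and "3.141592653589" -> 5e-13.
    double implied_abs_jtol(const std::string& answer) {
        std::size_t dot = answer.find('.');
        std::size_t decimals = (dot == std::string::npos) ? 0 : answer.size() - dot - 1;
        return 0.5 * std::pow(10.0, -static_cast<double>(decimals));
    }

    // Flag answers that carry far more digits than the stated tolerance needs,
    // e.g. "3.141592653589" with float_tolerance 1e-3.
    bool has_excess_digits(const std::string& answer, double float_tolerance) {
        return implied_abs_jtol(answer) < 1e-4 * float_tolerance;
    }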

But perhaps this would make problem setting with problemtools unnecessarily harder (though it would be in line with requiring things like input validators that reject leading zeros; I am kidding, of course). But wouldn't this idea, or at least the general direction, appeal to your idealistic instincts?

godmar commented 5 years ago

Regarding Sheep Pen: how exactly is the binary search tolerance related to the problem's desired tolerance? In my solution, I binary search over the radius of the circle, the target function being the sum of the angles of the chords resulting from placing the sticks on the periphery of the circle. The answer, though, is the area of the polygon, which is on the order of pi * r^2. The binary search condition is expressed in terms of r_hi - r_lo, which you would have to relate to the error in pi * r^2.

As an aside, I haven't seen a problem on Kattis where computing to machine precision (making r_hi - r_lo as low as 64-bit doubles allow) would have timed out, even in Java. So my team's binary search default is this:

    // (Requires: import java.util.function.Function;)
    /**
     * Return x in [low, high] such that f(x) = y.
     * f() must be monotonic; as written, the update step assumes f is
     * decreasing on [low, high] (i.e. f(low) >= y >= f(high)); swap the
     * branches for an increasing f.
     * Also known as the bisection method.
     */
    static double binarySearch(Function<Double, Double> f, double low, double high, double y) {
        // stop when high-low approaches the floating point gap, except when
        // mid is around 0 to avoid excessive iterations.
        while ((high - low) > Math.max(1e-16, 10 * Math.ulp((low+high)/2))) {
            double mid = (low + high)/2.0;
            double midVal = f.apply(mid);

            if (midVal > y)
                low = mid;
            else
                high = mid;
        }
        return (low + high)/2.0;
    }