Quuxplusone / LLVMBugzillaTest


[analyzer] integral casts are not modeled correctly #50006

Open Quuxplusone opened 3 years ago

Quuxplusone commented 3 years ago
Bugzilla Link PR51036
Status CONFIRMED
Importance P normal
Reported by Balazs Benics (balazs.benics@sigmatechnology.se)
Reported on 2021-07-09 01:30:50 -0700
Last modified on 2021-07-23 07:52:56 -0700
Version 12.0
Hardware PC All
CC dcoughlin@apple.com, dpetrov@accesssoftek.com, llvm-bugs@lists.llvm.org, martongabesz@gmail.com, nok.raven@gmail.com, noqnoqneo@gmail.com, vsavchenko@apple.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
I've triaged a couple of false-positive reports and I observed something
interesting.
Check this example: https://godbolt.org/z/vb1Y5z6GE

  #include <cstdio>

  void clang_analyzer_printState() {}

  void test_single(signed char c) {
      if ((unsigned int)c <= 200) {
          // 'c' is supposed to be restricted to non-negative values.

          // The associated value range is not even close to the expected range.
          // 'c' is assumed to be within [-128,-56]
          // but it should be within [0,127].
          clang_analyzer_printState();

          // In fact, the execution of the program prints non-negative numbers.
          printf("%d\n", (int)c);
      }
  }

  #ifdef WITH_MAIN

  int main() {
    // All possible signed char values are tested.
    for (int i = -128; i <= 127; ++i) {
      test_single(i);
    }
  }

  #endif

In the example, we can see that the runtime behavior of the code differs from
the abstract machine that we symbolically model.
Consequently, any bug reports on that path that depend on the value of 'c' are
inherently false positives.

This can materialize as false positives from the ArrayBound V1 and V2 checkers,
but presumably any other checker could suffer from it as well:

  buf[(unsigned int)c]

At the indexing operation, the analyzer is sure that the 'index' is *negative*,
thus accessing an element prior to the first valid element of the array.
The user, meanwhile, sees that the index expression is of an *unsigned* type,
so their confusion is understandable.
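A minimal reproducer of that pattern might look like this (hypothetical code,
assuming a 128-element buffer; not taken from an actual report). The guard
proves 'c' is non-negative at runtime, yet with the current modeling the
analyzer keeps 'c' in [-128,-56] and flags the access as preceding the first
element:

```cpp
// Zero-initialized global buffer; every index in [0,127] is in bounds.
static int buf[128];

int read_elem(signed char c) {
    if ((unsigned int)c <= 200)       // true only for c in [0,127]
        return buf[(unsigned int)c];  // always in bounds at runtime
    return -1;
}
```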

AFAIK all released clang versions are affected by this behavior, according to
Compiler Explorer.

I can understand that 'omitting' symbolic integral casts was a /good/ idea
when we had a /dumb/ constraint solver that could only look up constraints for
symbolic expressions in a *verbatim* manner.
However, this is no longer true. The EQ classes and other witty stuff made our
solver *much* smarter.
And the /new/ non-fixpoint-iterative simplifier by Gabor Marton is just making
it even more so.
BTW it's fascinating to see where it goes.

I think the way forward would be to emit symbolic casts whenever necessary and
see through them in the solver.
For the record: Denys proposed several changes in this area, simplifying the
representation of widening and narrowing symbolic cast chains, but I think
that's a slightly different topic.
That wouldn't immediately arm the solver with the necessary capabilities to see
through the casts, AFAICT.
Quuxplusone commented 3 years ago
Constraint solver improvements *may* have improved the situation to the point
where enabling casts is actually going to do more good than harm. If so, it'd
be amazing and solve a lot of problems for us (including, but not limited to,
https://bugs.llvm.org/show_bug.cgi?id=44114). But I'll believe it when I see it.

It was pretty bad last time I tried and I don't think my specific concerns were
addressed yet. People are careless about integral types all the time and most
of the time it's harmless. Execution paths that involve overflows, truncations,
or sign extensions are very difficult to explain to a person unfamiliar with
all these rules and may seem like false positives unless a really good system
of notes is implemented; we may have to suppress such paths entirely. See also
my thoughts in https://reviews.llvm.org/D103096

P.S. Folks, I'm still shocked that you're trying to use the array bound checker
in production. It's terribly broken due to improper loop modeling. I'm fairly
confident that it'll only ever work as a taint-based checker (warn about under-
constrained tainted-symbol indexes) where concrete indexes introduced by loop
unrolling don't intervene.
Quuxplusone commented 3 years ago

> Constraint solver improvements may have improved the situation to the point where enabling casts is actually going to do more good than harm. If so, it'd be amazing and solve a lot of problems for us (including, but not limited to, https://bugs.llvm.org/show_bug.cgi?id=44114). But I'll believe it when I see it.

I thought memory modeling and the strict-aliasing madness are an orthogonal topic to evaluating casts according to the semantics of C/C++.

> It was pretty bad last time I tried and I don't think my specific concerns were addressed yet. People are careless about integral types all the time and most of the time it's harmless.

IDK.

> Execution paths that involve overflows, truncations, or sign extensions are very difficult to explain to a person unfamiliar with all these rules and may seem like false positives unless a really good system of notes is implemented;

Yes, but that's an engineering problem, and I agree it's a difficult one. Still, it doesn't justify modeling a completely infeasible execution path, like the one in the example in this ticket.

> we may have to suppress such paths entirely.

IDK.

> See also my thoughts in https://reviews.llvm.org/D103096

I cannot immediately see why we could not make a BRVisitor explaining the effect of the interesting casts.

> P.S. Folks, I'm still shocked that you're trying to use array bound checker in production. It's terribly broken due to improper loop modeling. I'm fairly confident that it'll only ever work as a taint-based checker (warn about under-constrained tainted-symbol indexes) where concrete indexes introduced by loop unrolling don't intervene.

Some clients are using the ArrayBounds[V2] checkers because they see value in them. They have proven useful, finding true positives. On the other hand, they also tend to produce reports that are hard to follow. What FP/TP rate is still acceptable is a matter of personal preference.

I think taint analysis is great and many other checkers should consider using it, but let's not digress.

What I really want to conclude is that on the path within the if block, we should not associate with 'c' a range that doesn't even intersect the expected range [0,127]. If it were a super-range of the expected range, say c: [-128, 127], it would be a correct (if imprecise) approximation, unlike the one we have right now.

Quuxplusone commented 3 years ago

> I thought memory modeling and the strict-aliasing madness are an orthogonal topic to evaluating casts according to the semantics of C/C++.

Modeling casts basically means achieving type-correctness, which is helpful in a lot of places. In that example, we can't know how many bytes the symbol binding overwrites if we don't know the type of the symbol. And this is one of the places where the type of the symbol is the only piece of type information available.

> This doesn't justify modeling a completely infeasible execution path, like in the example in this ticket.

A change that introduces 5-10 impossible-to-understand or outright false reports for every false positive it eliminates is not justified either.

Quuxplusone commented 3 years ago
> In that example we can't know how many bytes the symbol binding
> overwrites if we don't know the type of the symbol. And this is one of the
> places where the type of the symbol is the only piece of type information
> available.

Could you be slightly more specific? I could not follow.

From my perspective, we simply bind an incorrect symbolic value to the cast
expression when we evaluate it.
We should not model a cast expression simply by binding the value of the castee
to the cast expression, at least not in the general case.
Right now, we basically only look at the bit widths of the types, aside from
some truncation detection later:

  SVal SValBuilder::evalIntegralCast(ProgramStateRef state, SVal val,
                                     QualType castTy, QualType originalTy) {
    // No truncations if target type is big enough.
    if (getContext().getTypeSize(castTy) >= getContext().getTypeSize(originalTy))
      return evalCast(val, castTy, originalTy);

However, I think we should emit a symbolic cast if we are casting from signed
to unsigned.
In what specific way would emitting a symbolic cast there make the analysis
worse compared to the current situation?

> A change that introduces 5-10 impossible-to-understand or outright false
> reports for every false positive it eliminates is not justified either.

Could you please add some context?
Do you mean that the solver cannot see through symbolic casts, and thus we
would have more false-positive bug paths depending on values derived from
such symbolic casts?
Quuxplusone commented 3 years ago

Hey, guys. I've just updated the summary of my revision. You can see the description of the approach in detail at https://reviews.llvm.org/D103096

Talking about your snippet:

  void test_single(signed char c) {
      if ((unsigned int)c <= 200) {
          // 'c' supposed to be restricted to be non-negative.
          // but it should be within [0,127].
          ...
      }
  }

Here we reason about the range of the particular representation of 'c' as an unsigned int. The range of 'c' should be [-57, 127]. That's how the C++ rules work, and that's what my solution proposes.

Quuxplusone commented 3 years ago

(In reply to Denys Petrov from comment #5)

> Here we reason about the range of the particular representation of 'c' as an unsigned int. The range of 'c' should be [-57, 127]. That's how the C++ rules work, and that's what my solution proposes.

Oh, I miscalculated the range of 'c'. Of course it should be [0,127]. I've just tested the example with my solution (https://reviews.llvm.org/D103096) and it produces [0,127] as expected. I'll add this test case to the tests.

Quuxplusone commented 3 years ago

Good to hear, thanks for working on this topic.

Quuxplusone commented 3 years ago

Proposed a fix https://reviews.llvm.org/D103096