att / ast

AST - AT&T Software Technology
Eclipse Public License 1.0
560 stars 152 forks source link

`union Value` causes bugs #1150

Open krader1961 opened 5 years ago

krader1961 commented 5 years ago

Cppcheck has been issuing the following warnings from day one:

[src/cmd/ksh93/sh/streval.c:348] portability (invalidPointerCast):
 Casting between integer* and long double* which have an incompatible binary data representation.
[src/cmd/ksh93/sh/streval.c:944] portability (invalidPointerCast):
 Casting between integer* and long double* which have an incompatible binary data representation.

But that is only the most obvious problem with how union Value is used. Simply changing it from a union to struct breaks ksh so badly you can't even start an interactive shell with no dot files. So clearly there are multiple places that are retrieving a value using a member different than was used to store the value.

I'm going to remediate this state of affairs by turning it into a struct with a hidden union. Then introduce a setter function that will record which union member was used for the store and a getter function that validates the same union member is used for the fetch.

krader1961 commented 5 years ago

Here's an example of something that is more or less harmless but obviously wrong from a logic perspective:

if (np && np->nvalue.cp) np->nvalue.rp->help = tp->help;

That is from src/cmd/ksh93/bltins/typeset.c. Obviously if we're using the struct Ufunction *rp union member we can't possibly have stored a char pointer in the union.

krader1961 commented 5 years ago

Another example from the same typeset.c module:

if (np->nvalue.ip && np->nvalue.rp->hoffset >= 0) {

Again, we can't have both a int *ip and a struct Ufunction *rp stored in the same union. This is truly awful.

krader1961 commented 5 years ago

Below is an example of the union Value type mismatches that exist in the code. This happens when starting ksh interactively with HOME set to an empty directory. The line numbers won't match your code since the changes to introduce these diagnostics have changed some line numbers. In this case the code asserted it was storing a const char* but we used it as if it were a char*.


Error: fetching value type "sp" @ ../src/cmd/ksh93/sh/name.c:3151 in nv_isnull()
Error: stored value type   "cp" @ ../src/cmd/ksh93/sh/init.c:1922 in inittree()
``
krader1961 commented 5 years ago

The majority of the value type mismatches are between const char* and char*. Which is itself a problem. Albeit a minor, probably inconsequential, problem. But there are lots of other, more serious, value type mismatches. Such as confusing a struct pathcomp* with a const char*.

krader1961 commented 5 years ago

And another bug found as a result of addressing this issue. Function array_grow() contains these lines:

struct index_array *ap;
size = (newsize - 1) * sizeof(struct Value *) + newsize;
ap = calloc(1, sizeof(*ap) + size);

Note that struct index_array is defined thusly:

struct index_array {
    Namarr_t namarr;
    void *xp;             // if set, subscripts will be converted
    int cur;              // index of current element
    int last;             // index of highest assigned element
    int maxi;             // maximum index for array
    unsigned char *bits;  // bit array for child subscripts
    struct Value val[1];  // array of value holders
};

Where the final struct Value val[1] used to be union Value val[1].

The bug is in this expression: (newsize - 1) * sizeof(struct Value *). Note the sizeof() returns the size of a pointer. It should be the size of the structure. This is theoretically broken even when it was a union because there is no guarantee the size of union Value is the same as the size of a pointer to the union. It could be larger. Even worse is that the original code is misleading since we're not allocating an array of pointers to a union Value we're allocating an array of that union type.

Jeebus, but the quality of this code is more awful than I could have imagined.

krader1961 commented 5 years ago

TLDR: Every single time you see an explicit cast in the code you should assume something questionable is being done and that the code only works more or less by accident due to a combination of the C language spec (in this case about how objects are aligned) and typical CPU architectures (e.g., size of an int versus a pointer).

Another example of how union Value has been misused in a manner broken by my change to turn it into a struct that is larger than the size of a single pointer.

https://github.com/att/ast/blob/b56dd2aed1ba11a6bbc1c52269d4e8cc7fda7c89/src/cmd/ksh93/sh/array.c#L232-L236

Pay attention to the definition of vars up and mp which have different types. Then look at this block of code:

https://github.com/att/ast/blob/b56dd2aed1ba11a6bbc1c52269d4e8cc7fda7c89/src/cmd/ksh93/sh/array.c#L261-L268

Notice the cast on line 263. The &mp is subsequently dereferenced here:

https://github.com/att/ast/blob/b56dd2aed1ba11a6bbc1c52269d4e8cc7fda7c89/src/cmd/ksh93/sh/array.c#L316

This is invalid since my change to turn union Value into a struct that looks similar to the following, highly abridged, definition causes the up->cp to access memory after the stack storage for mp:

struct Value {
    enum value_type type;
    union {
        const char *cp;
    } _val;
}
krader1961 commented 5 years ago

OMFG! The people who wrote the code in the array.c module should be forbidden from ever again writing code outside of academia. Another class of failures is due to this monstrosity:

https://github.com/att/ast/blob/917cec4cbf7e1c60e337d1b76a700cc240d42d64/src/cmd/ksh93/sh/array.c#L1334-L1335

This is another case where the code blindly assumes a pointer to a NULL pointer is equivalent to a pointer to a union Value that has been set to NULL. And, for extra awfulness, repurposes an existing pointer even though its type (Namval_t * in this instance) is unrelated to the type required by the caller.

krader1961 commented 5 years ago

I just finished a review of places that assign to a struct Value *to see if there are any more places that assign arbitrary pointers to a struct Value * from an unrelated pointer type that happens to be NULL. So the bugs I documented in my previous two comments appear to be the only places where that mistake occurs.

FWIW, There are now only two tests failing due to this change. Both involve the getopts builtin; specifically, the value of the magic $OPTIND shell var which is always zero when it should be non-zero. I'm investigating that issue.

krader1961 commented 5 years ago

I've fixed the cause of the getopts related test failures. It involved my substituting a cast to Sfdouble_t in nget_optindex() with fetching the ldp member. The debugging logic this change introduces made that easy to find because it told me the value was stored to the lp member of the struct/union but was being fetched from the ldp member.

There is now just one test failing in the alias/shcomp unit test:

<E> alias[144]: Unaliasing undefined alias should exit with non-zero status

Note that the alias unit test is passing. It's only the alias/shcomp unit test that is failing. Which is surprising since all the other failures introduced by this change have affected both the shcomp and non-shcomp test variants.

krader1961 commented 5 years ago

Strange: only the alias shcomp variant is failing consistently on Travis now and that was true a few minutes ago on my macOS server. But after removing the build directory and starting afresh the alias non-shcomp variant is failing consistently on my macOS server. Which suggests the problem is related to the layout of memory; possibly a use-after-free or similar bug.

krader1961 commented 5 years ago

And now meson test alias is not failing on my primary server. But it occurred to me to look at the test log which does contain this:

no_such_alias: alias not found
/var/folders/6t/t624jkmx7cbb9t35vzvjwj180000gn/T/ksh.alias.XXXXXX.E0lEIiQW/alias.sh[260]:
 log_error: line 61:  Unaliasing undefined alias should exit with non-zero status < 0 ? 
 Unaliasing undefined alias should exit with non-zero status :
 Unaliasing undefined alias should exit with non-zero status - 116 : arithmetic syntax error

So clearly it is failing. The fact that the alias/shcomp variant fails consistently in the expected manner is pretty strong evidence that this is due to a bug involving either an incorrectly sized allocation, a use-after-free but, or something similar. This superficially looks like an "off by one" bug involving a dynamically allocated buffer.

krader1961 commented 5 years ago

I now see why the remaining problem is, and is not, obvious when executing the non-shcomp alias test. The failure mode in my previous comment happens when the src/cmd/ksh93/tests/alias.sh unit test is not modified. The final test in that unit test fails in an obvious manner when everything up to the final three lines is removed. Which is why a day ago I noted that the non-shcomp alias test failed on my server. That was because I had removed everything but the final three lines from that unit test file. If you don't do that you get the undetected failure documented in my previous comment.

This is a very strong signal that there is an off-by-one bug somewhere in the code.

krader1961 commented 5 years ago

This is a very strong signal that there is an off-by-one bug somewhere in the code.

Which suggests there is someplace that assumes the size of union Value is equal to the size of a pointer which I haven't already fixed.

krader1961 commented 5 years ago

I'm a sad puppy. It turns out the remaining test failure related to the first set of changes to address this issue is due to my doing a stupid merging of two if (...) blocks in the unall() function.

krader1961 commented 5 years ago

TL;DR: There are probably more bugs because some union Value members are used when fetching a value but not storing it and vice-versa.

With the changes to use macros (e.g., FETCH_VT() and STORE_VT()) to manipulate struct Value objects it becomes easier to see how often each value type is used. For example, here are the counts of the stores to each of the given value types:

$ ggrep -ohP 'STORE_VTP?\(\w+(\(.+?\))?[^,]*, (\w+)[,)]' src/**.c | sed -e 's/.*, \(.*\)[,)]/\1/' | sort | uniq -c
   2 bfp
  74 const_cp
   8 cp
   2 dp
   1 fp
   1 funp
   1 i
  12 i16
  11 ip
   9 ldp
   2 llp
   5 lp
   5 np
   9 nrp
   1 pathcomp
   1 pidp
   2 rp
   1 uidp
   6 up
   1 vp

And here are the counts of the fetches of each value type:

$ ggrep -ohP 'FETCH_VTP?\(\w+(\(.+?\))?[^,]*, (\w+)[,)]' src/**.c | sed -e 's/.*, \(.*\)[,)]/\1/' | sort | uniq -c
  13 bfp
 147 const_cp
  34 cp
  10 dp
   3 fp
   1 i
   6 i16
   7 i16p
  18 ip
   7 ldp
   7 llp
  15 lp
  11 np
  25 nrp
   2 pathcomp
  49 rp
   6 up

It is expected that the counts are not identical for each type. But what is not expected is that there are stores to some types that are not fetched and vice-versa. In some cases the value type is simply unused since the type is neither stored nor fetched; e.g., char c. But in some other cases there is obviously a potential bug. For example, there is a place that stores to the pid_t *pidp; member but it isn't fetched. Also funp, uidp and vp.

krader1961 commented 5 years ago

Another likely bug is that there is a fetch from int16_t *i16p but no store to that member.

krader1961 commented 5 years ago

More potential bugs lie in statements like this from src/cmd/ksh93/sh/name.c:

ll = *(Sfulong_t *)FETCH_VTP(up, llp);

Note that llp is a int64_t* but there is no guarantee that Sfulong_t* is the same type. In fact, given that one is signed and the other unsigned it is guaranteed that this might mask bugs.

As I've mentioned many times any place you see an explicit cast you should assume something dangerous, and potentially incorrect, is being done. Even if the the authors of this code were certain that the corresponding stores involved a Sfulong_t* that were cast to a int64_t* this pattern is error prone and makes it really hard to verify that we're fetching the same type that was stored.

krader1961 commented 5 years ago

As if we need another example why the use of the Value union is so dangerous... I was looking into why the pid_t member is stored to in two places but isn't fetched anywhere. The answer is that it is fetched by this statement in nv_getval():

ll = *FETCH_VTP(up, i32p);

Notice that it is assuming a pid_t is always equivalent to int32_t. Which is almost certain to be true on any modern 32 or 64 bit int machine. But it isn't required to be true and is an example of the carelessness with which much of the name/value code was written.