Open krader1961 opened 5 years ago
Here's an example of something that is more or less harmless but obviously wrong from a logic perspective:
if (np && np->nvalue.cp) np->nvalue.rp->help = tp->help;
That is from src/cmd/ksh93/bltins/typeset.c. Obviously if we're using the struct Ufunction *rp
union member we can't possibly have stored a char pointer in the union.
Another example from the same typeset.c module:
if (np->nvalue.ip && np->nvalue.rp->hoffset >= 0) {
Again, we can't have both a int *ip
and a struct Ufunction *rp
stored in the same union. This is truly awful.
Below is an example of the union Value
type mismatches that exist in the code. This happens when starting ksh interactively with HOME
set to an empty directory. The line numbers won't match your code since the changes to introduce these diagnostics have changed some line numbers. In this case the code asserted it was storing a const char*
but we used it as if it were a char*
.
Error: fetching value type "sp" @ ../src/cmd/ksh93/sh/name.c:3151 in nv_isnull()
Error: stored value type "cp" @ ../src/cmd/ksh93/sh/init.c:1922 in inittree()
``
The majority of the value type mismatches are between const char*
and char*
. Which is itself a problem. Albeit a minor, probably inconsequential, problem. But there are lots of other, more serious, value type mismatches. Such as confusing a struct pathcomp*
with a const char*
.
And another bug found as a result of addressing this issue. Function array_grow()
contains these lines:
struct index_array *ap;
size = (newsize - 1) * sizeof(struct Value *) + newsize;
ap = calloc(1, sizeof(*ap) + size);
Note that struct index_array
is defined thusly:
struct index_array {
Namarr_t namarr;
void *xp; // if set, subscripts will be converted
int cur; // index of current element
int last; // index of highest assigned element
int maxi; // maximum index for array
unsigned char *bits; // bit array for child subscripts
struct Value val[1]; // array of value holders
};
Where the final struct Value val[1]
used to be union Value val[1]
.
The bug is in this expression: (newsize - 1) * sizeof(struct Value *)
. Note the sizeof()
returns the size of a pointer. It should be the size of the structure. This is theoretically broken even when it was a union because there is no guarantee the size of union Value
is the same as the size of a pointer to the union. It could be larger. Even worse is that the original code is misleading since we're not allocating an array of pointers to a union Value
we're allocating an array of that union type.
Jeebus, but the quality of this code is more awful than I could have imagined.
TLDR: Every single time you see an explicit cast in the code you should assume something questionable is being done and that the code only works more or less by accident due to a combination of the C language spec (in this case about how objects are aligned) and typical CPU architectures (e.g., size of an int versus a pointer).
Another example of how union Value
has been misused in a manner broken by my change to turn it into a struct that is larger than the size of a single pointer.
Pay attention to the definition of vars up
and mp
which have different types. Then look at this block of code:
Notice the cast on line 263. The &mp
is subsequently dereferenced here:
This is invalid since my change to turn union Value
into a struct that looks similar to the following, highly abridged, definition causes the up->cp
to access memory after the stack storage for mp
:
struct Value {
enum value_type type;
union {
const char *cp;
} _val;
}
OMFG! The people who wrote the code in the array.c module should be forbidden from ever again writing code outside of academia. Another class of failures is due to this monstrosity:
This is another case where the code blindly assumes a pointer to a NULL pointer is equivalent to a pointer to a union Value
that has been set to NULL. And, for extra awfulness, repurposes an existing pointer even though its type (Namval_t *
in this instance) is unrelated to the type required by the caller.
I just finished a review of places that assign to a struct Value *
to see if there are any more places that assign arbitrary pointers to a struct Value *
from an unrelated pointer type that happens to be NULL. So the bugs I documented in my previous two comments appear to be the only places where that mistake occurs.
FWIW, There are now only two tests failing due to this change. Both involve the getopts
builtin; specifically, the value of the magic $OPTIND
shell var which is always zero when it should be non-zero. I'm investigating that issue.
I've fixed the cause of the getopts
related test failures. It involved my substituting a cast to Sfdouble_t
in nget_optindex()
with fetching the ldp
member. The debugging logic this change introduces made that easy to find because it told me the value was stored to the lp
member of the struct/union but was being fetched from the ldp
member.
There is now just one test failing in the alias/shcomp
unit test:
<E> alias[144]: Unaliasing undefined alias should exit with non-zero status
Note that the alias
unit test is passing. It's only the alias/shcomp
unit test that is failing. Which is surprising since all the other failures introduced by this change have affected both the shcomp and non-shcomp test variants.
Strange: only the alias
shcomp variant is failing consistently on Travis now and that was true a few minutes ago on my macOS server. But after removing the build directory and starting afresh the alias
non-shcomp variant is failing consistently on my macOS server. Which suggests the problem is related to the layout of memory; possibly a use-after-free or similar bug.
And now meson test alias
is not failing on my primary server. But it occurred to me to look at the test log which does contain this:
no_such_alias: alias not found
/var/folders/6t/t624jkmx7cbb9t35vzvjwj180000gn/T/ksh.alias.XXXXXX.E0lEIiQW/alias.sh[260]:
log_error: line 61: Unaliasing undefined alias should exit with non-zero status < 0 ?
Unaliasing undefined alias should exit with non-zero status :
Unaliasing undefined alias should exit with non-zero status - 116 : arithmetic syntax error
So clearly it is failing. The fact that the alias/shcomp
variant fails consistently in the expected manner is pretty strong evidence that this is due to a bug involving either an incorrectly sized allocation, a use-after-free but, or something similar. This superficially looks like an "off by one" bug involving a dynamically allocated buffer.
I now see why the remaining problem is, and is not, obvious when executing the non-shcomp alias
test. The failure mode in my previous comment happens when the src/cmd/ksh93/tests/alias.sh unit test is not modified. The final test in that unit test fails in an obvious manner when everything up to the final three lines is removed. Which is why a day ago I noted that the non-shcomp alias
test failed on my server. That was because I had removed everything but the final three lines from that unit test file. If you don't do that you get the undetected failure documented in my previous comment.
This is a very strong signal that there is an off-by-one bug somewhere in the code.
This is a very strong signal that there is an off-by-one bug somewhere in the code.
Which suggests there is someplace that assumes the size of union Value
is equal to the size of a pointer which I haven't already fixed.
I'm a sad puppy. It turns out the remaining test failure related to the first set of changes to address this issue is due to my doing a stupid merging of two if (...)
blocks in the unall()
function.
TL;DR: There are probably more bugs because some union Value
members are used when fetching a value but not storing it and vice-versa.
With the changes to use macros (e.g., FETCH_VT()
and STORE_VT()
) to manipulate struct Value
objects it becomes easier to see how often each value type is used. For example, here are the counts of the stores to each of the given value types:
$ ggrep -ohP 'STORE_VTP?\(\w+(\(.+?\))?[^,]*, (\w+)[,)]' src/**.c | sed -e 's/.*, \(.*\)[,)]/\1/' | sort | uniq -c
2 bfp
74 const_cp
8 cp
2 dp
1 fp
1 funp
1 i
12 i16
11 ip
9 ldp
2 llp
5 lp
5 np
9 nrp
1 pathcomp
1 pidp
2 rp
1 uidp
6 up
1 vp
And here are the counts of the fetches of each value type:
$ ggrep -ohP 'FETCH_VTP?\(\w+(\(.+?\))?[^,]*, (\w+)[,)]' src/**.c | sed -e 's/.*, \(.*\)[,)]/\1/' | sort | uniq -c
13 bfp
147 const_cp
34 cp
10 dp
3 fp
1 i
6 i16
7 i16p
18 ip
7 ldp
7 llp
15 lp
11 np
25 nrp
2 pathcomp
49 rp
6 up
It is expected that the counts are not identical for each type. But what is not expected is that there are stores to some types that are not fetched and vice-versa. In some cases the value type is simply unused since the type is neither stored nor fetched; e.g., char c
. But in some other cases there is obviously a potential bug. For example, there is a place that stores to the pid_t *pidp;
member but it isn't fetched. Also funp
, uidp
and vp
.
Another likely bug is that there is a fetch from int16_t *i16p
but no store to that member.
More potential bugs lie in statements like this from src/cmd/ksh93/sh/name.c:
ll = *(Sfulong_t *)FETCH_VTP(up, llp);
Note that llp
is a int64_t*
but there is no guarantee that Sfulong_t*
is the same type. In fact, given that one is signed and the other unsigned it is guaranteed that this might mask bugs.
As I've mentioned many times any place you see an explicit cast you should assume something dangerous, and potentially incorrect, is being done. Even if the the authors of this code were certain that the corresponding stores involved a Sfulong_t*
that were cast to a int64_t*
this pattern is error prone and makes it really hard to verify that we're fetching the same type that was stored.
As if we need another example why the use of the Value
union is so dangerous... I was looking into why the pid_t
member is stored to in two places but isn't fetched anywhere. The answer is that it is fetched by this statement in nv_getval()
:
ll = *FETCH_VTP(up, i32p);
Notice that it is assuming a pid_t
is always equivalent to int32_t
. Which is almost certain to be true on any modern 32 or 64 bit int machine. But it isn't required to be true and is an example of the carelessness with which much of the name/value code was written.
Cppcheck has been issuing the following warnings from day one:
But that is only the most obvious problem with how
union Value
is used. Simply changing it from a union to struct breaks ksh so badly you can't even start an interactive shell with no dot files. So clearly there are multiple places that are retrieving a value using a member different than was used to store the value.I'm going to remediate this state of affairs by turning it into a struct with a hidden union. Then introduce a setter function that will record which union member was used for the store and a getter function that validates the same union member is used for the fetch.