chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Spike: Diagnose memory error with `bigint` ranges #11278

Closed. bradcray closed this issue 5 years ago.

bradcray commented 6 years ago

Issue #9865 is blocked because of a valgrind failure, in spite of performing operations that seem to work with other scalar and record types. This makes me worried that we have a memory error in our bigint implementation (and, indeed, running my tests under #9865 results in valgrind failures). This issue asks for those memory issues to be diagnosed and resolved.

mppf commented 6 years ago

@bradcray - maybe this is just a TODO for you, but if you link to your bigint range branch, somebody else could investigate.

bradcray commented 6 years ago

The branch is here:

https://github.com/bradcray/chapel/tree/bigint-ranges

And I believe it should currently be in a state where it hits a valgrind error if this test is compiled and run:

https://github.com/bradcray/chapel/blob/bigint-ranges/test/types/range/userAPI/bigintRangeAPI.chpl

I'd hoped to isolate this into a standalone failure that didn't require my branch, which is why I didn't point to it previously.

bradcray commented 5 years ago

After investigation, I'm increasingly convinced that the source of the problem is the range type rather than the bigint type, and that it probably relates to treating ranges as a POD type (which they wouldn't be if the bounds were bigints).

Here's a minimal test that shows valgrind errors on this branch for anyone who wants to follow along at home:

use BigInteger;

var lo: bigint = 1;
var hi: bigint = 10;

var r = lo..hi;
writeln(r);   // this first writeln() itself ends up freeing the bigint bounds
writeln(r);   // ...so this second writeln() reads the freed memory under valgrind

mppf commented 5 years ago

@bradcray - probably obvious, but have you looked at removing pragma "plain old data" from the record range declaration in ChapelRange.chpl?
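For anyone following along, the shape of the declaration I mean looks roughly like this (a simplified stand-in with made-up names, not the actual ChapelRange.chpl source, and not something you'd normally write in user code):

// Simplified stand-in for the real record range declaration in
// ChapelRange.chpl; only the pragma placement matters here.
pragma "plain old data"   // the pragma the experiment would remove
record simpleRange {
  var low: int;
  var high: int;
}

var r = new simpleRange(1, 10);
writeln(r.low, "..", r.high);   // prints 1..10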

bradcray commented 5 years ago

Yeah, that's the experiment I described in my email to you last night; it didn't work, which made me wonder whether you're aware of other assumptions built into the compiler that ranges are POD. My next step will be to look for places that specialize on FLAG_RANGE to see whether any of them qualify.

mppf commented 5 years ago

There was a while when ranges were in-intent-by-default for tasks but not for function arguments...
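Roughly the distinction I mean, written here with explicit intents (a sketch only; it doesn't show what the defaults actually were at the time):

var r = 1..10;

// a procedure formal taking the range by (const) reference, explicitly
proc show(const ref x) { writeln(x); }

// a task construct capturing the range by value (in intent), explicitly;
// each task reads its own copy of r
coforall tid in 1..2 with (in r) {
  writeln(tid, ": ", r);
}

show(r);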

bradcray commented 5 years ago

It's not obvious to me how different intents for ranges would cause their bounds to be freed prematurely...?

mppf commented 5 years ago

The compiler might take shortcuts that are no longer valid, to implement special cases that were there for another reason (e.g. the different intents).

bradcray commented 5 years ago

I now understand why the valgrind error is happening: well enough to consider this diagnosis spike complete, though not well enough to fix it, since the fix has both coding and potentially semantic implications.

As mentioned above, for a simple test on my bigint-ranges branch like the following:

use BigInteger;

var lo: bigint = 1;
var hi: bigint = 10;

var myR = lo..hi;
writeln(myR);
writeln(myR);

we get a use-after-free error: the bigint memory is freed by the first writeln() call itself. However, if I write the equivalent program using a homemade record that approximates a range by storing two bigints and providing similar overloads for writeThis() and such, it works:

use BigInteger, MyRange;

var lo: bigint = 1;
var hi: bigint = 10;

var myR = buildR(lo, hi);
writeln(myR);
writeln(myR);
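The stand-in module is roughly of this shape (sketched from memory rather than copied from the branch, so details may differ):

module MyRange {
  use BigInteger;

  // approximates a range by storing two bigint bounds
  record myRange {
    var lo: bigint;
    var hi: bigint;

    // print the way a range would, e.g. "1..10"
    proc writeThis(f) throws {
      f.write(lo, "..", hi);
    }
  }

  proc buildR(lo: bigint, hi: bigint) {
    return new myRange(lo, hi);
  }
}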

What's happening in the "real range" case is the following: due to the way ranges are handled within the compiler, the access of the range tuple element into a temporary variable is shallow; that shallow temporary is then auto-destroyed at the end of its scope, which frees the bigint bounds out from under the original range and produces the use-after-free on the second writeln().

Meanwhile, for my working version using the hand-coded range, the temporary is not auto-destroyed (and also, when we grab the value out of the tuple, we're using PRIM_GET_MEMBER_VALUE rather than PRIM_GET_MEMBER as in the range case).

This suggests to me that the fix would be either to deep-copy the range out of the tuple when storing it into a temp, or to not auto-destroy it at the end of the scope. My guess is that there is some logic that is incorrect w.r.t. the intents of ranges within tuples and/or assumptions that ranges are POD (which the compiler does make, but disabling as many of those cases as I've found hasn't seemed to make a difference). Michael seemed to run into similar-but-different issues with ranges in tuples here, which led to ranges getting const in intent by default.

This whole exploration calls into question for me whether we made the right choice in making ranges const in intent by default (as opposed to const ref, or potentially "the compiler can do what it wants"). While const in falls somewhere between making sense and seeming harmless for basic integer range types, it could get quite expensive for bigint ranges (or for other range types, if we permit users to create their own rangeable types over time) if, say, printing them out requires deep copies of their bigint bounds.

That might suggest treating ranges less specially and more like normal records (where we've had similar questions in the past about how/when the compiler could copy rather than refer, and/or how a user could opt into const in vs. const ref), or having their intent depend on their index type, or ...something. It would be a good example of eating our own dogfood if the compiler didn't need any code that differentiated between ranges and similar user-authored records, and ranges worked as well as my home-grown range record does today (without hurting performance). Put another way, it feels like another instance where special cases are hurting us more than helping?

I'll be curious what @mppf thinks about all this since he made the changes to make ranges const in and has done most of the recent work related to cleaning up record memory management and tuple semantics.

mppf commented 5 years ago

@bradcray - would you consider allowing bigint ranges to have const ref intent while other ranges have const in?

Part of the goal of const in was to reduce the potential for race conditions. I also think there might be cases where we can better avoid communication / apply RVF because ranges are const in.
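As a small sketch of the kind of race const in sidesteps (illustrative only):

var r = 1..10;

// with a by-value (const in) task intent, the task gets its own copy of r,
// so the later reassignment of the outer r can't race with the task's read
begin with (const in r) {
  writeln(r);
}
r = 1..20;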

I'd lean towards leaving ranges as const in and fixing what seems to be a compiler bug.

Due to the way ranges are handled within the compiler, the access of the range tuple element into a temporary variable is shallow;

That sounds like the main problem - or very close to it - to me.

bradcray commented 5 years ago

@mppf:

I'd lean towards leaving ranges as const in and fixing what seems to be a compiler bug.

This is where I'm thinking we should start as well, and then worry about bigint ranges being too expensive when we hit that point. Is it clear to you, though, whether the compiler bug is "the range should've been deep-copied out of the tuple to begin with" or "the shallow range temp reference should not be autodestroyed?" From your second comment:

the access of the range tuple element into a temporary variable is shallow;

it sounds like you think it's the former?

I've also been trying to figure out how to make a standalone bug out of this that doesn't depend on my branch, but it's complicated by the fact that ranges only support POD types today and POD types aren't sensitive to this behavior.

would you consider allowing bigint ranges to have const ref intent while other ranges have const in?

Eventually, potentially yes. Though I think the ideal would be that we, as the range authors, could express this in module code (as an end-user would or could) rather than through specializations on range.idxType in the compiler...

But again, I'm comfortable crossing this bridge once it becomes a problem rather than worrying about it from the outset. It was just interesting in that it made const in less obviously correct for all ranges (where that seemed less disputable when we only had a small, finite set of possible range types).

mppf commented 5 years ago

Is it clear to you, though, whether the compiler bug is "the range should've been deep-copied out of the tuple to begin with" or "the shallow range temp reference should not be autodestroyed?"

There are two options for correctly managing record memory:

  1. A given variable referring to a record is a deep copy (init copy / copy initialize / etc)
  2. A given variable referring to a record is a reference

Historically, the compiler would sometimes instead do:

  3. A given variable referring to a record is a shallow copy (that hopefully isn't auto-destroyed).

I consider all occurrences of (3) to be a bug. A given piece of code can choose on a case-by-case basis whether to do (1) or (2). In particular I'd expect the compiled pattern that accesses the record field to get a reference to the field rather than its value. This might amount to the difference between PRIM_GET_MEMBER and PRIM_GET_MEMBER_VALUE.
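As a loose source-level analogy for these cases (the compiler actually works in terms of primitives rather than this syntax, and the record/field names here are made up):

use BigInteger;

record wrapper { var b: bigint; }

var w = new wrapper(new bigint(10));

var byValue = w.b;   // (1) a deep copy of the field; safe to auto-destroy
ref  byRef  = w.b;   // (2) a reference to the field; nothing extra to destroy
// (3) would be a bit-for-bit shallow copy of w.b that later gets
//     auto-destroyed, freeing GMP state that w.b still relies on
writeln(byValue, " ", byRef);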