Request: Allow deferred initialization of a variable

chapel-lang / chapel

a Productive Parallel Programming Language

https://chapel-lang.org


Request: Allow deferred initialization of a variable #14271

Closed BryantLam closed 4 years ago

BryantLam commented 5 years ago

Related #14104

Feature Request. Allow deferred initialization of a variable.

As of 1.20, I cannot delay the initialization of a variable. I propose that we can delay/defer the initialization of any variable in order to allow easier initialization of a non-nilable class type, especially if the initialization criteria are complicated. Here's some pseudo-code:

/* Case 1 */ {
  var x: owned MyClass = defer;
  x = new owned MyClass();
}

/* Case 2 */ {
  var x;
  x = new owned MyClass();
}

/* Case 3 */ {
  var x;

  if someCondition {
    x = new owned MyClass(true);
  } else {
    x = new owned MyClass(false);
  }
}

/* Case 4 */ {
  var x: owned MyClass = defer;

  if someCondition {
    x = new owned MyClass(true);
  }
  x = new owned MyClass(false);
}

For a non-nilable class type, the initialization has to occur before it can be used. In Case 4, inserting a use of x before it is initialized is illegal. The lifetime of a deferred variable would not start until it was initialized.

There might be a smarter way to do it without having to = defer on a variable with known type, but I can imagine how it could be a sticky situation when users expect the default initialization to occur (deferring initialization wouldn't activate the default initialization).

lydia-duncan commented 5 years ago

Is #8792 along the lines of what you were thinking of?

mppf commented 5 years ago

We have talked before about allowing

  var x: owned MyClass;
  ... code not using x ...
  x = new owned MyClass();

However, this arguably makes it harder to distinguish initialization from assignment.

bradcray commented 5 years ago

One other potential concept here would be to require the deferred initialization to use init= to set the variable so that the initializing assignment is clearly called out. This probably isn't necessary for the compiler, but could help the reader/writer of the code comprehend it and/or head off dumb mistakes by requiring the user and compiler to agree on what was being written.

bradcray commented 5 years ago

Not explicitly mentioned in the original issue, but this deferred initialization could potentially also be applied to declarations other than non-nilable ones that require an initializer, such as param and const declarations.

A question in all of these cases is whether a (concrete) type would have to be specified or whether it could be inferred from the first initialization. My current preference is to require the type, both because it's a more conservative starting position that we can relax later on, and because of fear of creating tangled webs of initialization, like:

var x = defer;
var y: x.type;
var z: int(y.numBits());
x = 3.4;

mppf commented 5 years ago

Just to be clear, my current preference is to support cases 2 and 3 from the above (i.e. I don't think it's necessary to mark the un-initialized variables with defer or noinit and I don't think it's necessary to mark the initialization with init=). I think adding special syntax for these cases will present 2 problems: 1) it requires people to learn the special syntax to use the feature (so it will not be available to beginners, for example) and 2) it makes it harder to update the code (in particular a user would have to track the first assignment and change it to init= manually). Note that for class/record init functions, we considered marking the initialization statements with e.g. init=, but opted to simply use =.

A question in all of these cases is whether a (concrete) type would have to be specified or whether it could be inferred from the first initialization. My current preference is to require the type, both because it's a more conservative starting position that we can relax later on, and because of fear of creating tangled webs of initialization, like:
var x = defer;
var y: x.type;
var z: int(y.numBits());
x = 3.4;

I view this as a minor problem (the problem being only that it makes the implementation slightly more complex). In this case, the compiler would issue an error upon x.type because x is used before it is initialized (and before its type is known). As far as I know, such behavior will detect any backwards-leading inference edges and other than that, the written order will be used to determine types and do initialization.
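A minimal sketch of the ordering that this rule would accept, assuming the implicit variant (plain var x; with no = defer) and a compiler that implements it. The point is only that x.type is used after the first initialization of x, so there is no backwards-leading inference edge:

```chapel
proc main() {
  var x;            // no type, no initializer: initialization is deferred
  x = 3.4;          // first assignment initializes x; its type is inferred as real
  var y: x.type;    // allowed here because x is already initialized
  writeln(y.type:string);
}
```

Moving the declaration of y above x = 3.4 is the case that would be reported as a use before initialization.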

cassella commented 5 years ago

If you don't need to give the variable a type or a = defer, does that mean you'd write just var x;? Could that be initialized to any type (as long as it's the same type in all initialization branches), or only a non-nilable class? (If it's initialized to 7 in one branch and not initialized in others, would the other branches default-initialize it to 0?) (Could it be initialized to different types in different branches of a param conditional?)

Could the initialization be performed in a function taking the variable by ref or out, or say a new init intent? (I could see wanting to have a helper function to initialize a set of related variables consistently in multiple callsites proc setup_coords(init x, init y, init z))

Would this functionality be available for non-nilable class members of records or classes? That is, could they be left uninitialized after this.complete(), so long as they end up initialized-before-use before the end of the initializer? (Though presumably the reason to do that is to allow a method to do the initialization, which sounds harder for the compiler to reason about.) If the declaration is typeless there, would it make the object generic?

mppf commented 4 years ago

If you don't need to give the variable a type or a = defer, does that mean you'd write just var x;?

yes

Could that be initialized to any type (as long as it's the same type in all initialization branches), or only a non-nilable class?

any type

(If it's initialized to 7 in one branch and not initialized in others, would the other branches default-initialize it to 0?)

No, that case would simply be an error, at least to start

(Could it be initialized to different types in different branches of a param conditional?)

I think so.

Could the initialization be performed in a function taking the variable by ref or out, or say a new init intent? (I could see wanting to have a helper function to initialize a set of related variables consistently in multiple callsites proc setup_coords(init x, init y, init z))

There is already an out intent. We could use that to allow the initialization. However, the type must be established before the function can be called with an out intent.

Note that the idea of deferred initialization allows 2 things:

1. supplying the initialization expression later
2. supplying the type later.

The out intent can only help with the 1st of these.
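A small sketch of that distinction using the existing out intent. The helper name setupCoords and its values are hypothetical, chosen only for illustration; the declared types come first, and only the values arrive later through the call:

```chapel
// Hypothetical helper: sets three related variables through out formals.
proc setupCoords(out x: real, out y: real, out z: real) {
  x = 1.0;
  y = 2.0;
  z = 3.0;
}

proc main() {
  var x, y, z: real;      // types are established before the call
  setupCoords(x, y, z);   // values are supplied through the out formals
  writeln(x, " ", y, " ", z);
}
```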

Would this functionality be available for non-nilable class members of records or classes? That is, could they be left uninitialized after this.complete(), so long as they end up initialized-before-use before the end of the initializer? (Though presumably the reason to do that is to allow a method to do the initialization, which sounds harder for the compiler to reason about.) If the declaration is typeless there, would it make the object generic?

While we can consider future extensions, I don't think we're planning to change initializers at all in this proposal. Note though, for that idea long-term, that this.complete() exists so that there is a point at which the object is fully initialized, so that method calls (e.g.) can only access initialized data. Allowing initialization of a field after this.complete() would break this property (which initializers were designed to preserve).

mppf commented 4 years ago

In starting to implement this, I am wondering about the following example

{
  var x: MyRecord;
  x = new MyRecord(...);
}

According to the current language definition, this is default initialization and then assignment. However, when we add deferred-initialization, I think it would be more consistent to consider it to be a case of deferred-initialization, so that it behaves the same as if the type is left out:

{
  var x;
  x = new MyRecord(...); // no '=' call here, this is initialization
}

Put another way, I don't want adding types to variables to make a program slower.

Note though that I have been advocating for var x; to be available as opposed to var x = defer;. If we choose to always require the deferred initialization to be decorated with = defer, both cases above would use = defer to opt in and I'd suppose that we wouldn't change the existing pattern.

I like that adjusting the language in this way will allow such programs to be more optimal (since the initialization pattern is cheaper than assignment).

mppf commented 4 years ago

Here is an example that might be interesting for the choice of whether or not to require = defer or = noinit:

proc g() {
  var A: [1..10] int;
  return A;
}
proc f() {
  var B: [1..10] int;
  on Locales[1] {
    B = g(); // is this deferred initialization?
  }
  return B;
}
f();

If deferred initialization is used here, then the array B might actually be stored on Locales[1] rather than Locales[0]. Would that be confusing? Or is it likely to be irrelevant? Or even intended?

The above pattern comes up in the standard modules a lot as a workaround for not being able to return from an on statement. Issue #13587 discusses whether returning from an on statement should be allowed and points out that one might want to return a non-nilable class type from an on statement. Arguably deferred initialization is an alternative solution there - but we need to decide if it is an "opt-in feature" or "just part of the language".

bradcray commented 4 years ago

I don't think deferring an array's initialization should cause the array to be allocated somewhere else. Put another way, I think an array's "home locale" should be the one upon which the array was declared, not the one where its elements were initialized. If the array is local (not distributed), that suggests that the array's elements should be stored on the locale on which it was declared. Recall that in the noinit feature, we discussed a phase of initialization that would be required in order to make the variable a legal instance of that data structure and another phase of initialization that would set it to the default value, where noinit would cause the first phase to run, but not the second. I'd imagine that same rule to apply to deferred initialization (i.e., the array would be allocated and its descriptor set up at declaration time, particularly if an array type were specified; but its elements would not be zeroed out / assigned).

If it were me, I'd probably take the approach of requiring a = defer[red]; initialization initially, and look into relaxing it later rather than jumping right into an automatic inference of when the user might have wanted deferred initialization. That supports the desired feature and patterns without requiring the compiler to be as clever about it out of the gate (if ever) and lets us defer (ha!) tougher questions.

mppf commented 4 years ago

@bradcray - thanks for your thoughts.

If it were me, I'd probably take the approach of requiring a = defer[red]; initialization initially, and look into relaxing it later

Right, but relaxing it would be a breaking language change (because user records would have a different pattern of init copies / assignments). Part of what I'm trying to do on the implementation front is to figure out what the issues are (both for implementation and language design). By experimentally enabling the split initialization by default, I can get more information about patterns that cause problems. In any case, I will see how far I can go on that path and I'm not stuck to that strategy.

If the array is local (not distributed), that suggests that the array's elements should be stored on the locale on which it was declared.

Does that mean that - if we allow returning from on statements - then the array would be stored on the locale on which it was created (and not the locale where it ends up)?

Anyway, ignoring the returning-from-on issue for the moment, I can see 2 possible directions to make the example work as you are suggesting. Below I've included the example again.

(For now I am assuming that the compiler can identify B = g() as the initialization; we can consider requiring the user to write B init= g() or something to identify it but I think that is a separate question).

  var B: [1..10] int = defer; // or implicit or whatever syntax we choose
  on Locales[1] {
    B = g(); // how to make B's elements end up on Locale 0 ?
  }

Here are the possible directions:

1. Call a special function for split initialization. This is most similar to the draft noinit design. Space for B's elements would be allocated on Locale 0 in the first line. Then, in the assignment line, the compiler would arrange to call a special assignment function. It would need to be a special assignment function because it needs to copy-initialize into the elements rather than assign them (because the elements are not initialized). A straw proposal for this would be to have proc =( ref lhs, rhs, param lhsWasDeferredInited: bool). An alternative would be to have it be proc init=(rhs, param deferredInitThis:bool) where the partially initialized object would be communicated through the this argument.
2. Instead of changing B = g() into just a move, the compiler could call a data-structure-defined function. In the case of these arrays, the data-structure-defined function would notice that this was on a different locale from the data and take steps to address that. This could go as far as adding something like C++ move constructors to Chapel - but I would advocate limiting it to cases where a value is moved across locales (so it would not need to be called for normal returns, say). Three straw proposals are to add proc move=(lhs) or to add proc postmove() or proc postmove(fromLocale:locale). These would be called only for a move that (potentially) crosses locales. (Or for values with runtime types; see below).

Here are what I consider to be the key questions we'd ask to choose between these:

do we want a strategy for records to handle moves across locales outside of split initialization?
in the array case, should B's elements be allocated at B's declaration or at the time of B = g() in the example?
is deferred initialization the same feature as noinit or a different one?
can both strategies work for values with runtime types?

more about moves across locales

Historically, when writing the string data type, its author really wanted to use a narrow c_ptr field for the buffer. But the compiler moved strings across locales without informing the data type in any way, and as a result a workaround was applied - string effectively uses a wide pointer for the buffer.

Some of the ongoing Sort work relies on being able to PUT/GET records to "move" them temporarily during the sort. One nice feature of something like postmove() is that it can be called in bulk on the elements (possibly even after the sort) and in a way that does not interfere with bulk transfer of many elements. But, we could also simply decide that such moves can happen to records in Chapel and that there is no "hook" to handle it.

more about where allocation occurs

We'll consider a code example that also has a deferred array initialization within the same locale, and compare the behavior of the two strategies.

In strategy 1:

  var A: [1..10] int = defer; // elements allocated here
  A = g(); // other elements allocated within g(); then these are copy-initialized into A

  var B: [1..10] int = defer; // elements allocated on Locale 0 here
  on Locales[1] {
    B = g(); // other elements are allocated on Locale 1 in g; these are copy-initialized into B on Locale 0
  }

In strategy 2:

  var A: [1..10] int = defer; // nothing is allocated here
  A = g(); // other elements allocated within g(); then the result of g() is moved into A

  var B: [1..10] int = defer; // nothing is allocated here
  on Locales[1] {
    B = g(); // other elements are allocated on Locale 1 in g; then the array `postmove` allocates these elements within B and sets them
  }

is noinit the same as deferred initialization or different?

As I understand it, noinit was originally motivated by a desire to allocate arrays without doing first-touch to initialize elements. I think it's a feature that is primarily motivated by arrays. In contrast, the deferred initialization feature request is primarily motivated by non-nilable types (or, more generally, types that cannot be default-initialized).

Additionally, I think it was expected for noinit that one would be able to set specific elements. In contrast, I would expect deferred initialization to only work when the entire array is initialized.

In other words, I would expect this to be a legal noinit scenario:

var A: [1..10] int = noinit;
forall i in 1..10 {
  A[i] = i;
}

while I would expect a deferred initialization strategy would require syntactically setting the entire array:

var A: [1..10] int = defer;
A = forall i in 1..10 do i;

Lastly, in noinit designs, we have always had problems with records. For example:

config const elt = 1;
var A: [1..2] MyRecord = noinit;
A[1] = new MyRecord(1); // is this assignment or copy-initializing/moving the element?
A[elt] = new MyRecord(i); // copy-initialize if elt==2 but assign if elt==1 ?

I know of these strategies for a noinit design to address these problems with records:

track at runtime which elements have been initialized and handle that ... somehow
only apply to POD types
syntactically differentiate between initialization and assignment (i.e. make it the user's problem)

I think the last of these would make noinit a feature only accessible to advanced users.

Additionally, https://github.com/chapel-lang/chapel/issues/8792#issuecomment-447048125 points out that the noinit design also needs a way to identify when the initialization is complete. If the compiler cannot infer this (e.g. by requiring something like A = forall ...) then we will need a user-facing way to indicate it (e.g. A.complete()).

And, related issue #14273 discusses a strategy for marking the block in which an array is initialized element by element.

runtime types

What if we modify the example to have two arrays with different domains with compatible bounds?

const D1 = {1..10};
const D2 = newBlockDom(1..10);

proc g() {
  var X: [D1] int;
  return X;
}

proc f() {
  var Y: [D2] int = defer;
  Y = g();
}

The main question here is what Y = g() does. In strategy 1, it copy-initializes to the array allocated at Y's declaration point and this works in a straightforward manner with runtime types.

For strategy 2, it will need some modification to handle this case. Here, in Y = g(), we imagine that Y is set to the result of g() and then Y.postmove() is called. Now Y has the wrong runtime type if it came from g(), so Y.postmove() would need to receive the declared runtime type of Y somehow. It would notice that the current runtime type of Y and the declared runtime type are different; as a result it would allocate the elements with that runtime type and then copy-initialize them. This is actually not very different from the steps taken when Y is declared on a different locale from the initialization - mainly, it just adds an additional case to check.

postmove would need to be able to accept the runtime type of Y somehow. Since runtime types are only used for arrays and domains in Chapel, I think it would be most straightforward to simply add a domain argument to array.postmove and a distribution argument to domain.postmove. (This could alternatively be done by expecting this.type to include the runtime type or by adding a type argument).

Arguably such a strategy would help with the non-deferred initialization case as well

  var Y: [D2] int = g();

which the compiler currently handles by converting it to default-init and then assign.

bonus topic: init= across locales

The above examples focus on the case in which the compiler transforms initialization into a move but it will use init= in other cases. What would that look like?

var otherRecord: MyRecord;
var r = defer;
on ... {
  r = otherRecord;
  // calls r.init=(otherRecord);
  // init= method is responsible for any special handling if `this` is remote
}

@benharsh points out that this might make the this arguments to init= need to be wide, which might not be ideal.

mppf commented 4 years ago

@benharsh and I were discussing the option of simply not allowing deferred initialization across on statements.

Earlier, I've wanted deferred initialization across on-statements as a way to work around issue #13587 (can't return from an on statement):

proc makeTheThingThere() {
  on Locales[there] {
    return new owned C(); // not currently allowed (see issue 13587)
  }
}

I could change the above into this:

proc makeTheThingThere() {
  var ret: owned C; // error: can't default initialize non-nilable owned
  on Locales[there] {
    ret = new owned C();
  }
  return ret;
}

but that fails because ret can't be default-initialized. So I might imagine using deferred initialization:

proc makeTheThingThere() {
  var ret: owned C = defer;
  on Locales[there] {
    ret = new owned C();
  }
  return ret;
}

@benharsh suggested that perhaps we simply need to use the on somewhere var x = ... variable declaration syntax, as in:

proc makeTheThingThere() {
  on Locales[there] var ret = new owned C();
  return ret;
}

I think this could work. It is even possible to handle a critical section. Suppose we wanted to write this:

proc makeTheThingThere() {
  var ret: owned C = defer;
  on Locales[there] {
    lock();
    ret = new owned C();
    unlock();
  }
  return ret;
}

but could not because split initialization does not work across on statements. What would we write?

proc makeTheThingThere() {
  on Locales[there] var ret = makeTheThingHere();
  return ret;
}

// option 1: using defer { }
proc makeTheThingHere() {
  lock();
  defer { unlock(); }
  return new owned C();
}

// option 2: using split initialization - no 'on' statement involved here though!
proc makeTheThingHere() {
  var ret: owned C = defer;
  lock();
  ret = new owned C();
  unlock();
  return ret;
}

bradcray commented 4 years ago

but relaxing it would be a breaking language change

To be clear, when I say "later" I'm not saying "several releases from now." I'd also be fine with committing to = defer sooner (now?) as I think it has the advantage of being more explicit and avoiding some of the understandable and subtle "What does this mean?" types of questions for users and readers of Chapel code that you raise above. It also has the advantage of making cases that have been errors for years (or that we've said would be errors) not suddenly cease to be errors.

Does that mean that - if we allow returning from on statements - then the array would be stored on the locale on which it was created (and not the locale where it ends up)?

I think I'd need to see an example to answer that. If the expression was return [1, 2, 3]; then I think that literal array would be created on the locale where the return was executing, but then be copied out to the array to which the call was being assigned. If the array was declared outside the on-clause, I'd expect its elements to be stored on that locale regardless of where a return of the array was located.

As I understand it, noinit was originally motivated by a desire to allocate arrays without doing first-touch to initialize elements.

No, I don't think that's right (though that may have come up later). I think it was created for users who didn't want to spend any time zeroing out their arrays if they were going to initialize them themselves somewhat later. So I think it is reasonably close to deferred initialization (but only in terms of execution time savings, not trying to have declarations and initializations in different scopes).

I think it's a feature that is primarily motivated by arrays.

Yeah, and by extension data types containing arrays (or other large collections).

Additionally, I think it was expected for noinit that one would be able to set specific elements.

I think that's accurate. The idea was that any read of any array element that the user hadn't written was their own fault. This was thinking only in terms of arrays of numeric data, so we weren't thinking about copy initialization vs. assignment types of issues in arrays of records, say. That said, I think we could choose to unify the two in this regard (require whole-array initialization; or potentially treat arrays of records differently than arrays of values).

I think the last of these would make noinit a feature only accessible to advanced users.

I think it would definitely be reasonable to consider noinit an advanced user feature, but in saying that, I'm not necessarily advocating for that option.

@benharsh and I were discussing the option of simply not allowing deferred initialization across on statements.

That sounds like a very attractive simplification to me.

lydia-duncan commented 4 years ago

I think the last of these would make noinit a feature only accessible to advanced users.

I think it would definitely be reasonable to consider noinit an advanced user feature, but in saying that, I'm not necessarily advocating for that option.

Apologies if I've missed this, but how does deferring initialization impact accesses to the variable that was deferred prior to when it is given a proper value? Will we have compiler analysis that always throws an error if it occurs?

If not, I would object to enabling either defer or noinit without explicitly opting in. Part of why we default initialize variables is because it is safer for users - choosing to do otherwise enables the user to potentially access bad memory. In my opinion, that makes performing either action an advanced feature and something that should be explicitly opted into.

bradcray commented 4 years ago

how does deferring initialization impact accesses to the variable that was deferred prior to when it is given a proper value? Will we have compiler analysis that always throws an error if it occurs?

Any reads of the variable before it was initialized would be flagged as illegal by the compiler.

This shows up in the original issue's request as:

For a non-nilable class type, the initialization has to occur before it can be used. In Case 4, inserting a use of x before it is initialized is illegal.

mppf commented 4 years ago

OK, so I've gathered what I wanted to from the implementation effort as input for this design question. (Namely: that the implicit strategy passed enough testing that I am confident that I've identified the language design issues here, at least for "normal" code). So, here I will summarize the language design options ahead of us and show pros and cons.

I'm expecting that we will (eventually) support this feature for type, param, const, var, ref, and const ref declarations.

The sections below discuss syntax variants. No matter the syntax variant, the compiler will analyze how the variable being split-initialized is used. The compiler must be able to identify the point(s) at which the initialization takes place (since it is not at the declaration point). Uses of the variable before that point will cause an error (or, for the implicit strategy, cause default initialization if that is possible). The compiler will be able to identify the point of initialization within blocks and conditionals but not within on, tasks, loops, or called functions. For conditionals, both sides of the conditional need to contain a point of initialization.

On to the 3 syntax variants with pros and cons. Note that I only show var a, type t, and const r in order to simplify the discussion. The 3 syntax variants are:

1. decorate the declarations of split-initialized variables with e.g. var a = defer.
2. decorate the initialization points of split-initialized variables with e.g. a init= 2.
3. make split initialization available with existing syntax, aka "implicit".

= defer

var a = defer;
a = 2;

type t = defer;
t = int;

const r: MyGenericRecord = defer;
r = new MyGenericRecord(1);

Pros:

Only changes behavior of programs that were a syntax error before; variable declaration errors are still errors

Cons:

Users will need to learn this syntax before solving certain nilability puzzles
defer has a different meaning than in a defer block e.g. defer { unlock(); }
Does not distinguish between initialization and assignment

We might consider var a = noinit as an alternative syntax for this one, but it is unclear to me if split initialization solves enough of the motivating case of noinit for arrays.

init= operator

var a;
a init= 2;

type t;
t init= int;

const r: MyGenericRecord;
r init= new MyGenericRecord(1);

Pros:

Only changes behavior of programs that were a syntax error before; variable declaration errors are still errors
There might be other useful applications of init= as an operator in the future

Cons:

Users will need to learn this syntax before solving certain nilability puzzles
t init= int seems awkward, syntactically
we don't normally describe params or types as being "initialized" and so init= seems like a strange choice for them
init= in this context means "initializes with" and not "copy initializes with". In particular, it does not mean that a record's init= function will be called (e.g. in the r init= new MyGenericRecord(1), the only initializer called here is proc MyGenericRecord.init(arg)).
We might in the future decide that var c:MyClass; can be initialized by a call to a function with an out argument - e.g. setIt(c). If we did, we would need different syntax to decorate the out argument.

Note that we could consider a := 2; as an alternative syntax for a init= 2;.

existing syntax

var a;
a = 2;

type t;
t = int;

const r: MyGenericRecord;
r = new MyGenericRecord(1);

Pros:

Solution to nilability puzzles is more intuitive and available to beginners
The Chapel language supports more optimal code (even from beginners)
Implicitness is symmetric with the expiring values proposal (#13704). In particular, expiring values will not need to be decorated in a special way for, say, copy elision, to occur. Choosing this option, along with the proposal in #13704, would basically be saying something like "The language has rules to optimize when values are initialized and deinitialized".

Cons:

Does not distinguish between initialization and assignment
Can change the default-init/assign pattern for existing code (but this has apparently zero impact on the tests that exist, based on a prototype)
Initialization vs assignment can change with changes to surrounding code (e.g. adding a writeln)

Lastly, the discussion from https://github.com/chapel-lang/chapel/issues/14271#issuecomment-546388943 about runtime types is still relevant - however I think we can ignore those issues while considering this language design (since we can think of arrays / domains as the only types with runtime types as being "special" and then implement whatever we need to in order to get it to work). The strategy of not allowing split initialization with on addresses the other portions of that comment.
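A sketch of the conditional rule under the "existing syntax" variant, assuming a compiler that implements it. MyClass and someCondition are the placeholder names from the original request; the flag field is added here only so the result can be printed. Both branches contain a point of initialization and x is not used before either of them:

```chapel
class MyClass {
  var flag: bool;
}

config const someCondition = true;

proc main() {
  var x: owned MyClass;          // non-nilable: cannot be default-initialized
  if someCondition {
    x = new owned MyClass(true);   // point of initialization (then branch)
  } else {
    x = new owned MyClass(false);  // point of initialization (else branch)
  }
  writeln(x.flag);               // first use comes after both points
}
```

Dropping the else branch, or reading x before the conditional, is the kind of code the analysis described above would reject for a non-nilable type.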

mppf commented 4 years ago

It probably comes as a surprise to nobody, but I pretty heavily favor the "existing syntax" option. However I've tried to give the other options a fair shake in the above.

cassella commented 4 years ago

t init= int seems awkward, syntactically

Would you consider t := int as an explicit initialization? This also doesn't suggest that init=() would be called.

mppf commented 4 years ago

@cassella - sure, we can consider t := int as meaning initialization, as distinct from assignment. I'll make a note of it in the summary.

bradcray commented 4 years ago

I think that's an intriguing proposal, though I think if we were to do it, we should've also named the copy initializer := rather than init=. The main downsides to the proposal that I can think of are that := has meant assignment in some key traditional languages (Pascal, Modula) and that the : in Chapel tends to connote type specifications or casts, which don't really play a role in this operation.

ben-albrecht commented 4 years ago

At a high-level, I have not settled on an opinion between whether initialization vs. assignment is something we want to provide explicit syntax for (proposals = defer & init= ) or abstract away from the user (proposal existing syntax).

Of the proposals above, I like the = defer and existing syntax options.

I like = defer due to being explicit and relatively intuitive (someone with little knowledge of Chapel can probably figure out what is going on). However, I don't think we can overload the defer keyword due to the defer block having a completely different meaning/purpose. If we were to pursue this, I think we'd want to explore other options for the keyword.

I like the existing syntax proposal, but it almost feels too good to be true. I have a gut feeling that there are cases where this abstraction will leak and bite a user that is not aware of what is actually happening w.r.t. assignment vs. initialization. However, I don't have any concrete examples to back this up. Seeing the changes required to get to 0 test failures on @mppf's branch could dispel those concerns.

Beyond the cons already stated, I am not a fan of the init= or := syntax. init= is more clear, but I agree with @bradcray that it is awkward. The walrus operator (:=) is cleaner, but has a lot of different meanings across languages, making it far less intuitive.

e-kayrakli commented 4 years ago

I like = defer (or some other keyword) the best. It feels to me that this is one of the areas where I prefer being explicit for the sake of code-reader rather than easy to write for the sake of code-writer.
I wouldn't have any problem with the "existing syntax" approach. However, it'd take me some time to not be thrown off by seeing var x; declarations around.
I don't like init= and := on many different levels. The most fundamental one is that it'd make me do something special at the point where I am assigning to a variable the first time. Assignment is such a trivial reflex that having to think about doing something special is too much mental energy for me. I strongly believe that if we are going to do something syntactical about this (like = defer), it has to happen at variable declaration only. On a more personal level, I find x init= y too ugly and x := y already overloaded, as pointed out by @ben-albrecht.

@mppf -- one thing that I want to understand more are the "nilability puzzles" that you allude to in your proposal. For example, comparing = defer and "existing syntax" approaches, what would be more difficult/unexpected with = defer that would be easier with "existing syntax"?

mppf commented 4 years ago

one thing that I want to understand more are the "nilability puzzles" that you allude to in your proposal. For example, comparing = defer and "existing syntax" approaches, what would be more difficult/unexpected with = defer that would be easier with "existing syntax"?

The only thing there is that users will have to figure out that they can use the feature. I'd argue that is easier with the "existing syntax" approach (since they don't have to learn about the keyword defer or whatever it is; their code might actually "just work").

I don't feel like I have a great nilability puzzle example, but that was the original motivation for this issue being filed. Here is one (perhaps too simplistic) example:

  var x: owned MyClass;
  {
    // block to make sure variables generated when computing `input` are destroyed
    var input = ...;
    // lots of code that computes `input`
    x = new owned MyClass(input);
  }
  // code using `x`

This kind of code used to work before the nilability changes. Now, if you try to move var x down to the new statement, you can't, because it's inside a block, so the later code won't work. You could change the structure of the code or introduce nilable versions of some variables... but this can feel awkward. (There might be a particular reason the code has a particular block / control flow structure).
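
For concreteness, here is a filled-in sketch of that pattern as it would read under the "existing syntax" option. The MyClass definition, its value field, and the arithmetic that computes input are illustrative assumptions, not taken from any real code; the sketch also assumes a compiler that implements the split-initialization rules discussed in this issue.

```chapel
// Hypothetical class, used only for illustration.
class MyClass {
  var value: int;
}

proc main() {
  // Non-nilable, so there is no default value to fall back on.
  var x: owned MyClass;
  {
    // Temporaries used to compute `input` are destroyed at the end of this block.
    var input = 2 * 21;
    // First mention of `x` after its declaration: treated as its initialization.
    x = new owned MyClass(input);
  }
  // `x` is initialized on every path that reaches this point.
  writeln(x.value);
}
```

Inserting any use of x before the inner block would make the declaration require default initialization, which is an error for a non-nilable class type.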

e-kayrakli commented 4 years ago

I see what you mean now, thanks for the clarification.

I still stand where I was: mild preference for =defer, OK with all implicit, against changing syntax for first assignment (init= or :=)

bradcray commented 4 years ago

Is the implication of this aside:

(or, for the implicit strategy, cause default initialization if that is possible).

that if I wrote this code:

var x;
writeln(x);
x = 42;

the compiler would insert a default initialization prior to the writeln() causing 0 to be printed out rather than a use-before-def error?

mppf commented 4 years ago

Just a note - I don't think any of the proposals will allow deferred initialization of globals without additional work (I'll need to make sure to disable that in the prototype branch).

E.g.

record R { var x: int = 42; }

var globalRecord:R = defer; // or whatever the syntax is
f();
globalRecord = new R(2);

proc f() {
  writeln(globalRecord);
}

mppf commented 4 years ago

(take 2)

Is the implication of this aside:

(or, for the implicit strategy, cause default initialization if that is possible).

that if I wrote this code:
var x;
writeln(x);
x = 42;
the compiler would insert a default initialization prior to the writeln() causing 0 to be printed out rather than a use-before-def error?

Almost. If the user code were

var x:int;
writeln(x);
x = 42;

then it would compile and print out 0. But the code you showed uses var x; and so the compiler doesn't have an option to use default initialization in this case. (I suppose we could consider allowing that, but I'd view it as a future-work-extension).

bradcray commented 4 years ago

OK, that's reassuring, thanks.

mppf commented 4 years ago

Seeing the changes required to get to 0 test failures on @mppf's branch could dispel those concerns.

The branch implementing "existing syntax" is here https://github.com/mppf/chapel/tree/split-init and now has 0 failures in standard (local) testing. I wouldn't describe anything I had to fix as surprising, with the minor exception of one AMR test (it was using an unsupported tuple idiom - there the mystery is why it was working on master - but the code was clearly wrong).

cassella commented 4 years ago

the compiler would insert a default initialization prior to the writeln() causing 0 to be printed out

The implications of this for records give me a bit of pause.

var r: R;
r = new R(7);

If I follow, this would be a deferred initialization, followed by initialization via R.init(7). But with a writeln(r) between the statements, it becomes a default initialization via R.init(), then the writeln, then initializing a new record via R.init(7) and invocation of the assignment operator? (And eventually deinitialization of the new R(7) separately from that of r.)

It feels a little squirrely that adding the writeln changes what function is being invoked on the next line and how many objects there are. (Which in a real program may actually be dozens or hundreds of lines later.)

Coming from my C background where things are explicit, I'd expect that if you're trying to defer r's initialization to later, it should be uninitialized until you initialize it.

the compiler would insert a default initialization prior to the writeln() causing 0 to be printed out

Is that default initialization inserted actually just prior to the writeln(), or at the point of declaration?

var r: R;
if (something) {
  writeln(r);
  r = new R(7);
} else {
  writeln("hello world");
  r = new R(7);
}

Does the else block contain an initialization of r or an assignment? If R.init() and R.init(int) have different side effects, does the former happen when !something? If the writeln(r) is removed, does that change what's executed in the else block? (Or even, prior to the if()?)

cassella commented 4 years ago

Would this be legal?

var r: R;
for i in 1..10 {
  r = new R(i);
  ...
}

If it's legal, is it initialization each iteration, or assignment (default initialized before the loop)? Or initialization the first iteration and assignment the rest?

mppf commented 4 years ago

@cassella -

Re.

var r: R;
r = new R(7);

var r: R;
writeln(r);
r = new R(7);

Indeed, the 2nd one would use default-initialization and then assignment (with the "existing syntax" option).

It feels a little squirrely that adding the writeln changes what function is being invoked on the next line and how many objects there are. (Which in a real program may actually be dozens or hundreds of lines later.)

I think this is a reasonable "Con" so I've added it in my summary above.

My counter-argument to it is this: most likely programmers will view it with something like "The language has rules to optimize when values are initialized and deinitialized". In other words, if initialization / assignment / copy-init / deinit are subject to "optimization", then it is not surprising that changing code in a function will change whether the "optimization" occurs. (Please note that we expect to describe the rules for this in the spec and we will consider it applying language rules that result in optimization rather than allowing the compiler to make any transformation it wants while preserving some property. In this way it is different from typical optimizations).

In fact it is currently the case that changing code in seemingly trivial ways will change whether or not = is called. Note also that the plans in #13704 will change when copy-init and deinit are called (as compared to today) in ways that arguably favor optimizeability over it being immediately obvious in every case.

For example, with 1.20:

proc f() {
  return new R();
}
var x = f(); // no copy init or assignment

proc f() {
  return new R();
}
var x:R;
x = f(); // assignment

But I would argue that these two programs are similar enough that they should behave the same. In fact that is what the "existing syntax" proposal does. It makes more sense to me to make these two behave the same than to make the writeln case above behave the same.

If we choose the "existing syntax" option, we will be saying that we don't think it actually matters in most cases in practice for programmers to distinguish between assignment and initialization for local variables, and that well-behaved record types will support both = and copy-initialization in a consistent manner. I think that this theory is supported by my branch that gets the test suite to pass with 0 failures with this approach. (Most of the changes to existing tests on that branch have to do with tests that are trying to test = to make sure that they actually invoke it). For any unusual cases where it does matter, programmers will be able to reason about the rules the compiler applies and make their code do one or the other.

Summing all of that up - I think the language design should optimize for the common case, and I think that the common case is that record types support = and other forms of initialization in a consistent manner - in other words it is common that the difference between = and initialization is not important. I'd rather have the ability for the compiler to make some of these programs faster (default-init/assign is slower than just initialization, especially for large arrays, say) than to have language rules more similar to C++ in this area.
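
To make the writeln case concrete, here is a small sketch that makes the two behaviors observable. The record R with printing initializers is an assumption for illustration, and the comments describe what the "existing syntax" rules are expected to do on a compiler that implements them, not verified output.

```chapel
// Hypothetical record whose initializers announce themselves.
record R {
  var x: int;
  proc init() {
    this.x = 0;
    writeln("R.init()");
  }
  proc init(x: int) {
    this.x = x;
    writeln("R.init(", x, ")");
  }
}

proc splitInit() {
  var r: R;        // no mention of r before the assignment below
  r = new R(7);    // expected to act as the initialization: only R.init(7) runs
  writeln(r.x);
}

proc defaultThenAssign() {
  var r: R;        // r is read before being set, so R.init() is expected to run here
  writeln(r.x);
  r = new R(7);    // a temporary is built with R.init(7), then assigned into r
  writeln(r.x);
}

proc main() {
  splitInit();
  defaultThenAssign();
}
```

Both procedures end with r.x equal to 7; the only difference is whether the default initializer runs first, which is the distinction argued above to be unimportant for well-behaved record types.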

mppf commented 4 years ago

@cassella - answering your other questions

the compiler would insert a default initialization prior to the writeln() causing 0 to be printed out

Is that default initialization inserted actually just prior to the writeln(), or at the point of declaration?

At the point of declaration.

var r: R;
if (something) {
  writeln(r);
  r = new R(7);
} else {
  writeln("hello world");
  r = new R(7);
}
Does the else block contain an initialization of r or an assignment?

Assignment. Because of the writeln(r) in the if block, the compiler knows it cannot defer-initialize r for the if block and as a result it cannot defer-initialize r at all.

Would this be legal?
var r: R;
for i in 1..10 {
  r = new R(i);
  ...
}
If it's legal, is it initialization each iteration, or assignment (default initialized before the loop)? Or initialization the first iteration and assignment the rest?

Sure, if r can be default initialized, that is what would happen (and the loop would use assignment). The initialization point for split/deferred initialization can't be in a loop. (The initialization point can only be within regular blocks or conditionals).

gbtitus commented 4 years ago

I don't have a preference either way with regard to syntax, but I'm wondering: have you considered the interaction between first-touch and comm layer memory registration, particularly for array memory with configurations that require that in multi-locale execution? Or is that outside the scope of this particular issue?

Context: for certain configurations, for example comm=ugni on Cray XC systems, at present we wait to register array memory until after the default initializer runs, in order to improve the NUMA locality of array memory. Registration will NUMA-localize memory that hasn't been touched, but will not change the locality of memory that has been touched. Thus the locality produced by the default initializer is the one that holds. We could make this work with deferred initialization too, but we'd have to ensure the registration actually did occur before the memory could possibly be remote-referenced.

mppf commented 4 years ago

@gbtitus - yes I have considered that, based on https://github.com/chapel-lang/chapel/issues/8792#issuecomment-447048125 (I brought up much earlier in this issue in a comment talking about how split/deferred initialization is different from the noinit we imagined).

At the very least, these split-initialization strategies have the property that the compiler identifies a (possibly later) initialization point. For memory registration and arrays, we would just need to register the memory after that initialization point. This could even be done in the array initializer.

It is the case that there might be scenarios where an array initializer might not be parallel (e.g. creating an array from a for loop) and so as a result we will have different first-touch behavior from today. However these seem addressable as well (e.g. the code going from a serial iterator to an array could touch the memory first, if we wanted).

We could make this work with deferred initialization too, but we'd have to ensure the registration actually did occur before the memory could possibly be remote-referenced.

These designs don't allow access to a variable that isn't initialized (in contrast to the original noinit designs which do). As a result I don't think it'd be possible to remote-reference the memory before it is initialized and registered (since it can't be referenced at all until it is initialized).

gbtitus commented 4 years ago

... there might be scenarios where an array initializer might not be parallel (e.g. creating an array from a for loop) and so as a result we will have different first-touch behavior from today. ...

We might be able to address that via documentation: "On NUMA architectures when not using the "numa" locale model, initialization not only assigns initial value(s), it also sets NUMA locality. ...".

vasslitvinov commented 4 years ago

Background: I like the general principle of allowing the compiler to replace (default initialization followed by an assignment) with (no initialization followed by an init=). This should be allowed when the LHS is not referenced between the initialization and the assignment, or whatever criterion we are considering for split initialization. I would like the compiler to have the freedom to do this replacement (when allowed) or not, although I do not insist on this freedom. The user should have a way to opt-out of such replacements for a particular variable or perhaps for a particular type.

I like the "implicit" option because I see it as an instance of the above principle.

This principle also justifies applying split-initialization in non-trivial codes, as discussed by @cassella above. What if the declaration and the deferred initialization are separated by hundreds of lines of code? What if split-initializability differs in the true vs. false branches of a conditional? With the "at the discretion of the compiler" principle, the programmer would be prepared that the default initialization may or may not occur at the variable declaration statement. They would not try to analyze complex code precisely, trying to predict which way it will go.

I am also sympathetic with the = defer + init= syntax because it offers precise expression of the programmer's intention. I am perfectly OK if this feature is not readily available for beginners. However, I expect the syntactic overhead to be annoying, and I convinced myself that the "implicit" option will work just fine and will feel nicer to use.

vasslitvinov commented 4 years ago

Here are a couple of deferred-initialization scenarios that I found very desirable to support.

Initializing array elements in a forall statement. This is better than a forall expression because the user may want to co-locate initializing an array with other computations and/or initializing other array(s):

var A1, A2: [D] ... = defer;  // explicit syntax for clarity

forall i in D {
  compute stuff;
  A1[i] init= ...;
  A2[i] init= ...;
}

Initializing multiple fields with a single helper function. This is inspired by one of our tests where we had to inline the helper functions or make other adjustments when tightening up initializers (or, back then, constructors):

class C {
  var f1, f2, f3, g4, g5, g6...; // a bunch of fields, with types or not
}

proc C.init() {
  initFs(f1,f2,f3);

  if ... then
    initGs_1(g4,g5,g6);
  else
    initGs_2(g4,g5,g6);
  ...
}

mppf commented 4 years ago

Here are a couple of deferred-initialization scenarios that I found very desirable to support.

I'm not sure we need to support these immediately.

Initializing array elements in a forall statement. This is better than a forall expression because the user may want to co-locate initializing an array with other computations and/or initializing other array(s):

This one is challenging because, as I was discussing with @gbtitus above, the compiler/runtime needs to be able to identify a specific moment after which the array is initialized. For this reason, I don't think = noinit and A[i] init= b are sufficient. The demands of this case seem sufficiently different - I think we should consider it as a separate problem to add feature(s) for in the future. (#14273 is one proposal along these lines that goes in a very different direction from anything here).

Initializing multiple fields with a single helper function. This is inspired by one of our tests where we had to inline the helper functions or make other adjustments when tightening up initializers (or, back then, constructors):

If all the types were declared, AFAIK this one can be handled with the out intent. Perhaps we could consider allowing the compiler to infer the type for out intent formals in a manner similar to inferring the return type. But, like the array case, this seems to me to be a different enough case that we need not tie it to the current design question.
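
As a rough sketch of the out-intent approach with declared types, shown with local variables rather than fields (whether fields can be passed this way inside an initializer is not settled here). The helper name initPair and the values it assigns are hypothetical.

```chapel
// Hypothetical helper that sets several variables at once through out formals.
proc initPair(out a: int, out b: real) {
  a = 1;
  b = 2.5;
}

proc main() {
  var a: int;
  var b: real;
  // Both variables receive their values from the helper's out formals.
  initPair(a, b);
  writeln(a, " ", b);
}
```

Because int and real are default-initializable, this compiles whether or not the call is treated as the variables' initialization point; the open question is whether the compiler may skip the default initialization.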

vasslitvinov commented 4 years ago

As a data point, this test:

test/multilocale/diten/needMultiLocales/DijkstraTermination.chpl

has almost 30 occurrences, in its 180 LOC, of accesses to fields of endCount and wakeup that need !. With deferred initialization, they would be of non-nilable types and would not need the !.

bradcray commented 4 years ago

While thinking about explicit vs. implicit deferred initialization over Thanksgiving (I know, how lame), I started getting worried about the following pattern in an implicit deferral world:

var myR: R;
var firstTime = true;

for i in 1..n do
  if (firstTime && someTest()) {
    myR = new R();
    firstTime = false;
  } else {
    ...
  }

Specifically, I was getting myself worried that the compiler couldn't/shouldn't be expected to be smart enough to know that the assignment to myR in the loop could be a legal initialization, so was then worried that it would complain that we hadn't initialized in both branches. Re-reading some of the examples above, though, I'm now thinking that it would actually default initialize myR and then think of the assignment to it in the loop as an assignment. So I think I was worried about nothing (but didn't want to lose the pattern).

mppf commented 4 years ago

@bradcray - right, the compiler wouldn't allow the initialization to be deferred to inside of a loop (even a param loop, actually) and as a result it would view var myR: R as requiring a default initialization (it would do it if possible and error if not).

bradcray commented 4 years ago

Random question that occurred to me today: What is the interaction between deferred initialization and config declarations? E.g., consider:

config const i: int;

i = 10;

Today, the assignment to i would be illegal because the config is a const that would either be initialized to the command-line override's value or to 0 otherwise (as the default for int). In a deferred initialization world, you could imagine that it would either be initialized to the command-line override's value or to 10 otherwise. But somehow that feels more confusing and surprising to me (since an apparently executable statement never gets executed).

Config vars have a similar issue:

config var i: int;

i = 10;

Here, on master, i would be initialized to either the command-line override or to 0, and then reassigned to 10 (since it's a variable). If deferred initialization were supported on configs, though, and the assignment of 10 were considered its initializer, the command-line override would not get overridden.

These instinctively make me think that perhaps deferred initialization should not be supported for configs (or that maybe we should go back to requiring configs to have initialization expressions).

The thing that triggered this for me was Michael's start on looking into deferred param and type initialization, which also support config variants which we used to require until we decided that it was too draconian (that relying on the default value should be reasonable).

mppf commented 4 years ago

These instinctively make me think that perhaps deferred initialization should not be supported for configs (or that maybe we should go back to requiring configs to have initialization expressions).

I don't think it makes sense for config variables/consts/params/whatever to be split initialized. I believe my branch simply disables split initialization for them but I can check.

bradcray commented 4 years ago

OK, I'll be curious. And whether "disables split initialization" means "behaves as it traditionally has" or something different.

mppf commented 4 years ago

OK, I'll be curious. And whether "disables split initialization" means "behaves as it traditionally has" or something different.

The PR does not change config variables or any other global variable. (Global variables cannot be subject to the analysis described b/c the compiler doesn't generally expect to know which globals are used/modified in which functions called).

However thinking about this made me realize there is an issue with inner functions in the implementation, so I'll fix that.

mppf commented 4 years ago

I've noticed four problems with PR #14564:

1. It doesn't correctly avoid init= when the RHS expression is a call returning by value
2. It does not correctly handle early returns in a split init situation (deinit is called when it should not be)
3. It does not enable split-init within a try or try! block
4. The order of deinit is not necessarily appropriate any longer

1 and 2

I view the 1st and 2nd of these as implementation issues that I am working on addressing.

3

The third though would require spec updates as well, but it seems "obvious" to me that we should support it. try and try! are effectively a kind of decorator for a block. So I think that this for example should work:

  var r;
  try {
    r = new R();
  }

An error thrown in this situation is similar to an early return and the compiler can know that the variable hasn't been initialized in that case.

But, if there are catch blocks, it becomes possible for an error to be thrown and leave the variable uninitialized. For example

  var r;
  try {
    r = returnsR(shouldThrow=true); // throws an error
  } catch {
    // r not initialized if returnsR threw an error
  }

Now we might imagine that we could support catch blocks similarly to conditionals, but the problem is that the try block can be escaped at any time (when an error is thrown), and as a result being in a catch block doesn't by itself indicate whether the variable was initialized. For example:

  var r;
  try {
    if something then throw new Error();
    r = new R();
    if somethingElse then throw new Error();
  } catch {
    // r might be initialized or it might not be!
  }

The compiler could take the approach of ensuring that, in the event of an error caught, all variables being initialized in the try section are deinitialized. This is possible with analysis in the try block similar to what we do today. In that case, we could insist that each catch clause also initialize a variable. However I'm not so sure such a feature would actually be useful.
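
A minimal sketch of the catch-less case, assuming split-init is enabled inside try! blocks as proposed in item 3. The function makeValue and its fail flag are hypothetical names. Since v is an int, this compiles whether the assignment is treated as initialization or as default-init followed by assignment.

```chapel
// Hypothetical throwing function, used only for illustration.
proc makeValue(fail: bool): int throws {
  if fail then throw new Error("makeValue failed");
  return 42;
}

proc main() {
  var v: int;
  // With no catch clause, an error here halts the program, so any code
  // after the block only runs when v has been set.
  try! {
    v = makeValue(false);
  }
  writeln(v);
}
```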

4

For the fourth issue - historically, we have deinitialized variables in the reverse order of declaration. But with split-initialization, the variable declaration order and the variable initialization order do not necessarily match:

  var a;
  var b;
  b = new R();
  a = new OtherRecordReferringTo(b);

Now de-initializing in reverse declaration order will lead to running b's deinit first, leaving a potentially referring to freed memory within its deinit call.

What's worse is that the split initialization allows initialization within conditionals, so that the initialization order might not be knowable at compile time:

  var a;
  var b;
  if option {
    b = new R();
    a = new R(b);
  } else {
    a = new R();
    b = new R(a);
  }

To solve this problem, I would propose that the initialization statements for split initialization must initialize the variables in the order that they are declared. I don't think that this has much impact on the expected use cases for split-initialization (where typically there is just one variable of interest).

mppf commented 4 years ago

Maybe in issue 4 we could only require initialization order match declaration order within conditionals. Generally speaking I think it will be fine for split-init to create a different initialization order than declaration order (once we update the compiler to deinit in reverse initialization order according to the initialization points).
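
For illustration, a conditional that satisfies the proposed rule might look like the sketch below: both branches set a before b, matching declaration order. The record R, its id field, and the procedure name build are assumptions made for this example.

```chapel
// Hypothetical record, used only for illustration.
record R {
  var id: int;
}

proc build(option: bool) {
  var a: R;
  var b: R;
  if option {
    a = new R(1);          // a first, in both branches
    b = new R(a.id + 1);   // b may depend on the already-initialized a
  } else {
    a = new R(10);
    b = new R(a.id + 1);
  }
  writeln(a.id, " ", b.id);
}

proc main() {
  build(true);
  build(false);
}
```

With a consistent order in both branches, the order in which a and b were initialized is known at compile time regardless of which branch runs.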

vasslitvinov commented 4 years ago

For Issue 4 above, a related scenario is when the two variables are declared in different scopes. For example:

var a;
if option {
  var b = new R();
  a = new R(b);
} else {
  a = new R();
}

vasslitvinov commented 4 years ago

For 3, it is easiest to disallow split-init within try blocks for now.

For 4, I agree that requiring the same initialization order in both branches of a conditional is the way to go.

While having the order of deinit match the order of declarations is appealing, it does not work in some split-init cases. So yes, let's go for deinit order matching the initialization order.