[Array design] `Array_tabulate` for mutable arrays

matthewhammer commented 4 years ago

Problem

For #865, I'm trying to implement an operation for growable buffers that produces a (fixed-sized) mutable array of the current contents.

In other words, the following type should be inhabited:

mutArrayOfBuffer : Buf<X> -> [var X]

Under the hood, I need this type to be implementable:

[var ?X] -> [var X]

Where the output array may be shorter than the input, and where it contains the (possibly-empty) non-null prefix of the input array.

I think we can agree that this is a reasonable operation; regardless of whether we'd use it, it should be possible to implement it in Motoko.

Unfortunately, it currently is not inhabited, at least not in any way I can see now.

For instance, simply consider the case where the non-null prefix is empty. Now, there are no initialization elements of type X to supply to Array_init, currently the lone way to allocate a dynamically-sized mutable array.

Proposed solution

To fix that, I suggest adding a tabulate primitive for mutable arrays. It's the same idea as the current thing for immutable arrays.

This operation would permit this arrow type to be implemented when I need it in the buffers work. There, I need not search for this non-null prefix's end, and instead I have an invariant about it, and can give this offset to Array_mutTabulate (or whatever we call it).

What do others think?

matthewhammer commented 4 years ago

cc @paulyoung --- I recall you asking about this, or something similar, in the past, IIRC.

matthewhammer commented 4 years ago

cc @nomeata @crusso @rossberg

paulyoung commented 4 years ago

Under the hood, I need this type to be implementable:

[var ?X] -> [var X]

I think this is equivalent to catMaybes in Haskell.

paulyoung commented 4 years ago

consider the case where the non-null prefix is empty. Now, there are no initialization elements of type X to supply to Array_init.

Maybe I'm misunderstanding what "non-null prefix" means, but if it's empty, isn't the resulting array also empty?

It seems trivial to special case that as var xs : [var X]; then (if empty) xs := [].

matthewhammer commented 4 years ago

@paulyoung Good point. It's possible to do this with an additional pass over the array and a special case for the empty array, but it's not very nice.

Here's another example where I get stuck in the same way, with the same ugly workaround:

mutArrayOfArray<X> : [X] -> [var X]

I'd rather just use Array_mutTabulate (or whatever it's called eventually).

paulyoung commented 4 years ago

I'd like to understand this some more. Could you post some code that doesn't type check because of this issue?

paulyoung commented 4 years ago

mutArrayOfArray<X> : [X] -> [var X] has the same type as thaw. Does this approach not work?

https://github.com/dfinity-lab/motoko/blob/914d2cb975e8f1554ff006bcda79c1e2f9a33952/stdlib/array.mo#L110-L121

paulyoung commented 4 years ago

I agree that this is less than ideal.

paulyoung commented 4 years ago

A mutable version of Array_tabulate would definitely have been preferable here.

matthewhammer commented 4 years ago

That's the issue exactly; thanks for sharing the code. I'm trying to avoid re-implementing a bunch of different variants of this awkward idiom (used here in thaw, necessarily) each time I need to construct a mutable array from an existing random-access thing.

nomeata commented 4 years ago

When we added Array_tabulate I think I asked if we want this also for mutable ones, and I believe Andreas argued against it, since you can implement it using Array_init

func Array_tabulate_mut<T>(len : Nat,  gen : Nat -> T) : [var T] {
  if (len == 0) { return [] };
  let xs = Array_init(len, gen 0);
  for (i in range(1,len)) {
    xs[i] := gen i;
  };
  return xs;
};

(Typos and off-by-one-errors left for you).

Yes, a primitive version would be a bit more efficient (less writes to the array), but that probably holds true for any function or data structure written in Motoko…

(I am not personally opposed to Array_tabulate_mut, especially as in the backed the code would be identical, I am just relaying what I think is Andreas’ reasoning here.)

rossberg commented 4 years ago

I guess my criterion for adding a builtin (as opposed to simple library functions) would be that you cannot do it within the language, even in principle, without significant overhead. Not sure that is the case here, as Joachim's sketch shows. Assuming the compiler handles for properly (or it uses while), this doesn't seem worse than what a builtin would do. Or am I missing something?

Aside: @matthewhammer, not sure you really want to implement Buf as an array of options, that seems unnecessarily inefficient (tons of extra allocations!). I think it makes more sense to also require a default element for the constructor. That would also happen to evade the problem.

nomeata commented 4 years ago

Assuming the compiler handles for properly (or it uses while), this doesn't seem worse than what a builtin would do. Or am I missing something?

yes: primitive Array_tabulate_mut would write each array cell exactly once, wheras this code writes each cell twice: First in Array_make, and then again the loop. Whether that is significant I don’t know.

I think it makes more sense to also require a default element for the constructor.

That makes it a bit harder to use in polymophic contexts. I guess you would not want us to add unsafe operations, for use by the standard library (e.g. a dummy : a element inhabiting all types).

You can also put the option around the array: Until you have seen an element of type X, you don’t even allocate the array, and only when you actually see the first one you use this as the default. Has the disadvantage, though, that this element is going to kept alive even when the user has removed it from the “visible” part of the data structure.

rossberg commented 4 years ago

@nomeata:

primitive Array_tabulate_mut would write each array cell exactly once, wheras this code writes each cell twice

Doesn't tabulate have to do the same? You have to pre-initialise the array with something, otherwise hell breaks loose if you run into GC while tabulating.

That makes it a bit harder to use in polymophic contexts.

No harder than passing a custom hash or ordering function to the constructor of other data polymorphic structures. If anything, the nuisance is to come up with good dummy values for certain types, but that is independent of polymorphism.

nomeata commented 4 years ago

Doesn't tabulate have to do the same? You have to pre-initialise the array with something, otherwise hell breaks loose if you run into GC while tabulating.

The backend knows when GC is running, so no problem (at least not at the moment). But good point looking forward when we would have GC at more arbitrary points.

(I guess you could somehow encode in the header how far the array has been initialized and update that as you fill it, but that would amount to the same number of writes, just a bit more localized in the sense that each call is touched only once. But talking about this level of optimization at this point is gettting silly.)

Not convinced that you really never want to create a buffer in a polymorphic situation where you don’t know (yet) whether the type is inhabitated or not (just like we allow [] : [t]), but probably too specialized to worry about now.

rossberg commented 4 years ago

@nomeata: I think something like [] : [t] is only useful with some form of (parametric or subtyping) polymorphism. But Buf is invariant and a value cannot be generic in our language, so you cannot use it like that.

That said, it is still a minor nuisance having to provide a dummy (C++ and friends require it, too, but they have hacks like default constructors etc).

As a variation on your suggestion above, maybe the most efficient implementation is

class Buf<T>(capacity : Nat) {
  assert (capacity > 0);
  var elems : [var T] = [var];  // initially empty
  var count = 0;
  public func add(x : T) {
    if (count == elems.len()) {
      let size = if (count == 0) capacity else 2*elems.len();
      let elems2 = Array.init(size, x);
      for (i in elems.keys()) elems2[i] := elems[i];
      elems := elems2;
    };
    elems[count] := x;
    count += 1;
  }
  ...
}

This needs no option or extra check for every add.

nomeata commented 4 years ago

Good point no need to wrap the [] in an opt if we can just use the empty array.

It still has the problem that it keeps x alive. Not a problem for the grow-only Buf<T>, but it would be a problem for RamBuf<T>.

rossberg commented 4 years ago

I'm not sure about the notion of RamBuf. It's semantics sounds rather like a SparseArray to me. Maybe we should keep that separate?

matthewhammer commented 4 years ago

@nomeata says above:

When we added Array_tabulate I think I asked if we want this also for mutable ones, and I believe Andreas argued against it, since you can implement it using Array_init

Yes, excellent point. This also occurred to me too, but only after making those more dramatic claims above.

I'll use this to solve the impasse, for now.

@rossberg says above:

I think it makes more sense to also require a default element for the constructor. That would also happen to evade the problem.

I disagree, but If you feel strongly, that's fine, I'll follow your lead.

Doesn't tabulate have to do the same? You have to pre-initialise the array with something, otherwise hell breaks loose if you run into GC while tabulating.

I think there are many solutions. For instance, the array's GC header could have a "ready" flag/bit to signal that it's been initialized. It would not traverse an unitialized array. There are probably many other ways to address this than force the programmer to cough up default values of (potentially abstract) types, but the default value solution is certainly the easiest to do now. Will do.

I'm not sure about the notion of RamBuf. It's semantics sounds rather like a SparseArray to me. Maybe we should keep that separate?

RamBuf makes no claims of sparseness, it merely adds random access to Joachim's original design, which lacked these random access (read/write) operations. I'm happy to just merge this all together and forget about Buf vs RamBuf, but the two notions each seem useful to me, and seem self-justified.

matthewhammer commented 4 years ago

So, just to summarize, here's my tentative synthesis of a plan, from the discussion above; comments welcome, of course:

Implement Array_tabulate_mut in the stdlib, but
Require an initial value, and avoid arrays of options
Unclear: Merge RamBuf and Buf, or forget about one of them? (If so, which?)

matthewhammer commented 4 years ago

IMO, we can close this now and continue conversing in #872

rossberg commented 4 years ago

@matthewhammer, see my last comment above, which shows that you can both avoid options and not require a default.

As for RamBuf, if it's just about random access read then I'd just add that to Buf.

dfinity / motoko

[Array design] `Array_tabulate` for mutable arrays #871

Problem

Proposed solution