feiwang3311 / Lantern

BSD 3-Clause "New" or "Revised" License

Implement basic element/slice indexing #15

Open dan-zheng opened 5 years ago

dan-zheng commented 5 years ago

Implement basic, non-strided indexing as described here.

Tensor.apply should be adapted to support these indexing strategies (rather than the current scalar indexing behavior). They add expressivity: they enable minibatching and interesting recursive functions, such as this recursive print:

def print(separator: String = ", ") = {
  if (isScalar) printf("%f", this.data(0))
  else {
    // Recurse over element tensors.
    printf("[")
    for (i <- 0 until this.shape(0): Rep[Range]) {
      if (i > 0) printf(separator) // separator between elements, not after the last
      this(i).print(separator)
    }
    printf("]")
  }
}

Both element tensors and slices should probably be cost-free views into the data of their parent tensor. Otherwise, indexing large tensors (for minibatching) will be highly inefficient. We can implement copy-on-write semantics.
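
For illustration, here is a minimal plain-Scala sketch of the view idea (not staged LMS code; TensorView and its members are hypothetical names, not Lantern's actual API). Element and slice indexing become pure offset arithmetic into the parent's flat buffer, so no data is copied:

// Hypothetical sketch: a view shares its parent's flat buffer and only
// tracks an offset and a shape.
class TensorView(val data: Array[Float], val offset: Int, val shape: Seq[Int]) {
  // Number of scalars covered by one step along the leading dimension.
  def stride0: Int = shape.drop(1).product

  // Element indexing: drop the leading dimension, advance the offset.
  def apply(i: Int): TensorView =
    new TensorView(data, offset + i * stride0, shape.drop(1))

  // Slice indexing over [i, j]: keep the rank, shrink the leading dimension.
  def apply(i: Int, j: Int): TensorView =
    new TensorView(data, offset + i * stride0, (j - i + 1) +: shape.drop(1))
}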

feiwang3311 commented 5 years ago

I am working on this, but there is a problem with range selection. If I use apply(i: Rep[Int], j: Rep[Int]) to select a range, the resulting dimension (j - i + 1) is itself a Rep[Int], which conflicts with the design decision that dimensions are not Rep-typed.

However, this is related to a more general problem: Lantern cannot yet support unknown dimensions (normally used for the batch dimension, which is variable). I think if we do add unknown dimensions, they will have to be Rep[Int].
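
For concreteness, a sketch of the two signatures side by side (the trait and method names are illustrative, not Lantern's API; Rep stands in for LMS's staged-value type):

// Illustrative only: contrasting staging-time and runtime slice signatures.
trait SliceSignatures {
  type Rep[T] // stand-in for LMS's staged-value type
  type Tensor
  // Staging-time slice: j - i + 1 is a plain Int, so shapes stay static.
  def sliceStatic(i: Int, j: Int): Tensor
  // Runtime slice: j - i + 1 is a Rep[Int], so the leading dimension would
  // itself have to become Rep[Int], breaking the static-shape design.
  def sliceStaged(i: Rep[Int], j: Rep[Int]): Tensor
}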

What do you guys think? @dan-zheng @TiarkRompf @GSAir

dan-zheng commented 5 years ago

Could you explain why slice indexing uses apply(i: Rep[Int], j: Rep[Int]) rather than apply(i: Int, j: Int)? Or apply(range: Range), where range.step == 1?

I'm also slightly concerned about the readability of slice indexing, especially when mixed with element tensor indexing.

val tensor = Tensor.ones(4, 4, 4, 4)

tensor(0, 3)(0)(1, 2)(0)
// This is a bit sore on the eyes.
// Visually, `apply(i, j)` doesn't look like a "slice" to me.
// `apply(range)` might be more interpretable.
// tensor(0 until 3)(0)(1 until 2)(0)
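
A possible shape for that overload, continuing the hypothetical TensorView sketch from the issue description (again, not Lantern's actual API):

// Hypothetical: a Range-based slice reads naturally at the call site,
// e.g. tensor(0 until 3). Only unit-step ranges map to contiguous slices.
def apply(range: Range): TensorView = {
  require(range.step == 1, "only contiguous (non-strided) slices are supported")
  new TensorView(data, offset + range.start * stride0,
                 range.length +: shape.drop(1))
}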

Lack of support for unknown dimensions is a real issue. I don't have great ideas for solving that at the moment.

Edit: actually, is this really a problem? When reading a dataset, use a count variable to keep track of the leading dimension, and use it to initialize the final tensor.
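
A hedged sketch of that idea in plain Scala (the file name and tensor constructor are hypothetical): the dataset is read and counted in the staging program, so the leading dimension is an ordinary Int before any tensor is built.

// Illustrative only: count records while reading, then use the count as a
// static leading dimension.
val rows: Array[Array[Float]] =
  scala.io.Source.fromFile("data.csv").getLines()
    .map(_.split(",").map(_.toFloat)).toArray
val count: Int = rows.length // known at staging time
// val dataset = Tensor(rows.flatten, count, rows.head.length) // hypothetical constructor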

feiwang3311 commented 5 years ago

Let's first be sure of the goal. Rep-typed ranges/slices allow slicing at runtime, while static ranges/slices only allow slicing at staging time. What is the goal: slicing with indices known at staging time, or slicing with indices unknown until runtime?

TiarkRompf commented 5 years ago

tensor(0 until 3)(0)(1 until 2)(0) looks nice.

What exactly is the key use case for unknown dimensions? Is it because we want to increase batch size during training?

How does XLA deal with this? I think they also require static, fixed shapes.

In general I think switching shapes from Array[Int] to Array[Rep[Int]] is defensible, but we should be clear about why we need it and make sure it doesn't cause any performance regressions.

dan-zheng commented 5 years ago

@feiwang3311 That's a good question. I'm not sure which is better, but stage-time indexing seems fine to me.

The more important factor regarding performance is to ensure that element tensors and slices have copy-on-write semantics, e.g. they reuse the data of their parent tensor until they're first written to. Otherwise, indexing will cause full element/slice copying every time, which is expensive.
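
For illustration, a minimal plain-Scala sketch of copy-on-write for a flat view (hypothetical names; not a proposal for Lantern's actual implementation). Reads hit the shared buffer; the first write detaches a private copy:

// Illustrative only: a 1-D view with copy-on-write over a shared buffer.
class CowView(private var data: Array[Float],
              private var offset: Int,
              val size: Int) {
  private var owned = false // have we detached a private copy yet?

  def read(i: Int): Float = data(offset + i)

  def write(i: Int, v: Float): Unit = {
    if (!owned) { // first write: copy our region so the parent is untouched
      data = java.util.Arrays.copyOfRange(data, offset, offset + size)
      offset = 0
      owned = true
    }
    data(offset + i) = v
  }
}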

feiwang3311 commented 5 years ago

Runtime slicing is useful for RNNs, whose input is a 3D tensor of shape (timeStep, batch, dimension). At each step i, we process input(i) together with the previous hidden state and produce the next hidden state.
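
In plain Scala, using the hypothetical TensorView sketch from the issue description (rnnCell is a stub standing in for the real cell computation), the use case looks like this; the key point is that input(i) must be cheap because i varies at runtime:

// Illustrative only: input has shape (timeStep, batch, dimension).
val (timeStep, batch, dim, hid) = (10, 32, 128, 256)
val input = new TensorView(new Array[Float](timeStep * batch * dim), 0,
                           Seq(timeStep, batch, dim))
def rnnCell(x: TensorView, h: TensorView): TensorView = h // stub
var hidden = new TensorView(new Array[Float](batch * hid), 0, Seq(batch, hid))
for (i <- 0 until timeStep) {
  hidden = rnnCell(input(i), hidden) // input(i): zero-copy (batch, dim) view
}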

Runtime ranging may not be useful (I don't know), so I'll skip it for now.

Staging-time slicing and ranging may be useful, though I have not encountered a use case yet. Normally for minibatching, we get a mini-batch tensor directly, without needing to slice a mini-batch out.

I am not sure how to implement copy-on-write (a link to some guidance would be helpful). For runtime slicing (as in the RNN case), I think making no copy at all makes sense.

TiarkRompf commented 5 years ago

Let's make sure that anything we do has a concrete use case in one of the benchmark apps. If it's not on the critical path for at least one of them then:

  1. it's not the best use of our time right now
  2. we may not sufficiently understand all the requirements

For example, a good COW scheme takes significant work to get right (in particular, to not slow down read-only uses), so we should approach this only when necessary.

dan-zheng commented 5 years ago

I totally agree that implementing COW is not the best use of our time. For minibatching, slice tensors are not modified, so COW is unnecessary.

(The primary purpose of COW is to enable "value semantics for tensors", e.g. tensor(i) is a new Tensor value that can be modified without affecting tensor.)

If we decide it is a good use of time to implement indexing, we can simply let element/slice tensors have "reference semantics" and share their parents' data. The only consequence is that writing to an element/slice tensor will change the data of its parent, which may be counterintuitive. The actual indexing code will simply perform pointer arithmetic, just like the current slice method.
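
Concretely, with the hypothetical TensorView sketch from the issue description, reference semantics means a write through a view is visible in the parent:

// Shared buffer, no copy: writing through the view mutates the parent.
val t = new TensorView(new Array[Float](16), 0, Seq(4, 4))
val row = t(1)                // view at offset 4, shape Seq(4)
row.data(row.offset + 0) = 1f // t's element (1, 0) is now 1.0 as well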

TiarkRompf commented 5 years ago

Makes sense. Again, let's decide based on what we encounter in the various models.