feiwang3311 / Lantern

BSD 3-Clause "New" or "Revised" License

Implement backend-defined data allocation and transfer ops. #22

Closed: dan-zheng closed this pull request 5 years ago

dan-zheng commented 5 years ago

TODO:

feiwang3311 commented 5 years ago

Thanks, Dan, for the progress.

On Tue, Oct 9, 2018 at 1:38 PM Dan Zheng notifications@github.com wrote:

@dan-zheng commented on this pull request.

In src/main/scala/lantern/dslapi.scala https://github.com/feiwang3311/Lantern/pull/22#discussion_r223796893:

@@ -206,6 +169,21 @@ trait DslExp extends Dsl

+trait TensorOps extends DslOps {
+  object NewGPUArray {
+    // Allocate an array of the specified size on the GPU.
+    def apply[T: Manifest](scalarCount: Rep[Int]): Rep[Array[T]] = gpu_array_new(scalarCount)
+  }
+  def gpu_array_new[T: Manifest](scalarCount: Rep[Int]): Rep[Array[T]] = ???

I see.

The problem is that NewGPUArray is called within TensorDsl, which LanternDriverC mixes in (by way of LanternDriverC <- LanternDriver <- TensorDslExp <- TensorDsl).

We could make a TensorDslCuda trait that extends TensorDsl and mixes in GPUAllocOps; BackendCublas and BackendCudnn would then be defined in TensorDslCuda, so the CPU-only driver never has to provide GPU allocation ops. I can do this sometime; a minimal sketch of the layering follows.
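
A minimal sketch of that layering, using the names from this thread (the Exp-layer wiring and method bodies are elided, so this is an assumption about the shape rather than the final implementation):

trait GPUAllocOps extends DslOps {
  // GPU allocation stays abstract here, so only CUDA-aware stacks provide it.
  def gpu_array_new[T: Manifest](scalarCount: Rep[Int]): Rep[Array[T]]
}

trait TensorDslCuda extends TensorDsl with GPUAllocOps {
  // BackendCublas and BackendCudnn would be defined here, keeping the
  // CPU-only chain (LanternDriverC <- LanternDriver <- TensorDslExp <- TensorDsl)
  // free of GPU allocation ops.
}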


dan-zheng commented 5 years ago

This works!

testGPU("matrix-matrix-dot-transfer") {
  val test = new LanternDriverCublas[String, Unit] {
    @virtualize
    def snippet(x: Rep[String]): Rep[Unit] = {
      backend = BackendCPU()
      val c1 = Tensor.ones(4, 4)
      val c2 = Tensor.ones(4, 4)
      val g1 = c1.toGPU()
      val g2 = c2.toGPU()

      backend = BackendCublas()
      val g3 = g1.dot(g2) // calls `cublasSgemm`
      val c3 = g3.toCPU()

      backend = BackendCPU()
      val expected = Tensor.fill(4, 4, 4)
      Tensor.assertEqual(c3, expected)
      c3.print()
    }
  }
  runTest(test)
}
$ nvcc -std=c++11 -g lantern-snippet.cu -o snippet -lcublas
$ ./snippet dummy
4.0000000000 4.0000000000 4.0000000000 4.0000000000
4.0000000000 4.0000000000 4.0000000000 4.0000000000
4.0000000000 4.0000000000 4.0000000000 4.0000000000
4.0000000000 4.0000000000 4.0000000000 4.0000000000
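
For reference, here is a rough sketch of what a transfer op like toGPU could lower to. It assumes the LMS unchecked escape hatch for emitting raw C; CUDA_CALL and gpuMalloc are hypothetical helpers in the generated preamble, and this is an illustration, not Lantern's actual code:

// Hypothetical: allocate device memory and copy the host array over.
def arrayToGPU(src: Rep[Array[Float]], count: Rep[Int]): Rep[Array[Float]] = {
  val dst = unchecked[Array[Float]]("(float*)gpuMalloc(", count, " * sizeof(float))")
  unchecked[Unit]("CUDA_CALL(cudaMemcpy(", dst, ", ", src, ", ", count,
                  " * sizeof(float), cudaMemcpyHostToDevice))")
  dst
}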

It doesn't quite work for non-square matrices with non-one elements, though: cublasSgemm expects column-major operands, while our CPU implementation stores tensors row-major, so the operands are effectively multiplied in a different order. I'll start to investigate.
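
The usual fix, sketched under the same unchecked assumptions as above (cublasHandle, CUBLAS_CALL, one, and zero are hypothetical names from the generated preamble): a row-major m x k matrix is bit-for-bit a column-major k x m matrix, so passing the operands to column-major cublasSgemm in reverse order computes B^T * A^T = (A * B)^T, which reads back row-major as exactly A * B.

// Hypothetical row-major GEMM wrapper: C (m x n) = A (m x k) * B (k x n).
def sgemmRowMajor(m: Int, k: Int, n: Int,
                  a: Rep[Array[Float]], b: Rep[Array[Float]],
                  c: Rep[Array[Float]]): Rep[Unit] =
  unchecked[Unit](
    "CUBLAS_CALL(cublasSgemm(cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N, ",
    n, ", ", m, ", ", k, ", &one, ",
    b, ", ", n, ", ",  // B passed first, with leading dimension n
    a, ", ", k, ", &zero, ",
    c, ", ", n, "))")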

feiwang3311 commented 5 years ago

Good news! At least it is partially working :)

TiarkRompf commented 5 years ago

Awesome!